Loki Block Disk Space Cleanup Proposal #6876

Open
@Abuelodelanada

Description

Is your feature request related to a problem? Please describe.

Currently, Loki users who are not using object-based storage may accidentally fill all available space on a disk or PVC with logs when using block-based storage. This is especially common with Loki running in small ad hoc Kubernetes clusters such as MicroK8s.

Loki provides endpoints for deleting logs based on dates or queries, but no “automatic” method for keeping block storage usage below a defined threshold.

If block storage fills, Loki will crash (#2314) and require manual intervention. Several issues have been opened requesting a feature to manage this, since existing workarounds (cron jobs, scheduled runners, etc.) require rebuilding indexes. (See this and this.)

Describe the solution you'd like

New config option: size_based_retention_percentage

Similar to --storage.tsdb.retention.size in Prometheus, it should be possible to programmatically configure a size-based retention policy for Loki.

Based on the complete local example, a new option size_based_retention_percentage can be added:

storage_config:
  boltdb:
    directory: /tmp/loki/index

  filesystem:
    directory: /tmp/loki/chunks
    size_based_retention_percentage: 80

New argument: -store.size_based_retention_percentage

The Loki binary already has a time-based retention argument:

$ /usr/bin/loki -h | grep -A1 "store.retention"
  -store.retention value
        How long before chunks will be deleted from the store. (requires compactor retention enabled). (default 31d)

Following this example, the size_based_retention_percentage option in the config file can be translated into a -store.size_based_retention_percentage argument.
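As a rough illustration only, such a flag could be registered with Go's standard flag package, following the RegisterFlags pattern Loki uses for its other options; the SizeRetentionConfig struct and its field are hypothetical names invented for this sketch, not existing Loki code:

package main

import (
	"flag"
	"fmt"
)

// SizeRetentionConfig is a hypothetical struct holding the proposed option;
// in a real change it would live alongside Loki's existing store config.
type SizeRetentionConfig struct {
	SizeBasedRetentionPercentage int
}

// RegisterFlags mirrors the pattern used for flags such as -store.retention.
func (c *SizeRetentionConfig) RegisterFlags(f *flag.FlagSet) {
	f.IntVar(&c.SizeBasedRetentionPercentage, "store.size_based_retention_percentage", 0,
		"Disk usage percentage above which old chunks are deleted (0 disables size-based retention).")
}

func main() {
	var cfg SizeRetentionConfig
	cfg.RegisterFlags(flag.CommandLine)
	flag.Parse()
	fmt.Println("size_based_retention_percentage:", cfg.SizeBasedRetentionPercentage)
}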

Possible Implementation

Loki’s chunk-tool already contains much of the logic necessary to determine the “real” storage usage of given chunks, along with the datetimes each one covers. This logic could be extended and integrated into an experimental argument to Loki that runs a Ticker. The Ticker would check the disk usage of block volumes, if present in the configuration, and, if usage exceeds the configured threshold, calculate the date/time ranges to delete via the deletion endpoint to bring usage back below that threshold.
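A minimal, Linux-only sketch of such a Ticker loop, assuming the chunks directory and threshold from the configuration example above; this is illustrative code, not part of Loki:

package main

import (
	"log"
	"syscall"
	"time"
)

// diskUsagePercent returns used space as a percentage of total space for the
// filesystem containing path (Linux-only, via syscall.Statfs).
func diskUsagePercent(path string) (float64, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return 0, err
	}
	total := float64(st.Blocks) * float64(st.Bsize)
	avail := float64(st.Bavail) * float64(st.Bsize)
	return (total - avail) / total * 100, nil
}

func main() {
	const chunksDir = "/tmp/loki/chunks" // from the filesystem store config above
	const threshold = 80.0               // size_based_retention_percentage

	ticker := time.NewTicker(5 * time.Minute)
	defer ticker.Stop()

	for range ticker.C {
		usage, err := diskUsagePercent(chunksDir)
		if err != nil {
			log.Printf("checking disk usage: %v", err)
			continue
		}
		if usage > threshold {
			log.Printf("disk usage %.1f%% exceeds %.1f%%, pruning needed", usage, threshold)
			// calculate the date/time range to delete and call the deletion
			// endpoint here, as described in the tentative process below
		}
	}
}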

This Ticker could also expose a new metric describing how fast disk usage is growing (loki_chunk_store_bytes_minutes), and emit logs from Loki itself if the growth rate is high enough to repeatedly trigger pruning.
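A sketch of how that metric could be registered and updated with client_golang; the metric name and its bytes-per-minute semantics are the ones proposed here, not an existing Loki metric:

package retention

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// growthRate tracks how fast the local chunk store is growing, in bytes per minute.
var growthRate = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "loki_chunk_store_bytes_minutes",
	Help: "Rate at which the local chunk store is growing, in bytes per minute.",
})

func init() {
	prometheus.MustRegister(growthRate)
}

// updateGrowthRate implements (actual_disk_usage - last_disk_usage) / minutes_between_reads.
func updateGrowthRate(lastBytes, actualBytes uint64, lastRead, now time.Time) {
	minutes := now.Sub(lastRead).Minutes()
	if minutes <= 0 {
		return
	}
	growthRate.Set((float64(actualBytes) - float64(lastBytes)) / minutes)
}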

Optionally, behavior similar to Ceph's (to determine how aggressive pruning should be) may be possible.

This argument should be configurable to “fail” (logging errors or via another mechanism) if it conflicts with the retention_period.
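One possible interpretation of that check, sketched as config validation; the Config struct, its fields, and the exact definition of “conflict” are assumptions made for illustration:

package retention

import (
	"errors"
	"time"
)

// Config is a hypothetical view of the settings involved in the check.
type Config struct {
	RetentionPeriod              time.Duration // existing -store.retention
	SizeBasedRetentionPercentage int           // proposed option, 0 = disabled
	FailOnConflict               bool          // proposed strict behaviour
}

// Validate refuses to start when both retention mechanisms are enabled and
// strict behaviour was requested, instead of silently combining them.
func (c Config) Validate() error {
	if c.SizeBasedRetentionPercentage < 0 || c.SizeBasedRetentionPercentage > 100 {
		return errors.New("size_based_retention_percentage must be between 0 and 100")
	}
	if c.FailOnConflict && c.SizeBasedRetentionPercentage > 0 && c.RetentionPeriod > 0 {
		return errors.New("size_based_retention_percentage conflicts with retention_period")
	}
	return nil
}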

Tentative deletion process

  • Check disk usage
    • Re-calculate and update loki_chunk_store_bytes_minutes metric:
      • (actual_disk_usage - last_disk_usage) / minutes_between_reads
    • If disk usage is greater than the size_based_retention_percentage:
      • Calculate the amount of data (bytes) to be deleted:
        • (actual_disk_usage_percentage - size_based_retention_percentage) * disk_size
      • Iterate over chunks until the sum of their “Data length” values exceeds the amount of data to be deleted, and take the “Through” value of the last chunk.
      • Perform a log deletion using that “Through” value as the “end” parameter (see the sketch after this list).
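The steps above could be sketched roughly as follows. The chunkInfo struct is a stand-in for what chunk-tool reports per chunk, the example numbers are made up, and the deletion call assumes the compactor's log deletion endpoint (/loki/api/v1/delete with query and end parameters); the exact URL, timestamp format, and matcher should be checked against the deletion API documentation:

package main

import (
	"fmt"
	"net/http"
	"net/url"
	"sort"
	"time"
)

// chunkInfo is a simplified view of what chunk-tool can report per chunk:
// its size on disk and the end (“Through”) of the time range it covers.
type chunkInfo struct {
	DataLength int64
	Through    time.Time
}

// cutoffFor walks chunks oldest-first, summing their sizes until at least
// bytesToDelete is covered, and returns the Through of the last chunk counted.
func cutoffFor(chunks []chunkInfo, bytesToDelete int64) (time.Time, bool) {
	sort.Slice(chunks, func(i, j int) bool { return chunks[i].Through.Before(chunks[j].Through) })
	var sum int64
	for _, c := range chunks {
		sum += c.DataLength
		if sum >= bytesToDelete {
			return c.Through, true
		}
	}
	return time.Time{}, false
}

func main() {
	// Example numbers: a 10 GiB volume at 92% usage with an 80% threshold.
	diskSize := float64(10 << 30)
	usagePct, thresholdPct := 92.0, 80.0

	// (actual_disk_usage_percentage - size_based_retention_percentage) * disk_size
	bytesToDelete := int64((usagePct - thresholdPct) / 100 * diskSize)

	chunks := []chunkInfo{ /* filled from the index / chunk-tool in a real run */ }
	cutoff, ok := cutoffFor(chunks, bytesToDelete)
	if !ok {
		fmt.Println("not enough chunks to free the requested space")
		return
	}

	// Call the deletion endpoint with the cutoff as the “end” parameter.
	q := url.Values{"query": {`{job=~".+"}`}, "end": {fmt.Sprint(cutoff.Unix())}}
	resp, err := http.Post("http://localhost:3100/loki/api/v1/delete?"+q.Encode(), "application/json", nil)
	if err == nil {
		resp.Body.Close()
	}
}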

Describe alternatives you've considered

We thought about implementing an external process to delete old logs based on the size of the volume, like this and this, but that seems too hacky for a professional use case.

Additional context

This is a WIP feature proposal; please feel free to share your thoughts!
