Skip to content

feat(v2): metadata index retention policy #4148

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 26 commits into
base: main
Choose a base branch
from

Conversation

kolesnikovae
Copy link
Collaborator

@kolesnikovae kolesnikovae commented Apr 28, 2025

The change adds a new component that is responsible for enforcement of the retention policy.

It's implemented as follows:

  1. On the leader node, we periodically check which partition shards have passed the retention period threshold. The state is accessed via the Leader Read interface: it always reads the local state.
  2. The component creates tombstones for the partition shards (not the blocks they contain) and issues a new raft command – truncate index.
  3. The command handler removes specified partition shards from the index.
  4. Tombstones then follow the standard tombstones's life-cycle: eventually, they are included into a compaction job.
  5. Once the compaction job is added to the scheduler, we remove the associated tombstone object.
  6. A compaction worker deletes all the objects satisfying the prefix that corresponds the partition tombstone (it includes the partition timestamp).

I added a metric to keep track of the tombstone queue, and a metric to keep track of blocks deleted in compaction worker:

  • pyroscope_metastore_index_tombstones (gauge)
  • pyroscope_compaction_worker_blocks_deleted_total (counter)

The change introduces a new tenant override – retention-period. It works similarly to compactor.blocks-retention-period, but the latter takes precedence if both are set. In the future, we should consider setting a default retention policy; currently, it's unset, and many users aren't aware of this option. Regarding naming: I'm fine with adding a prefix like metastore., but personally, I find it a bit confusing (same with compactor.. Users shouldn't need to know which component implements the option: this option is applied to the tenant, and not to the component. Anyway, I propose to carefully review the CLI and config options later, when we prepare for the release.


To implement the change, I removed the in-memory list of partitions due to the risk of violating transaction isolation. This makes it much easier to reason about the version observed by both writers and readers. Performance testing with synthetic cases and real data showed no negative impact in practice.


The PR also fixes a bug where tombstones could be overwritten due to key collisions (based on Raft command index + append timestamp) at the storage level. This led to some leftover data, which will be cleaned up once a retention policy is enabled.


The change should be rolled out in two steps:

  1. Update the version: all replicas must be updated before the next step.
  2. Enable retention policy via setting the cleanup interval -metastore.index.cleanup-interval=1m.

The change is not forward-compatible: if rolled back, the old FSM won't be able to handle new "unknown" Raft commands: the node will crash trying to reply the WAL, and the rollback won't progress.

@kolesnikovae kolesnikovae changed the title add basic implementation Apr 28, 2025
Comment on lines +89 to +93
logger: logger,
config: cfg,
store: s,
shards: newShardCache(cfg.ShardCacheSize, s),
blocks: newBlockCache(cfg.BlockReadCacheSize, cfg.BlockWriteCacheSize),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I finally removed the in-memory partition list – it's much cleaner now. Otherwise, we would eventually violate transaction isolation guarantees by exchanging state through the list

@kolesnikovae kolesnikovae marked this pull request as ready for review June 30, 2025 08:47
@kolesnikovae kolesnikovae force-pushed the feat/metastore-index-truncation branch from 263d179 to 7ca1578 Compare June 30, 2025 08:57
@kolesnikovae kolesnikovae force-pushed the feat/metastore-index-truncation branch from eac7b8c to 208bf25 Compare June 30, 2025 12:55
@kolesnikovae kolesnikovae force-pushed the feat/metastore-index-truncation branch 2 times, most recently from bd9d32a to 59cfbd8 Compare June 30, 2025 15:47
@kolesnikovae kolesnikovae force-pushed the feat/metastore-index-truncation branch from 59cfbd8 to a453844 Compare June 30, 2025 16:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
1 participant