Allow fast blob-cache introspection by shard-id #138282

Merged
henningandersen merged 8 commits into elastic:main from henningandersen:blob_cache_by_shard
Dec 3, 2025
Conversation

@henningandersen (Contributor)

By changing the key-mapping to be two-layer (shard-id, then region-key), we make introspection by shard-id much faster, exemplified in this PR by the added removeFromCache and forceEvict(shardId, predicate) operations. This can be utilized for more purposes, like pushing down data for shards as it ages or moves out of the node, or removing data entirely when an index is deleted. It may also remove the need for the async force-evict mechanism.

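As a hedged sketch of the idea in the description (class and method names here are illustrative, not the actual Elasticsearch types), the two-layer mapping might look like the following. The point is that per-shard operations such as forceEvict only touch that shard's inner map, instead of scanning every entry in a flat map:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

// Illustrative two-layer key mapping: shard-id first, then region-key.
// Names (ShardedKeyMapping, forceEvict) mirror the PR description but are
// a sketch, not the real cache implementation.
class ShardedKeyMapping<ShardId, RegionKey, Entry> {
    private final ConcurrentHashMap<ShardId, ConcurrentHashMap<RegionKey, Entry>> keyMapping =
        new ConcurrentHashMap<>();

    Entry put(ShardId shard, RegionKey region, Entry entry) {
        return keyMapping.computeIfAbsent(shard, s -> new ConcurrentHashMap<>()).put(region, entry);
    }

    Entry get(ShardId shard, RegionKey region) {
        Map<RegionKey, Entry> inner = keyMapping.get(shard);
        return inner == null ? null : inner.get(region);
    }

    // Introspection by shard is now proportional to that shard's entries,
    // not to all entries on the node: evict matching entries of one shard.
    void forceEvict(ShardId shard, Predicate<Entry> predicate) {
        Map<RegionKey, Entry> inner = keyMapping.get(shard);
        if (inner != null) {
            inner.values().removeIf(predicate);
        }
    }
}
```

The same shape would support the other uses mentioned above (aging data out per shard, dropping a deleted index's shards), since each is a lookup of one inner map.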
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Indexing (obsolete) label Nov 19, 2025
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)

@elasticsearchmachine (Collaborator)

Hi @henningandersen, I've created a changelog YAML for you.

@elasticsearchmachine elasticsearchmachine added the serverless-linked label Nov 19, 2025
@andreidan andreidan left a comment (Contributor)

Thanks for working on this Henning. This generally LGTM, with a couple of concerns.

I left a small suggestion, and want to mention that this has a memory usage implication: for 100K shards we'll now create 100K CHMs.
The impact of the empty CHMs is a few tens of MiB, so perhaps nothing to be too worried about. However, I'm a bit concerned about how they grow internally: the load factor is 0.75 I believe, and they grow in powers of 2. Say we have 24 entries for shard0, so the capacity is 32; when we add one more entry we'll have 25 entries and a capacity of 64 (i.e. 39 empty slots). This could add a few more tens of MiB of memory usage if we're unlucky across 100K shards. Large shards will have many regions, so these maps could be quite large.

Shall we test this in a many-shards environment? Or is my math a bit off here?
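The growth arithmetic in the comment above can be reproduced in a few lines. This is illustrative only: ConcurrentHashMap does not expose its internal table size, so the helper below just encodes the power-of-2 / 0.75 load-factor reasoning, not the map's actual behavior in every detail:

```java
// Illustrative arithmetic only, mirroring the reviewer's reasoning:
// a hash table with a 0.75 load factor and power-of-2 growth.
class ChmSizing {
    // Smallest power-of-two table capacity (starting from the default 16)
    // that holds n entries without exceeding a 0.75 load factor.
    static int tableCapacityFor(int n) {
        int cap = 16;
        while (n > cap * 3 / 4) {
            cap <<= 1;
        }
        return cap;
    }
}
```

Under these assumptions, 24 entries fit in a capacity-32 table, and the 25th entry doubles it to 64, matching the "39 empty slots" example.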

ConcurrentHashMap<RegionKey<KeyType>, LFUCacheEntry> map = keyMapping.get(shard);
if (map != null) {
    boolean removed = map.remove(entry.chunk.regionKey, entry);
    if (map.isEmpty()) {
Contributor

There's a race condition here, I think? I.e. the map might not be empty anymore by the time we move past this if-check.

However, the computeIfPresent already does the "is empty" check. Perhaps this method can be simplified to something like:

private boolean removeKeyMappingForEntry(LFUCacheEntry entry, ShardId shard) {
    boolean[] removed = {false};
    
    keyMapping.computeIfPresent(shard, (k, innerMap) -> {
        removed[0] = innerMap.remove(entry.chunk.regionKey, entry);
        return innerMap.isEmpty() ? null : innerMap;
    });
    
    return removed[0];
}

This way we're accessing the outer map just once and avoiding a race condition on the isEmpty call.

Member

I think I prefer the original code proposed in the PR: it checks that no entry was concurrently inserted in the map inside the computeIfPresent remapping function, and it has no side effect if the remapping function is called/reapplied multiple times.

Contributor Author

I do think there is a race with the get method, which adds entries to the map outside the CHM lock.

Let me see if I can repair it.

I was considering a ConcurrentSkipListMap approach too, but worry slightly about the lookup being more complex.
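For illustration, the ConcurrentSkipListMap alternative mentioned above would use one flat sorted map keyed by a composite (shard, region) key, so all entries of a shard form a contiguous subMap range. The types below are hypothetical stand-ins, and the trade-off noted in the comment is real: every lookup becomes an O(log n) comparison chain instead of a hash probe.

```java
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical sketch of the skip-list alternative: String shard ids and
// long region keys stand in for the real cache key types.
class SkipListCache {
    record CompositeKey(String shardId, long regionKey) implements Comparable<CompositeKey> {
        public int compareTo(CompositeKey o) {
            int c = shardId.compareTo(o.shardId);
            return c != 0 ? c : Long.compare(regionKey, o.regionKey);
        }
    }

    final ConcurrentSkipListMap<CompositeKey, String> map = new ConcurrentSkipListMap<>();

    // All entries of one shard: the closed range [(shard, MIN), (shard, MAX)],
    // which a per-shard eviction could iterate without touching other shards.
    ConcurrentNavigableMap<CompositeKey, String> entriesFor(String shardId) {
        return map.subMap(
            new CompositeKey(shardId, Long.MIN_VALUE), true,
            new CompositeKey(shardId, Long.MAX_VALUE), true
        );
    }
}
```

This avoids the per-shard inner maps (and their empty-map cleanup races) at the cost of the more complex lookup path.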

Contributor Author

Revamped this in 544a784

@tlrx tlrx left a comment (Member)

LGTM

@henningandersen (Contributor Author)

> I left a small suggestion and want to mention this has a memory usage implication as for 100K shards we'll now create 100K CHMs.
> The impact of the empty CHMs in this case is a few tens of MiB so perhaps nothing to be too worried about, however I'm a bit concerned on how they grow internally - load factor is 0.75 I believe and they grow in powers of 2 i.e. say we have 24 entries for shard0, capacity is 32, when we add one more entry we'll have 25 entries and capacity 64 (i.e. 39 empty slots) - this could add a few more tens of MiB in memory usage if we're unlucky for 100K shards. Large shards will have many regions so these maps could be quite large.

My intuition was that it is ok; our shard-level overhead is much bigger. I suppose once we get to lazy shards we could see it change, but just the IndexShard instance is pretty big.

I think the waste in the CHM is not too bad: if we have 25 entries, we spend much more RAM on the actual cache entries (~40 bytes each) than on the waste here (more like 8 bytes per slot).

@henningandersen (Contributor Author)

> concerned on how they grow internally - load factor is 0.75 I believe and they grow in powers of 2 i.e. say we have 24 entries for shard0, capacity is 32, when we add one more entry we'll have 25 entries and capacity 64 (i.e. 39 empty slots)

I think the power-of-2 growth and load factor apply at any scale. While we may be unlucky for one shard that it hits the worst case, we could also be unlucky with the original single map that it just crosses a 2x boundary, leading to a similar amount of waste overall. With the new approach we should see the worst case much less often, since it is very unlikely that 1000 shards all exhibit the worst case, whereas a node growing to just above a 2x boundary seems more likely. So I'd think that, taking out the constant overhead (your prior sentence), this 2x/load-factor overhead would be lower at the p99 with the new approach.

    map.computeIfAbsent(key2, function);
    return map;
});
return inner.get(key2);
Member

Can't we have another thread removing key2 from the inner map between the compute(..) and inner.get(key2) instructions?

If so, computeIfAbsent might return a null value; I think this can cause NPEs later?

Contributor Author

Thanks, you are absolutely right, fixed in fe1c2f9
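One way to close that race is to resolve the inner value inside the outer remapping function, so the result is captured while the outer map's bin lock is held and can never be nulled out by a concurrent removal. This is a sketch under assumed types, not necessarily what commit fe1c2f9 actually does:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Function;

// Sketch: two-level computeIfAbsent that captures the value inside the
// outer remapping function instead of re-reading the inner map afterwards,
// so the caller can never observe null for a value it just computed.
class TwoLevelMap<K1, K2, V> {
    private final ConcurrentHashMap<K1, ConcurrentHashMap<K2, V>> outer = new ConcurrentHashMap<>();

    V computeIfAbsent(K1 key1, K2 key2, Function<K2, V> function) {
        AtomicReference<V> result = new AtomicReference<>();
        outer.compute(key1, (k, inner) -> {
            if (inner == null) {
                inner = new ConcurrentHashMap<>();
            }
            // Resolve the inner entry while holding the outer bin lock.
            result.set(inner.computeIfAbsent(key2, function));
            return inner;
        });
        return result.get();
    }
}
```

The AtomicReference is only a box for smuggling the value out of the lambda; no cross-thread visibility is needed beyond what compute already provides.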

@henningandersen henningandersen requested a review from tlrx November 28, 2025 08:47
@tlrx tlrx left a comment (Member)

LGTM

@andreidan andreidan left a comment (Contributor)

thanks Henning, LGTM

}
}

void forEach(BiConsumer<Key2, Value> consumer) {
Contributor

can we make this private?

Contributor Author

We cannot, it is used in forceEvict here (an unchanged line).

@henningandersen henningandersen merged commit 4c33d20 into elastic:main Dec 3, 2025
35 checks passed
@henningandersen (Contributor Author)

Thanks Tanguy and Andrei!

breskeby pushed a commit to breskeby/elasticsearch that referenced this pull request Feb 11, 2026

Labels

:Distributed/Searchable Snapshots (Searchable snapshots / frozen indices)
>enhancement
serverless-linked
Team:Distributed Indexing (obsolete)
v9.3.0

4 participants