Allow fast blob-cache introspection by shard-id #138282

Merged
henningandersen merged 8 commits into elastic:main from henningandersen:blob_cache_by_shard
Dec 3, 2025
Conversation

@henningandersen (Contributor)

By changing the key-mapping to be two-layer (shard-id, then region-key), we make introspection by shard-id much faster, exemplified in this PR by the added removeFromCache and forceEvict(shardId, predicate) operations. This can be utilized for more purposes, like pushing down data for shards as it ages or moves out of the node, or removing data entirely when an index is deleted. It may also remove the need for the async force-evict mechanism.

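As a hedged sketch of the idea in the description (class and method names here are illustrative, not the actual Elasticsearch types), the two-layer mapping might look like the following. The point is that per-shard operations such as forceEvict only touch that shard's inner map, instead of scanning every entry in a flat map:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

// Illustrative two-layer key mapping: shard-id first, then region-key.
// Names (ShardedKeyMapping, forceEvict) mirror the PR description but are
// a sketch, not the real cache implementation.
class ShardedKeyMapping<ShardId, RegionKey, Entry> {
    private final ConcurrentHashMap<ShardId, ConcurrentHashMap<RegionKey, Entry>> keyMapping =
        new ConcurrentHashMap<>();

    Entry put(ShardId shard, RegionKey region, Entry entry) {
        return keyMapping.computeIfAbsent(shard, s -> new ConcurrentHashMap<>()).put(region, entry);
    }

    Entry get(ShardId shard, RegionKey region) {
        Map<RegionKey, Entry> inner = keyMapping.get(shard);
        return inner == null ? null : inner.get(region);
    }

    // Introspection by shard is now proportional to that shard's entries,
    // not to all entries on the node: evict matching entries of one shard.
    void forceEvict(ShardId shard, Predicate<Entry> predicate) {
        Map<RegionKey, Entry> inner = keyMapping.get(shard);
        if (inner != null) {
            inner.values().removeIf(predicate);
        }
    }
}
```

The same shape would support the other uses mentioned above (aging data out per shard, dropping a deleted index's shards), since each is a lookup of one inner map.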
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Indexing (obsolete) label Nov 19, 2025
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)

@elasticsearchmachine (Collaborator)

Hi @henningandersen, I've created a changelog YAML for you.

@elasticsearchmachine elasticsearchmachine added the serverless-linked label Nov 19, 2025
@andreidan andreidan left a comment (Contributor)

Thanks for working on this Henning. This generally LGTM, with a couple of concerns.

I left a small suggestion, and want to mention that this has a memory usage implication: for 100K shards we'll now create 100K CHMs.
The impact of the empty CHMs is a few tens of MiB, so perhaps nothing to be too worried about. However, I'm a bit concerned about how they grow internally: the load factor is 0.75 I believe, and they grow in powers of 2. Say we have 24 entries for shard0, so the capacity is 32; when we add one more entry we'll have 25 entries and a capacity of 64 (i.e. 39 empty slots). This could add a few more tens of MiB of memory usage if we're unlucky across 100K shards. Large shards will have many regions, so these maps could be quite large.

Shall we test this in a many-shards environment? Or is my math a bit off here?
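The growth arithmetic in the comment above can be reproduced in a few lines. This is illustrative only: ConcurrentHashMap does not expose its internal table size, so the helper below just encodes the power-of-2 / 0.75 load-factor reasoning, not the map's actual behavior in every detail:

```java
// Illustrative arithmetic only, mirroring the reviewer's reasoning:
// a hash table with a 0.75 load factor and power-of-2 growth.
class ChmSizing {
    // Smallest power-of-two table capacity (starting from the default 16)
    // that holds n entries without exceeding a 0.75 load factor.
    static int tableCapacityFor(int n) {
        int cap = 16;
        while (n > cap * 3 / 4) {
            cap <<= 1;
        }
        return cap;
    }
}
```

Under these assumptions, 24 entries fit in a capacity-32 table, and the 25th entry doubles it to 64, matching the "39 empty slots" example.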

ConcurrentHashMap<RegionKey<KeyType>, LFUCacheEntry> map = keyMapping.get(shard);
if (map != null) {
    boolean removed = map.remove(entry.chunk.regionKey, entry);
    if (map.isEmpty()) {
Contributor

There's a race condition here, I think? I.e. the map might not be empty anymore by the time we move past this if-check.

However, the computeIfPresent already does the "is empty" check. Perhaps this method can be simplified to something like:

private boolean removeKeyMappingForEntry(LFUCacheEntry entry, ShardId shard) {
    boolean[] removed = {false};
    
    keyMapping.computeIfPresent(shard, (k, innerMap) -> {
        removed[0] = innerMap.remove(entry.chunk.regionKey, entry);
        return innerMap.isEmpty() ? null : innerMap;
    });
    
    return removed[0];
}

This way we're accessing the outer map just once and avoiding a race condition on the isEmpty call.

Member

I think I prefer the original code proposed in the PR: it checks that no entry was concurrently inserted in the map inside the computeIfPresent remapping function, and it has no side effect if the remapping function is called/reapplied multiple times.

Contributor Author

I do think there is a race with the get method, which adds entries to the map outside the CHM lock.

Let me see if I can repair it.

I was considering a ConcurrentSkipListMap approach too, but worry slightly about the lookup being more complex.
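For illustration, the ConcurrentSkipListMap alternative mentioned above would use one flat sorted map keyed by a composite (shard, region) key, so all entries of a shard form a contiguous subMap range. The types below are hypothetical stand-ins, and the trade-off noted in the comment is real: every lookup becomes an O(log n) comparison chain instead of a hash probe.

```java
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical sketch of the skip-list alternative: String shard ids and
// long region keys stand in for the real cache key types.
class SkipListCache {
    record CompositeKey(String shardId, long regionKey) implements Comparable<CompositeKey> {
        public int compareTo(CompositeKey o) {
            int c = shardId.compareTo(o.shardId);
            return c != 0 ? c : Long.compare(regionKey, o.regionKey);
        }
    }

    final ConcurrentSkipListMap<CompositeKey, String> map = new ConcurrentSkipListMap<>();

    // All entries of one shard: the closed range [(shard, MIN), (shard, MAX)],
    // which a per-shard eviction could iterate without touching other shards.
    ConcurrentNavigableMap<CompositeKey, String> entriesFor(String shardId) {
        return map.subMap(
            new CompositeKey(shardId, Long.MIN_VALUE), true,
            new CompositeKey(shardId, Long.MAX_VALUE), true
        );
    }
}
```

This avoids the per-shard inner maps (and their empty-map cleanup races) at the cost of the more complex lookup path.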

Contributor Author

Revamped this in 544a784

@tlrx tlrx left a comment (Member)

LGTM

@henningandersen (Contributor Author)

> I left a small suggestion and want to mention this has a memory usage implication as for 100K shards we'll now create 100K CHMs.
> The impact of the empty CHMs in this case is a few tens of MiB so perhaps nothing to be too worried about, however I'm a bit concerned on how they grow internally - load factor is 0.75 I believe and they grow in powers of 2 i.e. say we have 24 entries for shard0, capacity is 32, when we add one more entry we'll have 25 entries and capacity 64 (i.e. 39 empty slots) - this could add a few more tens of MiB in memory usage if we're unlucky for 100K shards. Large shards will have many regions so these maps could be quite large.

My intuition was that it is ok; our shard-level overhead is much bigger. I suppose once we get to lazy shards we could see it change, but just the IndexShard instance is pretty big.

I think the waste in the CHM is not too bad: if we have 25 entries, we spend much more RAM on the actual cache entries (~40 bytes each) than on the waste here (more like 8 bytes per slot).

@henningandersen (Contributor Author)

> concerned on how they grow internally - load factor is 0.75 I believe and they grow in powers of 2 i.e. say we have 24 entries for shard0, capacity is 32, when we add one more entry we'll have 25 entries and capacity 64 (i.e. 39 empty slots)

I think the power-of-2 growth and load factor apply at any scale. While we may be unlucky for one shard that it hits the worst case, we could also be unlucky with the original single map that it just crosses a 2x boundary, leading to a similar amount of waste overall. With the new approach we should see the worst case much less often, since it is very unlikely that 1000 shards all exhibit the worst case, whereas a node growing to just above a 2x boundary seems more likely. So I'd think that, taking out the constant overhead (your prior sentence), this 2x/load-factor overhead would be lower at the p99 with the new approach.

    map.computeIfAbsent(key2, function);
    return map;
});
return inner.get(key2);
Member

Can't we have another thread removing key2 from the inner map between the compute(..) and inner.get(key2) instructions?

If so, computeIfAbsent might return a null value; I think this can cause NPEs later?

Contributor Author

Thanks, you are absolutely right, fixed in fe1c2f9
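One way to close that race is to resolve the inner value inside the outer remapping function, so the result is captured while the outer map's bin lock is held and can never be nulled out by a concurrent removal. This is a sketch under assumed types, not necessarily what commit fe1c2f9 actually does:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Function;

// Sketch: two-level computeIfAbsent that captures the value inside the
// outer remapping function instead of re-reading the inner map afterwards,
// so the caller can never observe null for a value it just computed.
class TwoLevelMap<K1, K2, V> {
    private final ConcurrentHashMap<K1, ConcurrentHashMap<K2, V>> outer = new ConcurrentHashMap<>();

    V computeIfAbsent(K1 key1, K2 key2, Function<K2, V> function) {
        AtomicReference<V> result = new AtomicReference<>();
        outer.compute(key1, (k, inner) -> {
            if (inner == null) {
                inner = new ConcurrentHashMap<>();
            }
            // Resolve the inner entry while holding the outer bin lock.
            result.set(inner.computeIfAbsent(key2, function));
            return inner;
        });
        return result.get();
    }
}
```

The AtomicReference is only a box for smuggling the value out of the lambda; no cross-thread visibility is needed beyond what compute already provides.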

@henningandersen henningandersen requested a review from tlrx November 28, 2025 08:47
@tlrx tlrx left a comment (Member)

LGTM

@andreidan andreidan left a comment (Contributor)

thanks Henning, LGTM

}
}

void forEach(BiConsumer<Key2, Value> consumer) {
Contributor

can we make this private?

Contributor Author

We cannot, it is used in forceEvict here (an unchanged line).

@henningandersen henningandersen merged commit 4c33d20 into elastic:main Dec 3, 2025
35 checks passed
@henningandersen (Contributor Author)

Thanks Tanguy and Andrei!

breskeby pushed a commit to breskeby/elasticsearch that referenced this pull request Feb 11, 2026

Labels

:Distributed/Searchable Snapshots (Searchable snapshots / frozen indices)
>enhancement
serverless-linked
Team:Distributed Indexing (obsolete)
v9.3.0

4 participants