Skip to content

Fix SearchContext CB memory accounting#138002

Merged
luigidellaquila merged 17 commits intoelastic:mainfrom
luigidellaquila:fetchphase_cb
Nov 24, 2025
Merged

Fix SearchContext CB memory accounting#138002
luigidellaquila merged 17 commits intoelastic:mainfrom
luigidellaquila:fetchphase_cb

Conversation

@luigidellaquila
Copy link
Contributor

@luigidellaquila luigidellaquila commented Nov 13, 2025

Fixing FetchPhase memory management

Problems and TODO (spotted so far):

  • SearchContext.checkRealMemoryCB doesn't account for CB memory (always zero) - this just triggers Real Memory CB, but it's not enough apparently
  • sometimes FetchPhase.buildSearchHits batches are too small (eg. in TopHits), the memory buffer never accumulates enough to be tracked
  • We don't release CB memory, we only rely on real memory CB
  • plumb TopHitsAggregator memory management lifecycle
  • Add tests triggering CB
  • plumb InnerHitsPhase memory management lifecycle
  • plumb SearchService.execute*Phase() memory management lifecycle
  • Fall back to try/finally logic for scenarios that don't have a clear CB lifecycle
  • TopHitsAggregator.subSearchContext.closeFuture grows too much - this is due to this block, so it's irrelevant in prod.

Fixes: #136836

}
if (context.checkRealMemoryCB(locallyAccumulatedBytes[0], "fetch source")) {
// if we checked the real memory breaker, we restart our local accounting
locallyAccumulatedBytes[0] = 0;
Copy link
Contributor Author

@luigidellaquila luigidellaquila Nov 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the time these batches were too small, so this didn't trigger.

docIdsToLoad[i] = topDocs.scoreDocs[i].doc;
}
FetchSearchResult fetchResult = runFetchPhase(subSearchContext, docIdsToLoad);
FetchSearchResult fetchResult = runFetchPhase(subSearchContext, docIdsToLoad, this::addRequestCircuitBreakerBytes);
Copy link
Contributor Author

@luigidellaquila luigidellaquila Nov 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we batch here and avoid invoking the CB for every document?
Maybe addRequestCircuitBreakerBytes should take care of this?

I suspect that fetching source is way more expensive than invoking the CB, so I'm not sure we want more complication here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe worth checking an esrally benchmark here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll run some aggs nightlies and see what happens

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran nyc-taxis track, and I see no slowdown (actually it seems faster).
Now I'm running NOAA, that contains more aggs

@luigidellaquila luigidellaquila marked this pull request as ready for review November 14, 2025 08:56
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Nov 14, 2025
@elasticsearchmachine
Copy link
Collaborator

Hi @luigidellaquila, I've created a changelog YAML for you.

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

public final boolean checkRealMemoryCB(int locallyAccumulatedBytes, String label) {
if (locallyAccumulatedBytes >= memAccountingBufferSize()) {
circuitBreaker().addEstimateBytesAndMaybeBreak(0, label);
circuitBreaker().addEstimateBytesAndMaybeBreak(locallyAccumulatedBytes, label);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the crux

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The checkRealMemoryCB() functions usually add 0 explicitly to force a memory check; are we sure we want to add memory here?
My major questions here would be:

  1. Are we freeing it later?
  2. Should we rename the method then?
  3. This conditionally adds the memory depending on the value, which may lead to wrongly freeing memory later (?)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized that nobody is using this method anymore (apart from a unit test).

The checkRealMemoryCB() functions usually add 0 explicitly to force a memory check; are we sure we want to add memory here?

Maybe this was the intention at the beginning, but apparently it was not so effective, since in my tests the CB was always empty. And if I remember well, it was a ChildMemoryCircuitBreaker

Are we freeing it later?

With this fix I'm delegating the memory accounting to AggregatorBase, that handles tracking and releasing CB memory, so we should be safe.

Should we rename the method then?

I think we can just delete it

This conditionally adds the memory depending on the value, which may lead to wrongly freeing memory later

I guess, since the CB memory management is delegated to AggregatorBase, we should be safe

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's be conservative here, I'll make the memoryChecker nullable, and in case I'll keep trying to triggering the real memory CB

@luigidellaquila luigidellaquila added the :Search Foundations/Search Catch all for Search Foundations label Nov 17, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch label Nov 17, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

// This works most of the time, but it's not consistent: it still triggers OOM sometimes.
// The test env is too small and non-deterministic to hold all these data and results.
@AwaitsFix(bugUrl = "see comment above")
public void testBreakAndRecover() throws IOException {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd really like to have this test working, but I couldn't find a way to make it pass deterministically; depending on the system conditions sometimes it OOMs before CB, and if I reduce the memory further, it doesn't CB anymore.

We have unit tests for FetchPhase CB, but having an integration test would be really really good.

I don't know if we should keep this here or delete it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW I had a similar problem with the aggs reduction phase memory test, which is now muted: #134667

Test works, but it fails sometimes. I tried tweaking it to have an exact amount of nodes of each kind, as well as docs, queyr limits and CB settings, and it improved, but still flaky.

Maybe you could try forcing a set amount of nodes, like in here:

@ESIntegTestCase.ClusterScope(minNumDataNodes = 1, maxNumDataNodes = 2, numClientNodes = 1)

Less random, but less flaky (luckily 🤞 )

@luigidellaquila
Copy link
Contributor Author

luigidellaquila commented Nov 21, 2025

Thanks for the contribution and for the off-line conversation @drempapis
If you have a chance, please have a look and see if it covers your expectations.
Unfortunately I can't rely on the try/finally logic in the Aggs scenarios, since the batches are too small and I need to keep the memory accounted for the whole duration of the aggregation, that could span hundreds (sometimes thousands) of FetchPhase executions

* @return true if the circuit breaker is called and false otherwise
*/
public final boolean checkRealMemoryCB(int locallyAccumulatedBytes, String label) {
public final boolean checkCircuitBreaker(int locallyAccumulatedBytes, String label) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is no longer RealMemory, but rather normal (Request) CB

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just in case, is there something else using this, that could not be freeing the accounted memory?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the only usage for now

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add to the Javadoc that callers are responsible for decrementing the breaker by the same amount at some point.

Copy link
Contributor

@ivancea ivancea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I would wait for another review if possible, apart of the benchmark, if it makes sense

* @return true if the circuit breaker is called and false otherwise
*/
public final boolean checkRealMemoryCB(int locallyAccumulatedBytes, String label) {
public final boolean checkCircuitBreaker(int locallyAccumulatedBytes, String label) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just in case, is there something else using this, that could not be freeing the accounted memory?

docIdsToLoad[i] = topDocs.scoreDocs[i].doc;
}
FetchSearchResult fetchResult = runFetchPhase(subSearchContext, docIdsToLoad);
FetchSearchResult fetchResult = runFetchPhase(subSearchContext, docIdsToLoad, this::addRequestCircuitBreakerBytes);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe worth checking an esrally benchmark here?

// This works most of the time, but it's not consistent: it still triggers OOM sometimes.
// The test env is too small and non-deterministic to hold all these data and results.
@AwaitsFix(bugUrl = "see comment above")
public void testBreakAndRecover() throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW I had a similar problem with the aggs reduction phase memory test, which is now muted: #134667

Test works, but it fails sometimes. I tried tweaking it to have an exact amount of nodes of each kind, as well as docs, queyr limits and CB settings, and it improved, but still flaky.

Maybe you could try forcing a set amount of nodes, like in here:

@ESIntegTestCase.ClusterScope(minNumDataNodes = 1, maxNumDataNodes = 2, numClientNodes = 1)

Less random, but less flaky (luckily 🤞 )

….java

Co-authored-by: Iván Cea Fontenla <ivancea96@outlook.com>
@luigidellaquila
Copy link
Contributor Author

Buildkite benchmark this with nyc_taxis-1n-8g please

@luigidellaquila
Copy link
Contributor Author

Buildkite benchmark this with wikipedia please

@luigidellaquila
Copy link
Contributor Author

Buildkite benchmark this with noaa-1n-1g please

@elasticmachine
Copy link
Collaborator

elasticmachine commented Nov 24, 2025

💚 Build Succeeded

This build ran two noaa-1n-1g benchmarks to evaluate performance impact of this PR.

History


private long requestBreakerBytes;

public void addRequestBreakerBytes(long delta) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It’d be good to document that this field is strictly for fetch-phase memory accounting, will be reversed at the end of the FetchPhase, and shouldn’t be accessed or modified by subclasses. (if any added in the future)

BytesReference sourceRef = hit.hit().getSourceRef();
if (sourceRef != null) {
locallyAccumulatedBytes[0] += sourceRef.length();
// This is an empirical value that seems to work well.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To simplify the logic, we could create an IntConsumer once when constructing the FetchPhaseDocsIterator

IntConsumer checker = memoryChecker != null
    ? memoryChecker
    : bytes -> {
        locallyAccumulatedBytes[0] += bytes;
        if (context.checkCircuitBreaker(locallyAccumulatedBytes[0], "fetch source")) {
            addRequestBreakerBytes(locallyAccumulatedBytes[0]);
            locallyAccumulatedBytes[0] = 0;
        }
    };

and simply call

if (sourceRef != null) {
    checker.accept(sourceRef.length() * 2);
}

We should double check if that works

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @drempapis, I'll give it a try

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this change and yes, the code looks much better now 👍

@luigidellaquila
Copy link
Contributor Author

Thanks for the reviews

@ivancea the benchmarks didn't show any significant changes in performance.

@drempapis please let me know if you have any further comments, otherwise I'll more forward and merge.

Thanks!

Copy link
Contributor

@drempapis drempapis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thank you @luigidellaquila

@luigidellaquila luigidellaquila enabled auto-merge (squash) November 24, 2025 13:05
@luigidellaquila luigidellaquila merged commit d2b4355 into elastic:main Nov 24, 2025
34 checks passed
afoucret pushed a commit to afoucret/elasticsearch that referenced this pull request Nov 26, 2025
ncordon pushed a commit to ncordon/elasticsearch that referenced this pull request Nov 26, 2025
} finally {
long bytes = docsIterator.getRequestBreakerBytes();
if (bytes > 0L) {
context.circuitBreaker().addWithoutBreaking(-bytes);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@luigidellaquila @drempapis Sorry for the late comment here. Thanks for iterating on this.

I think we're releasing the bytes from the circuit breaker too soon here? The response has not been sent to the client yet so these hits are still in memory.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreidan I agree, we should do better here.
IMHO the right approach would be to manage the CB allocation/release together with the request lifecycle. That's what I did for Aggs, using a memoryChecker that was already part of the aggregation memory accounting, but for Search I couldn't find anything similar. I don't know that part of the codebase in depth though, so I could have missed something obvious.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreidan, we missed it!
I’ve started an alternative here: #139124. It’s WIP and expected to take more time.

I’ll work on it in parallel and make it a priority, given the complexity of the alternative.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the additional context. While #139124 (which looks like an excellent way to move forward) is under development, how confident are we with this change ? Do we have an overview of which queries and scenarios it improves over the previous implementation? and perhaps more importantly where it might fall a bit short to what we had previously?

From the PR description it seems that the thing that we fixed was:

sometimes FetchPhase.buildSearchHits batches are too small (eg. in TopHits), the memory buffer never accumulates enough to be tracked

but we didn't have to replace the real CB to achieve this. Did we consider triggering the real memory CB on a doc count too? (i.e. accumulated 1MiB in source size or retrieved 1024 documents)

Copy link
Contributor Author

@luigidellaquila luigidellaquila Dec 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreidan looking at previous discussions (eg. this and the internal ES-12944), my impression is that we don't want to rely too much on the real memory CB.

I started this PR mostly to address memory problems with TopHits (as for the comment you mentioned above); my original implementation was intended to use the request breaker when the fetch phase was part of Aggs, but based on the discussions with @drempapis (again, see the internal ES-12944) we ended up using it also for the general case (ie. for Search). I'll let @drempapis comment on this and on the possible regressions, as he probably has more visibility the Search side. For Aggs we are definitely in a much better position now.

Did we consider triggering the real memory CB on a doc count too? (i.e. accumulated 1MiB in source size or retrieved 1024 documents)

Not that I know

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In ES-12944 we analyzed a core issue: on nodes with small JVM heaps, large per-document _source payloads combined with large size values, cause the fetch phase to materialize a very large SearchHit[] fully in memory. This is a limitation of the current fetch implementation, which constructs the entire hit array and all _source bytes on-heap before returning anything to the client.

For search memory accounting, the change introduced here ensures that _source bytes are now charged to the request cb. This helps by tripping earlier in scenarios that previously bypassed breaker enforcement. However, it only partially improves the situation. It does not fully resolve the underlying low-heap, large-source problem because the fetch phase continues to eagerly materialize the full response on-heap.

Even when the request breaker is active, its default limit (60 percent of heap) is often too high to protect a small-heap node once baseline memory usage is included. The parent-memory breaker only fires at very high global heap occupancy and usually reacts too late, after most of the large objects are already live.

I opened a PR to mitigate the issue of releasing bytes from the circuit breaker too early while the searchHits are still resident in memory. This is a short-term fix until a more complete solution will be implemented.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah excellent, thanks for working on #139243 Dimi ❤️ Awesome we're jumping right on it to fix it!
Let me know when it's ready and I'll review it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/Aggregations Aggregations >bug :Search Foundations/Search Catch all for Search Foundations Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch v9.3.0

6 participants