Fetch search phase coordinator duration APM metric by chrisparrinello · Pull Request #136547 · elastic/elasticsearch

chrisparrinello · 2025-10-14T15:14:44Z

Adds the following APM metric to track the duration of the fetch phase at the coordinator node:

es.search_response.took_durations.fetch.histogram

elasticsearchmachine · 2025-10-14T15:16:36Z

Pinging @elastic/es-search-foundations (Team:Search Foundations)

elasticsearchmachine · 2025-10-14T15:16:37Z

Hi @chrisparrinello, I've created a changelog YAML for you.

javanna

left one comment around testing, LGTM though. Thanks!

javanna · 2025-10-15T18:05:39Z

server/src/main/java/org/elasticsearch/action/search/FetchSearchPhase.java

+        SearchPhaseController.ReducedQueryPhase reducedQueryPhase,
+        long phaseStartTimeInNanos
    ) {
+        context.getSearchResponseMetrics().recordSearchPhaseDuration(getName(), System.nanoTime() - phaseStartTimeInNanos);


I had to double-check the code to make sure that the time tracking is terminated in the right place, and hence includes all the round trips to the shards etc. I am a bit anxious about the tests not covering that. If we always returned 0 or tracked time in the wrong places, we would not catch it. Is that something we could address, maybe as a follow-up?
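A minimal sketch of how a unit test could catch a duration that is always zero or that stops the timer too early, assuming the phase reads time through an injectable clock. `FakeClockSketch`, `runPhase`, and the `LongSupplier` clock are hypothetical names for illustration only; the PR itself calls `System.nanoTime()` directly.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

final class FakeClockSketch {

    /**
     * Hypothetical phase body: the start time arrives as a parameter (as in the diff above),
     * the simulated shard round trips run, and only then is the end time read.
     */
    static long runPhase(LongSupplier nanoClock, long phaseStartTimeInNanos, Runnable shardRoundTrips) {
        shardRoundTrips.run();
        return nanoClock.getAsLong() - phaseStartTimeInNanos;
    }

    public static void main(String[] args) {
        // The fake clock only advances when the test tells it to.
        AtomicLong fakeNow = new AtomicLong(1_000L);
        long start = fakeNow.get();

        // The simulated round trips advance the clock by 250 ns.
        long duration = runPhase(fakeNow::get, start, () -> fakeNow.addAndGet(250L));

        // A test would assert on this value; a constant 0 or an early stop would not match.
        System.out.println("recorded duration in nanos: " + duration);
    }
}
```

With a deterministic clock, the expected duration is known exactly, so a test can assert on it rather than only checking that some value was recorded.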

Let me summarize your concerns as I understand them from this comment and others:

1. Make sure we're recording a sane duration (did we record a proper start time, did we calculate end minus start time properly).

2. Did we record the start and end time in the correct location and in the correct sequence of events?

For 1, we do get some protection from the underlying TelemetryProvider implementation which will not allow you to record a negative number in a histogram. Not in this PR but in other PRs where we're keeping track of the start phase time in the instance, we could initialize that to be Long.MAX_VALUE to guarantee that any failure to set the start time of the phase should result in a negative duration. The fact we lose so much resolution converting from nanos to millis kinda limits us here. I think this would be easy to do as a follow-on if that gives us more confidence in these metrics.
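To make point 1 concrete, here is a rough sketch of the sentinel idea, assuming a histogram that rejects negative values. `RejectingHistogram`, `UNSET`, and `durationMillis` are hypothetical stand-ins and not the TelemetryProvider API. The sketch also assumes the clock reading is non-negative; `System.nanoTime()` has an arbitrary origin and may return negative values, in which case the subtraction could overflow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

final class SentinelStartTimeSketch {

    /** Hypothetical stand-in for a histogram that refuses negative values. */
    static final class RejectingHistogram {
        final List<Long> recordedMillis = new ArrayList<>();

        void record(long millis) {
            if (millis < 0) {
                throw new IllegalArgumentException("negative duration: " + millis);
            }
            recordedMillis.add(millis);
        }
    }

    /** Sentinel meaning "the phase start time was never set". */
    static final long UNSET = Long.MAX_VALUE;

    /** End minus start, converted from nanos to millis (truncates toward zero). */
    static long durationMillis(long startNanos, long nowNanos) {
        return TimeUnit.NANOSECONDS.toMillis(nowNanos - startNanos);
    }

    public static void main(String[] args) {
        RejectingHistogram histogram = new RejectingHistogram();

        // Normal case: a 5 ms phase is recorded.
        histogram.record(durationMillis(1_000_000L, 6_000_000L));

        // Start time never set: the duration is hugely negative, so the histogram rejects it.
        try {
            histogram.record(durationMillis(UNSET, 6_000_000L));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }

        // Resolution limit: a start time 0.5 ms in the future truncates to 0 ms and slips through.
        histogram.record(durationMillis(6_500_000L, 6_000_000L));
        System.out.println(histogram.recordedMillis);
    }
}
```

The last case shows the limitation: once the value is converted to millis, a sub-millisecond negative duration becomes 0 and is accepted, so only gross errors such as an unset start time are caught.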

For 2, I think we could take this case by case, probably at the unit test level for each of the phases. We'd inject wrapped implementations of the two APM gathering classes (phase and coordinator) to record the wall clock time of when they were called, the durations passed in, etc. and then do a sanity check of the timeline of calls (i.e. did all of the shard calls occur before the coordinator phase-complete call occurred). Could be an interesting exercise as a follow-on.
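A rough sketch of what such a wrapper could look like, under the assumption that both metric classes can be reduced to a single recording method. `DurationRecorder`, `Timeline`, and `Event` are hypothetical names for illustration; the real APM gathering classes have different shapes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class RecordingMetricsSketch {

    /** Hypothetical minimal recorder interface; the real APM classes differ. */
    interface DurationRecorder {
        void record(String phase, long durationNanos);
    }

    /** One observed call: who recorded it, for which phase, with what duration, and when. */
    static final class Event {
        final String source;
        final String phase;
        final long durationNanos;
        final long observedAtNanos;

        Event(String source, String phase, long durationNanos, long observedAtNanos) {
            this.source = source;
            this.phase = phase;
            this.durationNanos = durationNanos;
            this.observedAtNanos = observedAtNanos;
        }
    }

    /** Shared log of calls; wrapped recorders append to it before delegating. */
    static final class Timeline {
        private final List<Event> events = Collections.synchronizedList(new ArrayList<>());

        DurationRecorder wrap(String source, DurationRecorder delegate) {
            return (phase, durationNanos) -> {
                events.add(new Event(source, phase, durationNanos, System.nanoTime()));
                delegate.record(phase, durationNanos);
            };
        }

        List<Event> events() {
            synchronized (events) {
                return new ArrayList<>(events);
            }
        }

        /** True if a coordinator call was seen and no shard call was logged after it. */
        boolean shardCallsPrecedeCoordinator() {
            boolean coordinatorSeen = false;
            for (Event e : events()) {
                if (e.source.equals("coordinator")) {
                    coordinatorSeen = true;
                } else if (coordinatorSeen) {
                    return false;
                }
            }
            return coordinatorSeen;
        }

        /** Guards against a recorder that always reports 0 or a negative value. */
        boolean allDurationsPositive() {
            for (Event e : events()) {
                if (e.durationNanos <= 0) {
                    return false;
                }
            }
            return true;
        }
    }

    public static void main(String[] args) {
        Timeline timeline = new Timeline();
        DurationRecorder noop = (phase, nanos) -> {};
        DurationRecorder shard = timeline.wrap("shard", noop);
        DurationRecorder coordinator = timeline.wrap("coordinator", noop);

        shard.record("fetch", 10L);
        shard.record("fetch", 20L);
        coordinator.record("fetch", 50L);

        System.out.println("ordered: " + timeline.shardCallsPrecedeCoordinator());
        System.out.println("positive: " + timeline.allDurationsPositive());
    }
}
```

The ordering check relies on the order in which calls were appended to the log rather than on comparing timestamps, which keeps it deterministic; the timestamp is kept only for diagnostics.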

elasticsearchmachine added the needs:triage and v9.3.0 labels · Oct 14, 2025

fetch search phase coordinator metric

10647e4

chrisparrinello force-pushed the fetch_phase_coordinator_metric branch from b7e2eeb to 10647e4 · October 14, 2025 15:15

chrisparrinello added the >enhancement, Team:Search Foundations, and :Search Foundations/Search labels and removed the needs:triage label · Oct 14, 2025

Update docs/changelog/136547.yaml

8e4658a

chrisparrinello requested review from javanna and smalyshev October 14, 2025 16:51

smalyshev approved these changes Oct 14, 2025

javanna suggested changes Oct 15, 2025

server/src/main/java/org/elasticsearch/action/search/FetchSearchPhase.java (outdated, resolved)

move start time recording to ensure it is on same thread as execution

b324e1d

chrisparrinello requested a review from javanna October 15, 2025 14:42

chrisparrinello and others added 4 commits October 15, 2025 10:03

Merge branch 'main' into fetch_phase_coordinator_metric

5ee64d7

move phase start time to local variable to guarantee it is set and used on the same thread

d6345a9

[CI] Auto commit changes from spotless

629eb7f

Merge branch 'main' into fetch_phase_coordinator_metric

300db39

javanna approved these changes Oct 15, 2025

chrisparrinello merged commit 7438ee8 into elastic:main Oct 15, 2025
34 checks passed

chrisparrinello merged 7 commits into elastic:main from chrisparrinello:fetch_phase_coordinator_metric