Skip to content

Fetch search phase coordinator duration APM metric#136547

Merged
chrisparrinello merged 7 commits intoelastic:mainfrom
chrisparrinello:fetch_phase_coordinator_metric
Oct 15, 2025
Merged

Fetch search phase coordinator duration APM metric#136547
chrisparrinello merged 7 commits intoelastic:mainfrom
chrisparrinello:fetch_phase_coordinator_metric

Conversation

@chrisparrinello
Copy link
Contributor

Adds the following APM metric to track the duration of the fetch phase at the coordinator node:

  • es.search_response.took_durations.fetch.histogram
@elasticsearchmachine elasticsearchmachine added needs:triage Requires assignment of a team area label v9.3.0 labels Oct 14, 2025
@chrisparrinello chrisparrinello force-pushed the fetch_phase_coordinator_metric branch from b7e2eeb to 10647e4 Compare October 14, 2025 15:15
@chrisparrinello chrisparrinello added >enhancement Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch :Search Foundations/Search Catch all for Search Foundations and removed needs:triage Requires assignment of a team area label labels Oct 14, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

@elasticsearchmachine
Copy link
Collaborator

Hi @chrisparrinello, I've created a changelog YAML for you.

Copy link
Contributor

@javanna javanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left one comment around testing, LGTM though. Thanks!

SearchPhaseController.ReducedQueryPhase reducedQueryPhase,
long phaseStartTimeInNanos
) {
context.getSearchResponseMetrics().recordSearchPhaseDuration(getName(), System.nanoTime() - phaseStartTimeInNanos);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to double check the code to make sure that the time tracking is terminated in the right place, hence including all the roundtrips to the shards etc. I am a bit anxious about the tests not covering that. If we always returned 0 or tracking time in the wrong places, we would not catch it. Is it something that we can do something about maybe as a followup?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me summarize what I understand your concerns to be if I understand them correctly from this comment and others:

  1. Make sure we're recording a sane duration (did we record a proper start time, did we calculate end minus start time properly).
  2. Did we record the start and end time in the correct location and in the correct sequence of events?

For 1, we do get some protection from the underlying TelemetryProvider implementation which will not allow you to record a negative number in a histogram. Not in this PR but in other PRs where we're keeping track of the start phase time in the instance, we could initialize that to be Long.MAX_VALUE to guarantee that any failure to set the start time of the phase should result in a negative duration. The fact we lose so much resolution converting from nanos to millis kinda limits us here. I think this would be easy to do as a follow-on if that gives us more confidence in these metrics.

For 2, I think we could take this case by case probably at the unit test level for each of the phases. We'd inject wrapped implementations of the two APM gathering classes (phase and coordinator) to record the wall clock time of when they were called, the durations passed in, etc. and then do a sanity check of the timeline of calls (i.e. did all of the shard calls occur before the coordinator phase complete call occur). Could be an interesting exercise as a follow-on.

@chrisparrinello chrisparrinello merged commit 7438ee8 into elastic:main Oct 15, 2025
34 checks passed
Kubik42 pushed a commit to Kubik42/elasticsearch that referenced this pull request Oct 16, 2025
Adds the following APM metric to track the duration of the fetch phase at the coordinator node:
- es.search_response.took_durations.fetch.histogram
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>enhancement :Search Foundations/Search Catch all for Search Foundations Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch v9.3.0

4 participants