Fetch search phase coordinator duration APM metric#136547
Fetch search phase coordinator duration APM metric#136547chrisparrinello merged 7 commits intoelastic:mainfrom
Conversation
b7e2eeb to
10647e4
Compare
|
Pinging @elastic/es-search-foundations (Team:Search Foundations) |
|
Hi @chrisparrinello, I've created a changelog YAML for you. |
server/src/main/java/org/elasticsearch/action/search/FetchSearchPhase.java
Outdated
Show resolved
Hide resolved
javanna
left a comment
There was a problem hiding this comment.
left one comment around testing, LGTM though. Thanks!
| SearchPhaseController.ReducedQueryPhase reducedQueryPhase, | ||
| long phaseStartTimeInNanos | ||
| ) { | ||
| context.getSearchResponseMetrics().recordSearchPhaseDuration(getName(), System.nanoTime() - phaseStartTimeInNanos); |
There was a problem hiding this comment.
I had to double check the code to make sure that the time tracking is terminated in the right place, hence including all the roundtrips to the shards etc. I am a bit anxious about the tests not covering that. If we always returned 0 or tracking time in the wrong places, we would not catch it. Is it something that we can do something about maybe as a followup?
There was a problem hiding this comment.
Let me summarize what I understand your concerns to be if I understand them correctly from this comment and others:
- Make sure we're recording a sane duration (did we record a proper start time, did we calculate end minus start time properly).
- Did we record the start and end time in the correct location and in the correct sequence of events?
For 1, we do get some protection from the underlying TelemetryProvider implementation which will not allow you to record a negative number in a histogram. Not in this PR but in other PRs where we're keeping track of the start phase time in the instance, we could initialize that to be Long.MAX_VALUE to guarantee that any failure to set the start time of the phase should result in a negative duration. The fact we lose so much resolution converting from nanos to millis kinda limits us here. I think this would be easy to do as a follow-on if that gives us more confidence in these metrics.
For 2, I think we could take this case by case probably at the unit test level for each of the phases. We'd inject wrapped implementations of the two APM gathering classes (phase and coordinator) to record the wall clock time of when they were called, the durations passed in, etc. and then do a sanity check of the timeline of calls (i.e. did all of the shard calls occur before the coordinator phase complete call occur). Could be an interesting exercise as a follow-on.
Adds the following APM metric to track the duration of the fetch phase at the coordinator node: - es.search_response.took_durations.fetch.histogram
Adds the following APM metric to track the duration of the fetch phase at the coordinator node: