Can Match and DFS search phase shard search APM metrics #135285
chrisparrinello wants to merge 14 commits into elastic:main
Conversation
Force-pushed from d22b164 to 6624bad
Pinging @elastic/es-search-foundations (Team:Search Foundations)

Hi @chrisparrinello, I've created a changelog YAML for you.
javanna left a comment:
I left a couple of comments. DFS looks good; can match needs some adjustments. Thanks!
```java
    isSystem = ((EsExecutors.EsThread) Thread.currentThread()).isSystem();
} else {
    isSystem = false;
}
```
Is this fixing an issue you experienced? I thought we are guaranteed to always have an instance of EsThread here.
Yes, the issue is that a lot of tests start failing because the search is run on a generic Thread rather than an EsThread, causing ClassCastException errors.
Would a better approach be to use EsExecutors.EsThread when running those tests? Not sure what the consequences would be, but it might be better than the instanceof check.
I am a bit confused on this: isn't this existing code? Does this fail on main, or only in this PR without your change?
I backed out this change and so far CI tests are passing. I think I ran into this problem while working on the coordinator-level metrics, so I'll tackle it again when I create the PR for those changes.
It does not seem like you backed it out; the change is still in the diff.
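As a side note on the ClassCastException discussed above, a minimal standalone sketch of the defensive pattern: EsThread here is a local stand-in for EsExecutors.EsThread, not the real Elasticsearch class.

```java
public class ThreadCheckSketch {
    // Stand-in for EsExecutors.EsThread from the PR discussion.
    static class EsThread extends Thread {
        private final boolean system;
        EsThread(boolean system) { this.system = system; }
        boolean isSystem() { return system; }
    }

    // A direct cast throws ClassCastException when tests run the search on a
    // plain Thread; the instanceof pattern degrades to "not a system thread".
    static boolean isSystem(Thread t) {
        return t instanceof EsThread esThread && esThread.isSystem();
    }

    public static void main(String[] args) {
        System.out.println(isSystem(new Thread()));       // plain test thread
        System.out.println(isSystem(new EsThread(true))); // system EsThread
    }
}
```

The trade-off debated in the thread is exactly this: the instanceof guard makes the code tolerant of plain threads, at the cost of silently treating them as non-system instead of failing fast.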
```java
    maxKeepAlive
);
CanMatchShardResponse canMatchResp = canMatch(canMatchContext, false);
opsListener.onCanMatchPhase(orig, System.nanoTime() - beforeCanMatchTime);
```
This isn't the right can-match call to instrument :) This is indeed confusing: we call can match in multiple places, and this one is part of the query phase itself. What I'd like to track is the can-match latency of the separate can-match roundtrip we have, CanMatchPreFilterSearchPhase. The originating call on data nodes is SearchService#canMatch(CanMatchNodeRequest request, ActionListener<CanMatchNodeResponse> listener). I would think the tracking fits in that specific method.
Got it. I'll move the instrumentation there. 🤞 the opsListener is available there; that also seems to be a sticking point with some of this stuff.
I moved the call, but as a result I had to inject the ShardSearchPhaseAPMMetrics singleton into the Transport*Action classes.
You are now tracking the coordinator rewrite phase, which is something else :) Can you check the method I pointed you to above? I believe that's where we need to record the execution time for can match on the shards: SearchService#canMatch(CanMatchNodeRequest request, ActionListener<CanMatchNodeResponse> listener).

This stuff is quite hard to follow in the code. Can match runs first on the coord node, as a rewrite step to filter out indices that the coord node knows cannot possibly match. We don't want to track the latency of this, because it does not involve roundtrips to the nodes. Next, an optional can-match search phase is executed, which goes to all the data nodes involved and tries to filter out shards that cannot possibly match based on metadata that's only available on the data nodes (without executing the query!). This is the round we want to measure latency for, both at the coord node (later) and at the shard level (in this PR).

Note that the method I pointed you to is executed on each data node and loops over all the searched shards that are allocated on that node. I am thinking we should track the can-match phase at the data node level here, as opposed to for each shard. Makes sense?
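To make the suggestion concrete, a hedged standalone sketch of timing a listener-based, node-level can-match call: the names are illustrative stand-ins, not the real SearchService or ActionListener API.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

public class CanMatchTimingSketch {
    // Stand-in for a telemetry histogram recording milliseconds.
    final AtomicLong lastRecordedMillis = new AtomicLong(-1);

    // Wrap the response listener so that latency is recorded exactly once per
    // data-node round trip, covering all shards handled on that node.
    <R> Consumer<R> timed(Consumer<R> listener) {
        long startNanos = System.nanoTime();
        return response -> {
            lastRecordedMillis.set(TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos));
            listener.accept(response);
        };
    }

    public static void main(String[] args) {
        CanMatchTimingSketch sketch = new CanMatchTimingSketch();
        Consumer<String> wrapped = sketch.<String>timed(r -> System.out.println("got " + r));
        wrapped.accept("response");
        System.out.println(sketch.lastRecordedMillis.get() >= 0);
    }
}
```

The point of wrapping the listener, rather than instrumenting each shard loop iteration, is that a single measurement captures the whole node-level round trip.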
Resolved review threads:
- server/src/main/java/org/elasticsearch/index/shard/SearchOperationListener.java
- ...src/test/java/org/elasticsearch/search/TelemetryMetrics/ShardSearchPhaseAPMMetricsTests.java
- server/src/main/java/org/elasticsearch/action/search/SearchRequestAttributesExtractor.java
```java
assert request.canReturnNullResponseIfMatchNoDocs() == false || request.numberOfShards() > 1
    : "empty responses require more than one shard";
final IndexShard shard = getShard(request);
final var opsListener = shard.getSearchOperationListener();
```
I wonder why we aren't using SearchTimeProvider? Maybe not relevant for this patch, just curious.
I think during a few iterations of working on this, SearchTimeProvider wasn't accessible (or didn't even exist) in some phases and would have had to be pushed down the call chain. It seemed like a lot of work for something that wraps System.nanoTime calls, and since we're not testing the durations we don't need a mock provider with deterministic times to assert against.
How would you use SearchTimeProvider to track how long an operation took on a specific shard? Doesn't that provide the start time of the search on the coord node?
@javanna SearchTimeProvider.relativeCurrentTimeNanosProvider is set to System::nanoTime in the code.
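The observation above is that a relative-time provider wrapping System::nanoTime measures a local duration just as well as calling System.nanoTime() directly; a minimal sketch (the provider name mirrors the thread, the rest is illustrative):

```java
import java.util.function.LongSupplier;

public class RelativeTimeSketch {
    // Measure how long an operation takes using an injected time source.
    // Passing System::nanoTime reproduces the direct-call behavior; a test
    // could pass a deterministic supplier instead.
    static long measureNanos(LongSupplier relativeTimeNanos, Runnable op) {
        long start = relativeTimeNanos.getAsLong();
        op.run();
        return relativeTimeNanos.getAsLong() - start;
    }

    public static void main(String[] args) {
        long took = measureNanos(System::nanoTime, () -> { });
        System.out.println(took >= 0);
    }
}
```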
Resolved review threads:
- server/src/main/java/org/elasticsearch/index/shard/SearchOperationListener.java
smalyshev left a comment:
Generally looks OK; I left some nitpick comments. I'm not familiar with these parts, so I'll let Luca confirm that's what we need.
```java
assert assertSearchCoordinationThread();
final List<SearchShardIterator> matchedShardLevelRequests = new ArrayList<>();
for (SearchShardIterator searchShardIterator : shardsIts) {
    long startTime = System.nanoTime();
```
I don't see this used anywhere?
Right, it's not. This was from a previous iteration of this PR. I'll remove it.
```diff
 }

-    List<SearchOperationListener> getSearchOperationListener() { // pkg private for testing
+    public List<SearchOperationListener> getSearchOperationListener() { // pkg private for testing
```
A better name would use the plural: getSearchOperationListeners. It took me a few minutes to realize it actually returns a list of listeners rather than a single listener. Also, the comment is no longer accurate.
Agreed. Will change.
```java
indexService = readerContext.indexService();
QueryRewriteContext queryRewriteContext = canMatchContext.getQueryRewriteContext(indexService);
if (queryStillMatchesAfterRewrite(canMatchContext.request, queryRewriteContext) == false) {
    indexService.getSearchOperationListener()
```
This particular construct repeats in several places; maybe extract a private method for it? Or place it in a finally clause? Or both?
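A standalone sketch of the suggested refactor: the repeated listener notification hoisted into one private helper invoked from a finally block, so every exit path records the timing. All names here are illustrative, not the real Elasticsearch types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongConsumer;

public class NotifyOnceSketch {
    static boolean canMatch(List<LongConsumer> listeners, boolean result) {
        long start = System.nanoTime();
        try {
            return result; // stand-in for the real can-match logic
        } finally {
            // Single notification point: runs on every path out, including early returns.
            notifyCanMatch(listeners, System.nanoTime() - start);
        }
    }

    private static void notifyCanMatch(List<LongConsumer> listeners, long tookNanos) {
        for (LongConsumer listener : listeners) {
            listener.accept(tookNanos);
        }
    }

    public static void main(String[] args) {
        List<Long> recorded = new ArrayList<>();
        canMatch(List.of(recorded::add), true);
        System.out.println(recorded.size()); // one measurement per call
    }
}
```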
```java
assertEquals(0, queryMeasurements.size());
final List<Measurement> fetchMeasurements = getTestTelemetryPlugin().getLongHistogramMeasurement(FETCH_SEARCH_PHASE_METRIC);
assertEquals(0, fetchMeasurements.size());
}
```
One thing missing for me in these tests is that they don't check the actual measurements; they only check size(), not the values themselves. Is this enough? I know timings are hard to predict, at least in tests, but maybe at least ensure they're not always zero? I think we have special query syntax to make the query slower; not sure if it would apply here... or maybe some transport behaviors.
Another question I'm curious about: can any of these phases fail? Are we interested in knowing when they do?
Yeah, checking the values inside the unit tests is going to be problematic because the measurements are in milliseconds, and it is highly unlikely we'll see a duration longer than a millisecond for some of them. I did check while running via gradlew run, and there are non-zero measurements for some of these in the APM server logs.
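The reason the test histogram values come out as zero is simple truncation: sub-millisecond durations convert to 0 milliseconds. A small sketch of the arithmetic:

```java
import java.util.concurrent.TimeUnit;

public class MillisTruncation {
    public static void main(String[] args) {
        long fastNanos = 250_000;    // 0.25 ms, plausible for a trivial test query
        long slowNanos = 2_500_000;  // 2.5 ms

        // toMillis truncates toward zero, so sub-millisecond phases record 0.
        System.out.println(TimeUnit.NANOSECONDS.toMillis(fastNanos)); // 0
        System.out.println(TimeUnit.NANOSECONDS.toMillis(slowNanos)); // 2
    }
}
```

This is why asserting on measurement counts rather than values is the pragmatic choice in unit tests, while non-zero values only show up against a real APM server.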
```java
b.bind(ShutdownPrepareService.class).toInstance(shutdownPrepareService);
b.bind(OnlinePrewarmingService.class).toInstance(onlinePrewarmingService);
b.bind(MergeMetrics.class).toInstance(mergeMetrics);
b.bind(ShardSearchPhaseAPMMetrics.class).toInstance(shardSearchPhaseAPMMetrics);
```
I'm curious: why is this addition necessary, and what does it do? It's not a new class, so what changed?
This might be leftover from a previous iteration where I wanted to inject it into one of the services (IndicesService?) and guarantee it is a singleton. I'll check whether it is safe to remove, which I think it is at this point.
I think it is safe to remove, so I'll remove it.
Force-pushed from 3dd8c0c to d43748a
```java
@Override
public void onDfsPhase(SearchContext searchContext, long tookInNanos) {
    SearchExecutionContext searchExecutionContext = searchContext.getSearchExecutionContext();
    Long rangeTimestampFrom = searchExecutionContext.getRangeTimestampFrom();
```
Did you see whether the DFS phase reports the time range attribute? I would think this does nothing, because I would not expect DFS to actually parse the query; it only collects statistics and runs the knn search. Unless I am missing something, that would be a reason to just pass null for DFS too.
I am also a bit wary of getting the search execution context. I would not be surprised if there are cases where it is null.
I ran a check, and it doesn't report the time range given one of the time range queries from the unit tests. Would it be safer to just record the per-shard took time and not try to record any of the search attributes? I haven't seen any cases so far where the search execution context is null.
```java
    canMatchContext.getIndexService().getSearchOperationListeners(),
    origShardSearchRequest,
    startTime
);
```
I would like to record this in only one place: the can-match phase that requires a separate roundtrip. This change records the metric in many different places, which is not necessary. The method that takes an action listener should do it; that's the only place where we need to record it. See the previous comment at #135285 (comment).

The rationale for doing it there only is to focus on the actual can-match phase that requires a roundtrip to the data nodes, and to have some correlation between the coord node latency of that phase and the data node latency of it.

In the method I pointed you to, you'll find a loop. That code executes on each data node, looping over the shard targets. I am now even wondering whether we want to track the latency at the shard level or at the data node level. The latter may make more sense; we don't need shard granularity here, it's overkill.

That does complicate things around attributes, as you won't have a ShardSearchRequest. I'd change the signature of the listener method to take a CanMatchNodeRequest. Introspecting that should work, and what is not available there we can probably skip. This does require a bit more plumbing in SearchRequestAttributesExtractor, but I prefer it over cloning the shard request, which is a bit intrusive.

In hindsight, it would have been wiser to split DFS and can match. I would still do that: get DFS in as it's the simplest, then focus on can match and see if that can also be split up further.
Okay, I'll pull the DFS instrumentation out into a separate PR as a (hopefully) quick merge, and then look into the can-match changes you suggested above.
Oh, doing it per data node takes away the ability to rely on SearchOperationListener for the callback, as that is per index/shard (depending on whether it's retrieved from IndexService or IndexShard). Maybe that creates too many artificial problems, as we don't have a listener for data-node-level search events.
This deserves a bit more thinking. Let me know if the trade-offs here make sense to you? I have yet to make up my mind entirely :)
Yeah, getting access to the SearchOperationListener (via something that can indirectly call the ShardSearchPhaseAPMMetrics object) while also having the ShardSearchRequest to extract the metric attributes has been a bit of a struggle here.
If we want a per-data-node can-match metric, I'd do it via another APM metrics handling class, to keep ShardSearchPhaseAPMMetrics related only to shard metrics. The problem with that is getting access to it in some of the canMatch methods. I've been down this road a bit already, injecting the APMMetrics objects into the service objects in the Guice-y stuff in the NodeConstructor.
Implements the following APM metrics to record the per-shard duration of these search phases: