Optimize geogrid functions to read points from doc-values#138917
craigtaverner merged 14 commits into elastic:main
Conversation
Pinging @elastic/es-analytical-engine (Team:Analytics)
Hi @craigtaverner, I've created a changelog YAML for you. |
Older nodes don't know about the optimization, so when running in mixed clusters they will get errors like:

`IllegalStateException: Expected [BYTES_REF] but was [LONG]`
Until now we only supported loading from doc-values when the spatial type was consumed by an aggregation, so this was not needed. But now the geogrid function can be used even without an aggregation. The new rule is that as long as the original spatial type is not returned in the final output (which would require the original source), we can load from doc-values. But TopNExec has an issue with the ElementType in this regard.
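A minimal sketch of the rule described above (all names here are hypothetical stand-ins, not the actual ES|QL planner API): a spatial field is eligible for doc-values loading when it is consumed by a spatial function or aggregation and the original field is not part of the final output.

```java
import java.util.Set;

public class DocValuesEligibility {
    /**
     * Hypothetical sketch: a spatial field may be loaded from doc-values when
     * it is used by a spatial function (e.g. ST_GEOTILE) or aggregation, and
     * the original field is not returned in the final output (which would
     * require reading the exact values from source).
     */
    static boolean canUseDocValues(String field, Set<String> usedBySpatialFunctions, Set<String> finalOutputFields) {
        return usedBySpatialFunctions.contains(field) && !finalOutputFields.contains(field);
    }

    public static void main(String[] args) {
        // EVAL grid = ST_GEOTILE(location, 2) | DROP location | SORT abbrev
        System.out.println(canUseDocValues("location", Set.of("location"), Set.of("grid", "abbrev"))); // true
        // EVAL grid = ST_GEOTILE(location, 2) | SORT abbrev (location still returned)
        System.out.println(canUseDocValues("location", Set.of("location"), Set.of("grid", "abbrev", "location"))); // false
    }
}
```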
It is entirely unneeded since this optimization runs on the data nodes only.
ncordon left a comment
I left a couple of questions and a suggestion to improve the test coverage
```diff
  */
 public class SpatialDocValuesExtraction extends PhysicalOptimizerRules.ParameterizedOptimizerRule<
-    AggregateExec,
+    UnaryExec,
```
Is UnaryExec a physical plan wrapping a single child?
I just needed a parent class in common with both AggregateExec and TopNExec, which are the two types I really care about. I could invent a new class just to bind these two "pipeline breakers" together, but thought it probably worked well enough to use the common parent class. This does mean we will look into this rule a lot more than needed, but I'm assuming the consequences of that are negligible.
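The trade-off described above can be sketched with stand-in classes (none of these are the actual planner classes): the rule is parameterized on the shared supertype, so it is offered every unary node, and it filters down to the two pipeline breakers itself.

```java
// Stand-in classes illustrating matching on a common parent type.
abstract class UnaryPlan {}              // stand-in for UnaryExec
class AggregatePlan extends UnaryPlan {} // stand-in for AggregateExec
class TopNPlan extends UnaryPlan {}      // stand-in for TopNExec
class EvalPlan extends UnaryPlan {}      // some other unary node

public class RuleMatching {
    /** The rule is offered every UnaryPlan, but only acts on the two pipeline breakers. */
    static boolean ruleApplies(UnaryPlan plan) {
        return plan instanceof AggregatePlan || plan instanceof TopNPlan;
    }

    public static void main(String[] args) {
        System.out.println(ruleApplies(new AggregatePlan())); // true
        System.out.println(ruleApplies(new TopNPlan()));      // true
        System.out.println(ruleApplies(new EvalPlan()));      // false: visited but skipped
    }
}
```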
```java
List<Layout.ChannelSet> inverse = source.layout.inverse();
for (int channel = 0; channel < inverse.size(); channel++) {
    elementTypes[channel] = PlannerUtils.toElementType(inverse.get(channel).type());
    var fieldExtractPreference = fieldExtractPreference(topNExec, inverse.get(channel).nameIds());
```
Probably I'm misreading this, but we could have several attributes here in the TopN. Are we going to treat them all as doc-values if any of them are? Is that correct?
The decision should be per-channel, so if one requires doc-values, only that one will get that ElementType change. There should be no leaking between channels.
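A simplified sketch of that per-channel behaviour (types and names here are illustrative, not the actual planner code): each channel's element type is decided on its own, so one channel preferring doc-values does not affect the others.

```java
import java.util.List;

public class PerChannelTypes {
    enum ElementType { BYTES_REF, LONG }

    /** Each channel is mapped independently; a doc-values preference only changes that channel. */
    static ElementType[] elementTypes(List<Boolean> prefersDocValues) {
        ElementType[] types = new ElementType[prefersDocValues.size()];
        for (int channel = 0; channel < types.length; channel++) {
            // Spatial points read from doc-values arrive as encoded longs;
            // points read from source arrive as WKB bytes.
            types[channel] = prefersDocValues.get(channel) ? ElementType.LONG : ElementType.BYTES_REF;
        }
        return types;
    }

    public static void main(String[] args) {
        // Only channel 0 prefers doc-values; channels 1 and 2 stay BYTES_REF.
        ElementType[] types = elementTypes(List.of(true, false, false));
        System.out.println(java.util.Arrays.toString(types)); // [LONG, BYTES_REF, BYTES_REF]
    }
}
```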
```java
if (aggExpr instanceof Alias as && as.child() instanceof SpatialAggregateFunction af) {
    if (af.field() instanceof FieldAttribute fieldAttribute
        && allowedForDocValues(fieldAttribute, ctx.searchStats(), agg, foundAttributes)) {
        foundAttributes.add(fieldAttribute);
```
We are missing tests with geospatial aggregations I think, so this code path is never hit. Could we add one for the csv spec and one for the physical planning tests please?
Actually, this was the previously supported case (before this PR), so there should be lots of tests for this, in both PhysicalPlanOptimizerTests and csv-spec tests. Only the former asserts that doc-values are used, while the latter asserts results are correct. In PhysicalPlanOptimizerTests I see 11 tests using ST_EXTENT_AGG and 9 tests using ST_CENTROID_AGG. These focused on this code path. The new behaviour added in this PR is that we now also use SORT (topN) and non-spatial aggregations to trigger this, and we only added three new tests:
- testSpatialTypesAndSortGeoGridUseDocValues (tests geogrid with SORT)
- testSpatialTypesAndStatsGeoGridUseDocValues (tests geogrid with COUNT aggregation)
- testSpatialTypesAndSortGeoGridUseDocValues2 (enhanced test with sort to test case where we drop or not drop the original spatial column, which affects the choice)
There were also lots of older csv-spec tests that covered spatial aggregations, as well as non-spatial aggregations (usually COUNT). This PR only added tests for the geogrid cases, since those were not covered yet.
Are there specific combinations that you think I've missed? Or perhaps you would like to see more COUNT cases, since I only added one physical planner test for that scenario?
…8917)

The ability to read `geo_point` from doc-values is a per-data-node decision, based on SearchStats for the shards on that data node. As such, we can only perform FieldExtraction from doc-values if the geogrid function is run on the data node. This is currently only ensured when the function occurs before a pipeline breaker (like STATS or SORT), so this optimization only applies to queries where the geogrid function occurs before those commands. We specifically test queries like:

```
FROM airports | EVAL grid = ST_GEOTILE(location, 2) | STATS count=COUNT() BY grid
```

This will be optimized to read from doc-values.

```
FROM airports | EVAL grid = ST_GEOTILE(location, 2) | DROP location | SORT abbrev
```

This will also be optimized to read from doc-values. While a query like this will not be optimized:

```
FROM airports | EVAL grid = ST_GEOTILE(location, 2)
```

And neither will this one, because the original `location` is not dropped:

```
FROM airports | EVAL grid = ST_GEOTILE(location, 2) | SORT abbrev
```

Optimizing these last two requires ensuring the geogrid function is run on the data node. One way to do that would be to push the function into the field loader itself, which is something we plan to do when optimizing support for geo_shape loading, so it will not be attempted in this PR.
Recent [optimizations in geo-grid functions](#138917) exposed some inconsistencies in other spatial functions that can return results either quantized (from doc-values or the Lucene index) or not (from source).

ES|QL typically returns fields from doc-values, for performance reasons. However, for geospatial point data this means a slight loss of precision, because geo_point and cartesian_point are quantized from two doubles (128 bits) down to one long (64 bits), in both doc-values and the Lucene index (but not in stored fields or source). For all real-world use cases the remaining precision is fine, and something most users are willing to trade for the performance advantages. However, that willingness usually only extends to analytics; if the user simply returns the original field, they usually want to see the exact original values. For this reason geospatial data is always returned from source in ES|QL, at a huge performance cost. To compensate, we have implemented a number of optimizations that try to make use of doc-values whenever possible, that is, whenever the user does not return the original points and so will not see the precision loss.

As we've expanded the scope of these optimizations, we've encountered a BWC issue with two particular functions, ST_X and ST_Y (both GA), and a less concerning issue with a group of related functions (tech preview): ST_ENVELOPE, ST_XMAX, ST_XMIN, ST_YMAX and ST_YMIN. This PR fixes the inconsistency, making ST_X, ST_Y and the envelope functions all produce quantized results, so optimizations do not result in different values. Users can still get the original values with full precision simply by not dropping the original geometry field, which will still be read from source. This gives users control over the precision-vs-performance lever: drop the original field to maximize performance, or keep it to see the original precision.
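The quantization trade-off can be illustrated with a simplified encoding (this is not Lucene's exact GeoEncodingUtils formula, just the same idea): each coordinate is scaled to a signed 32-bit integer and both are packed into one 64-bit long, so the round-trip error stays below roughly 1e-7 degrees.

```java
public class PointQuantization {
    // Simplified quantization: latitude in [-90, 90] and longitude in
    // [-180, 180] each scaled to the signed 32-bit integer range.
    static int encodeLat(double lat) { return (int) Math.round(lat / 90.0 * Integer.MAX_VALUE); }
    static int encodeLon(double lon) { return (int) Math.round(lon / 180.0 * Integer.MAX_VALUE); }
    static double decodeLat(int enc) { return enc * 90.0 / Integer.MAX_VALUE; }
    static double decodeLon(int enc) { return enc * 180.0 / Integer.MAX_VALUE; }

    public static void main(String[] args) {
        double lat = 59.914, lon = 10.752;  // two doubles: 128 bits
        // Pack both 32-bit encodings into one long: 64 bits total.
        long packed = ((long) encodeLat(lat) << 32) | (encodeLon(lon) & 0xFFFFFFFFL);
        double lat2 = decodeLat((int) (packed >> 32));
        double lon2 = decodeLon((int) packed);
        // The round-trip error is tiny (sub-centimetre) but generally nonzero,
        // which is why exact values must come from source.
        System.out.println(Math.abs(lat - lat2) < 1e-7); // true
        System.out.println(Math.abs(lon - lon2) < 1e-7); // true
    }
}
```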