Quantize ST_X, ST_Y and related functions#140963
Quantize ST_X, ST_Y and related functions#140963craigtaverner merged 14 commits intoelastic:mainfrom
Conversation
|
Hi @craigtaverner, I've created a changelog YAML for you. |
ℹ️ Important: Docs version tagging👋 Thanks for updating the docs! Just a friendly reminder that our docs are now cumulative. This means all 9.x versions are documented on the same page and published off of the main branch, instead of creating separate pages for each minor version. We use applies_to tags to mark version-specific features and changes. Expand for a quick overviewWhen to use applies_to tags:✅ At the page level to indicate which products/deployments the content applies to (mandatory) What NOT to do:❌ Don't remove or replace information that applies to an older version 🤔 Need help?
|
d33a5f7 to
2e5ca74
Compare
|
Pinging @elastic/es-analytical-engine (Team:Analytics) |
🔍 Preview links for changed docs |
| @@ -137,6 +137,24 @@ protected NodeInfo<? extends Expression> info() { | |||
| } | |||
|
|
|||
| static void buildEnvelopeResults(BytesRefBlock.Builder results, Rectangle rectangle) { | |||
There was a problem hiding this comment.
Can we call this method buildCartesinanEnvelopeResults? it is confusing as it is now. And probably we can add a method that takes a encoder to avoid duplication.
There was a problem hiding this comment.
OK. The missing Cartesian problem was in many places (notably evaluators), so I fixed them all by doing a general cleanup making sure all evaluators have the correct Geo/Cartesian naming, as well as putting it always before the WKB/DocValues (some were after). I'll push this, and then work on reducing code duplication, which is also in more places than just this one method.
There was a problem hiding this comment.
OK. I've entirely removed duplicated code, by also using the pattern we used for sharing state in StGeotile. We create the visitor once per thread, instead of once per page, making fewer objects (the same optimization we used in the geo-grid functions). And I made sure we used the embedded SpatialCoordinateTypes inside the resultsBuilder, so we did not need multiple methods.
There was a problem hiding this comment.
And I was therefor also able to reduce the total number of generated evaluator classes. I imagine I could use this trick elsewhere to cut the generated code down a bit. But for now I think this is enough for this PR!
…cValues This increases similarity and consistency and reduces confusion
Also removed incorrect multiple quantization for doc-values results
We were using a supplier for the results builder, but mistakenly passing in a shared points visitor, instead of making a new one when the builder was initiated.
…the results builders
…ch into quantize_st_xy
ncordon
left a comment
There was a problem hiding this comment.
LGTM just left a couple of questions
| The spatial types `geo_point`, `geo_shape`, `cartesian_point` and `cartesian_shape` are maintained at source precision in the original documents, | ||
| but indexed at reduced precision by Lucene, for performance reasons. | ||
| To ensure this optimization is available in the widest context, all [spatial functions](/reference/query-languages/esql/functions-operators/spatial-functions.md) will produce results | ||
| at this reduced precision, aligned with the underlying Lucene index grid. |
There was a problem hiding this comment.
Are we explaining what's the Lucene grid in the docs somewhere before this?
There was a problem hiding this comment.
Good point. I understood we described this in the Query DSL and ingest docs, but should actually confirm this. And make links. I've discussed briefly with the docs team, and we want to re-review these docs next week and do a followup PR with any touch-ups.
| greaterThan(0) | ||
| ); | ||
| for (int column = 1; column < 8; column++) { | ||
| if (index > 0) { |
There was a problem hiding this comment.
I couldn't understand why this condition is here so I commented it out and tests pass. Why?
There was a problem hiding this comment.
The test is mostly comparing all indexes to index 0 (which is the fully indexed case). So uncommenting that line will compare index 0 to itself, which is fine, but redundant. See line 186 for where we get index 0 from as the baseline to compare against.
There was a problem hiding this comment.
But I see your point. This test was based on logic in other tests that have only one column and compare all indexes to the first index. But this test also compares other columns to column 0, and that comparison is valid for index 0 also. So that line should be removed, as you suggest. Pity this PR is already merged! A followup would make sense.
Since ST_X and ST_Y return slightly different precision in older versions, mixed clusters will return no results due to precision differences in the WHERE clause. We could consider relaxing equality for spatial types, but that is expensive (needs to convert WKB to Points, quantize and then back to WKB).
Recent [optimizations in geo-grid functions](elastic#138917) exposed some inconsistencies in other spatial functions that can return results either quantized (doc-values or lucene index) or not (source). ES|QL typically returns fields from doc-values, for performance reasons. However, for geospatial point data, this means a slight loss of precision, because geo_point and cartesian_point are quantized from two doubles (128bits) down to one long (64bits), in both doc-values and the lucene index (but not stored fields, or source). For all real-world use cases the remaining precision is fine, and something most users are willing to trade for the performance advantages. However, that willingness usually only extends to analytics, and if the user simply returns the original field, they usually want to see the exact original values. For this reason geospatial data is always returned from source in ES|QL, at a huge performance hit. However, we have implemented a number of optimizations that try to make use of doc-values whenever possible, and whenever the user does not return the original points so they will not see the precision loss. As we've expanded the scope of these optimizations, we've encountered a BWC issue with two particular functions, ST_X and ST_Y (both GA), and a less concerning issue with a group of related functions (tech-preview): ST_ENVELOPE, ST_XMAX, ST_XMIN, ST_YMAX and ST_YMIN. This PR fixes the inconsistency, making ST_X, ST_Y and the envelope functions all produce quantized results, so optimizations do not result in different values. Users can still get the original values with full prevision simply by not dropping the original geometry field, which will still be read from source. This gives users control over the precision-vs-performance lever. Just drop the original field to maximize performance, or keep it to see the original precision.
Recent [optimizations in geo-grid functions](#138917) exposed some inconsistencies in other spatial functions that can return results either quantized (doc-values or lucene index) or not (source). ES|QL typically returns fields from doc-values, for performance reasons. However, for geospatial point data, this means a slight loss of precision, because geo_point and cartesian_point are quantized from two doubles (128bits) down to one long (64bits), in both doc-values and the lucene index (but not stored fields, or source). For all real-world use cases the remaining precision is fine, and something most users are willing to trade for the performance advantages. However, that willingness usually only extends to analytics, and if the user simply returns the original field, they usually want to see the exact original values. For this reason geospatial data is always returned from source in ES|QL, at a huge performance hit. However, we have implemented a number of optimizations that try to make use of doc-values whenever possible, and whenever the user does not return the original points so they will not see the precision loss. As we've expanded the scope of these optimizations, we've encountered a BWC issue with two particular functions, ST_X and ST_Y (both GA), and a less concerning issue with a group of related functions (tech-preview): ST_ENVELOPE, ST_XMAX, ST_XMIN, ST_YMAX and ST_YMIN. This PR fixes the inconsistency, making ST_X, ST_Y and the envelope functions all produce quantized results, so optimizations do not result in different values. Users can still get the original values with full prevision simply by not dropping the original geometry field, which will still be read from source. This gives users control over the precision-vs-performance lever. Just drop the original field to maximize performance, or keep it to see the original precision.
Recent optimizations in geo-grid functions exposed some inconsistencies in other spatial functions that can return results either quantized (doc-values or lucene index) or not (source).
ES|QL typically returns fields from doc-values, for performance reasons. However, for geospatial point data, this means a slight loss of precision, because geo_point and cartesian_point are quantized from two doubles (128bits) down to one long (64bits), in both doc-values and the lucene index (but not stored fields, or source). For all real-world use cases the remaining precision is fine, and something most users are willing to trade for the performance advantages. However, that willingness usually only extends to analytics, and if the user simply returns the original field, they usually want to see the exact original values. For this reason geospatial data is always returned from source in ES|QL, at a huge performance hit. However, we have implemented a number of optimizations that try to make use of doc-values whenever possible, and whenever the user does not return the original points so they will not see the precision loss. As we've expanded the scope of these optimizations, we've encountered a BWC issue with two particular functions, ST_X and ST_Y (both GA), and a less concerning issue with a group of related functions (tech-preview): ST_ENVELOPE, ST_XMAX, ST_XMIN, ST_YMAX and ST_YMIN.
This PR fixes the inconsistency, making ST_X, ST_Y and the envelope functions all produce quantized results, so optimizations do not result in different values.
Users can still get the original values with full prevision simply by not dropping the original geometry field, which will still be read from source. This gives users control over the precision-vs-performance lever. Just drop the original field to maximize performance, or keep it to see the original precision.
Fixes #139943