Inconsistent Geospatial quantized results for point coordinates in ES|QL #139943

@craigtaverner

Description

ES|QL typically returns fields from doc-values for performance reasons. For geospatial point data, however, this means a slight loss of precision, because geo_point and cartesian_point are quantized from two doubles (128 bits) down to one long (64 bits), both in doc-values and in the Lucene index (but not in stored fields or source). The remaining precision is fine for all real-world use cases, and most users are willing to trade it for the performance advantages. That willingness usually only extends to analytics, though: a user who simply returns the original field usually wants to see the exact original values. For this reason ES|QL returns geospatial data from source, at a large performance cost.

We have implemented a number of optimizations that use doc-values whenever possible, that is, whenever the user does not return the original points and so will not see the precision loss. As we've expanded the scope of these optimizations, we've encountered a BWC issue with two particular functions, ST_X and ST_Y, and a less concerning issue with a group of related functions: ST_ENVELOPE, ST_XMAX, ST_XMIN, ST_YMAX and ST_YMIN. The latter group is mostly used with shapes and so is rarely a concern. This issue focuses on ST_X and ST_Y because they apply to points only.
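To make the precision loss concrete, here is a minimal Python sketch of the quantization for geo_point. It assumes a scheme with 32 bits per axis, where each coordinate is floored onto a uniform grid and decoded back by multiplying with the grid step, which is our understanding of how Lucene encodes latitude and longitude. The function names and constants are illustrative only and are not part of any Elasticsearch or Lucene API. cartesian_point is encoded differently and is not covered here.

```python
import math

# Assumed grid steps: 32 bits per axis over the full latitude/longitude range.
LAT_SCALE = 180.0 / 2**32
LON_SCALE = 360.0 / 2**32
MAX_ENCODED = 2**31 - 1  # largest signed 32-bit value


def quantize_lat(lat: float) -> float:
    """Return the latitude as it would read back after 32-bit quantization."""
    encoded = min(math.floor(lat / LAT_SCALE), MAX_ENCODED)  # clamp so 90.0 fits
    return encoded * LAT_SCALE


def quantize_lon(lon: float) -> float:
    """Return the longitude as it would read back after 32-bit quantization."""
    encoded = min(math.floor(lon / LON_SCALE), MAX_ENCODED)  # clamp so 180.0 fits
    return encoded * LON_SCALE


if __name__ == "__main__":
    lat, lon = 48.858844, 2.294351
    print("source:    ", lat, lon)
    print("doc-values:", quantize_lat(lat), quantize_lon(lon))
```

Under these assumptions the difference between the source value and the quantized value is below one grid step (roughly 4.2e-8 degrees of latitude and 8.4e-8 degrees of longitude), and quantizing an already quantized value returns it unchanged.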

In all releases of Elasticsearch since ST_X and ST_Y were introduced in 8.14 (as preview, then GA in 8.17), these functions have returned the same values as source. In 9.3.0, however, we've introduced new geo-grid functions, which trigger a doc-values optimization that causes ST_X and ST_Y to sometimes return values from doc-values. This means the user can see slightly different results for the same data depending on whether the optimization was applied. This scenario is likely rare and many users would not notice the difference, but it can occur and it is visible to anyone who compares the results.

We suggest changing ST_X and ST_Y to always return quantized results. They will then behave consistently, regardless of whether the underlying optimizations have been enabled. If the user wishes to see the original values at full precision, they can simply return the original field. Dropping the original field is in fact necessary to trigger the optimization. With this approach, the only consequence of dropping the original field will be a performance boost, instead of the current behaviour of slightly changing the results of those functions.

The proposal is also consistent with the behaviour of other geospatial functions, which always work with quantized values. For example, the ST_INTERSECTS function performs quantized intersections regardless of whether the intersection is run within the Lucene index or within the ES|QL compute engine. So the proposal here is to make geospatial quantization more consistent across all ES|QL spatial functions.

Related work:
