ESQL: Push more ==s on text fields to lucene#126641
Conversation
If you do: ``` | WHERE text_field == "cat" ``` we can't push to the text field because it's search index is for individual words. But most text fields have a `.keyword` sub field and we *can* query it's index. EXCEPT! It's normal for these fields to have `ignore_above` in their mapping. In that case we don't push to the field. Very sad. With this change we can push down `==`, but only when the right hand side is shorter than the `ignore_above`. This has pretty much infinite speed gain. An example using a million documents: ``` Before: "took" : 391, After: "took" : 4, ``` But this is going from totally un-indexed linear scans to totally indexed. You can make the "Before" number as high as you want by loading more data.
|
Hi @nik9000, I've created a changelog YAML for you. |
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/stats/SearchContextStats.java
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java
Outdated
Show resolved
Hide resolved
|
I don't believe this works for |
|
Pinging @elastic/es-analytical-engine (Team:Analytics) |
| if (query instanceof SingleValueQuery) { | ||
| // Already wrapped | ||
| return query; | ||
| } |
There was a problem hiding this comment.
I'm not super proud of this. I kind of thing we should remove this and have folks wrap the query they build themselves. But not in this PR.
There was a problem hiding this comment.
This exists so Equals can have a different behavior - it checks the value count of the synthetic source delegate.....
Wait. What if we remove one? Oh no.
There was a problem hiding this comment.
Ok. I've added a fix for this. I'll push a javadoc explaining it.
|
I think this is ready for a real review! |
|
@luigidellaquila could you have a look at this one too? |
I've added this test. |
| * </p> | ||
| * <p> | ||
| * You may be asking "how would the first {@code text_field.raw:foo} query work if the | ||
| * value we're searching for is very long? In that case we never use this query at all. |
There was a problem hiding this comment.
In that case we never use this query at all
Nit: I wonder if this should be a bigger warning at the top of the javdoc. I could imagine somebody (of us) trying to use this for something, and adding a bug because of it 👀 (Not perfect anyway)
| // end::rlikeEscapingTripleQuotes-result[] | ||
| ; | ||
|
|
||
| mvStringEquals |
There was a problem hiding this comment.
Do we have a test for an literal over ignored_above chars + MV?
There was a problem hiding this comment.
Literals don't have ignore_above.
| @timestamp:date ,message:text | ||
| 2023-10-23T13:55:01.543Z,[Connected to 10.1.0.1, Banana] | ||
| 2023-10-23T13:55:01.544Z,Connected to 10.1.0.1 | ||
| 2023-10-23T13:55:01.545Z,[Connected to 10.1.0.1, More than one hundred characters long so it isn't indexed by the sub keyword field with ignore_above:100] |
There was a problem hiding this comment.
What about adding also a single-value over ignore_above? So we have all the cases here
|
|
||
| @Override | ||
| public TransportVersion getMinimalSupportedVersion() { | ||
| throw new UnsupportedOperationException(); |
There was a problem hiding this comment.
Is this not serialized because it's always translated in the local node?
There was a problem hiding this comment.
Right! I'll leave a comment.
costin
left a comment
There was a problem hiding this comment.
LGTM. It looks to me that LucenePushdownPredicates.DEFAULT is used always - how about using that instance directly in code instead of passing it around through TranslatorAware interface?
|
The serverless failure looks real. Digging into that. |
|
At the cost of basically an entire day I've discovered that the serverless
test failure had nothing to do with serverless. It's actually a bug with
the rewrite mechanism of our SingleValueMatchQuery - we think that the
query is match_all when it shouldn't be. I'm able to reproduce with an
index with two documents - one that contains "foo" and the other that
contains ["foo", "bar"]. To hit this you have to have the same number of
distinct terms as documents. If each doc has a distinct term we'd *correct*
rewrite this to match_none. But if there are duplicates we will *still*
rewrite it, this time incorrectly. I'll open a separate PR with this and
backport it.
RE the DEFAULT pushdown - it's not used in one critical place - during the
last layer of rewrites.
…On Mon, Apr 21, 2025 at 12:39 PM Costin Leau ***@***.***> wrote:
***@***.**** approved this pull request.
LGTM. It looks to me that LucenePushdownPredicates.DEFAULT is used always
- how about using that instance directly in code instead of passing it
around through TranslatorAware interface?
—
Reply to this email directly, view it on GitHub
<#126641 (review)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AABUXISSP4NNO6BDQQYWLY322UNNXAVCNFSM6AAAAAB24RI6WCVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDOOBRG42DCNBWG4>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
|
Also! It has to be on a comparison with `keyword` fields. Not number fields.
…On Mon, Apr 21, 2025 at 5:49 PM Nikolas Everett ***@***.***> wrote:
At the cost of basically an entire day I've discovered that the serverless
test failure had nothing to do with serverless. It's actually a bug with
the rewrite mechanism of our SingleValueMatchQuery - we think that the
query is match_all when it shouldn't be. I'm able to reproduce with an
index with two documents - one that contains "foo" and the other that
contains ["foo", "bar"]. To hit this you have to have the same number of
distinct terms as documents. If each doc has a distinct term we'd *correct*
rewrite this to match_none. But if there are duplicates we will *still*
rewrite it, this time incorrectly. I'll open a separate PR with this and
backport it.
RE the DEFAULT pushdown - it's not used in one critical place - during the
last layer of rewrites.
On Mon, Apr 21, 2025 at 12:39 PM Costin Leau ***@***.***>
wrote:
> ***@***.**** approved this pull request.
>
> LGTM. It looks to me that LucenePushdownPredicates.DEFAULT is used always
> - how about using that instance directly in code instead of passing it
> around through TranslatorAware interface?
>
> —
> Reply to this email directly, view it on GitHub
> <#126641 (review)>,
> or unsubscribe
> <https://github.com/notifications/unsubscribe-auth/AABUXISSP4NNO6BDQQYWLY322UNNXAVCNFSM6AAAAAB24RI6WCVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDOOBRG42DCNBWG4>
> .
> You are receiving this because you were mentioned.Message ID:
> ***@***.***>
>
|
|
While looking to extend this to |
The PR elastic#126641 has a bug with `!=`.
The PR #126641 has a bug with `!=`.
If you do: ``` | WHERE text_field == "cat" ``` we can't push to the text field because it's search index is for individual words. But most text fields have a `.keyword` sub field and we *can* query it's index. EXCEPT! It's normal for these fields to have `ignore_above` in their mapping. In that case we don't push to the field. Very sad. With this change we can push down `==`, but only when the right hand side is shorter than the `ignore_above`. This has pretty much infinite speed gain. An example using a million documents: ``` Before: "took" : 391, After: "took" : 4, ``` But this is going from totally un-indexed linear scans to totally indexed. You can make the "Before" number as high as you want by loading more data.
The PR elastic#126641 has a bug with `!=`.
|
Backporting with #128156 |
If you do:
we can't push to the text field because it's search index is for individual words. But most text fields have a
.keywordsub field and we can query it's index. EXCEPT! It's normal for these fields to haveignore_abovein their mapping. In that case we don't push to the field. Very sad.With this change we can push down
==, but only when the right hand side is shorter than theignore_above.This has pretty much infinite speed gain. An example using a million documents:
But this is going from totally un-indexed linear scans to totally indexed. You can make the "Before" number as high as you want by loading more data.