ESQL: Push more `==`s on text fields to lucene by nik9000 · Pull Request #126641 · elastic/elasticsearch

nik9000 · 2025-04-10T20:06:46Z

If you do:

| WHERE text_field == "cat"

we can't push to the text field because its search index is for individual words. But most text fields have a .keyword sub field and we can query its index. EXCEPT! It's normal for these fields to have ignore_above in their mapping. In that case we don't push to the field. Very sad.

With this change we can push down ==, but only when the right hand side is shorter than the ignore_above.

This has pretty much infinite speed gain. An example using a million documents:

Before:  "took" : 391,
 After:  "took" :   4,

But this is going from totally un-indexed linear scans to totally indexed. You can make the "Before" number as high as you want by loading more data.
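The pushdown rule above can be sketched in a few lines. This is a minimal illustration, not the code in this PR: the class and method names are hypothetical, and the exact boundary (`<=`, measured with `String.length()`) is an assumption based on `ignore_above` skipping only strings longer than the limit. The reason the short-literal case is safe is that any value too long to be indexed in the keyword sub field cannot equal a literal that fits under the limit.

```java
import java.util.OptionalInt;

/** Hypothetical sketch; none of these names come from the Elasticsearch code base. */
final class TextEqualsPushdownSketch {
    /**
     * Decides whether {@code text_field == literal} may be answered from the
     * keyword sub field's index.
     *
     * @param literal            right-hand side of the comparison
     * @param keywordIgnoreAbove ignore_above of the sub field, empty if it has no limit
     */
    static boolean canPushToKeywordSubField(String literal, OptionalInt keywordIgnoreAbove) {
        if (!keywordIgnoreAbove.isPresent()) {
            // No limit: every value is indexed, so the index can answer any equality.
            return true;
        }
        // Assumption: values no longer than ignore_above are indexed. A longer
        // literal might only match un-indexed values, so it must not be pushed.
        return literal.length() <= keywordIgnoreAbove.getAsInt();
    }

    public static void main(String[] args) {
        System.out.println(canPushToKeywordSubField("cat", OptionalInt.of(256)));           // true
        System.out.println(canPushToKeywordSubField("x".repeat(300), OptionalInt.of(256))); // false
    }
}
```

When the check returns false the comparison stays on the slow, un-indexed path, which is the "Before" behaviour measured above.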

elasticsearchmachine · 2025-04-10T20:07:19Z

Hi @nik9000, I've created a changelog YAML for you.

nik9000 · 2025-04-14T17:25:15Z

I don't believe this works for != but I'll open a followup that should handle that.

elasticsearchmachine · 2025-04-14T17:41:31Z

Pinging @elastic/es-analytical-engine (Team:Analytics)

nik9000 · 2025-04-14T17:42:31Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/planner/TranslatorHandler.java

+        if (query instanceof SingleValueQuery) {
+            // Already wrapped
+            return query;
+        }


I'm not super proud of this. I kind of think we should remove this and have folks wrap the query they build themselves. But not in this PR.

This exists so Equals can have a different behavior - it checks the value count of the synthetic source delegate.....

Wait. What if we remove one? Oh no.

Ok. I've added a fix for this. I'll push a javadoc explaining it.

I pushed 600257a.

nik9000 · 2025-04-14T18:56:11Z

I think this is ready for a real review!

nik9000 · 2025-04-18T18:02:34Z

@luigidellaquila could you have a look at this one too?

nik9000 · 2025-04-21T13:13:11Z

I owe @luigidellaquila a test for pushing to lowercase. It's almost done. Just running a local test run one last time.

I've added this test.

ivancea

LGTM!

ivancea · 2025-04-21T15:12:59Z

.../plugin/esql/src/main/java/org/elasticsearch/xpack/esql/querydsl/query/SingleValueQuery.java

+     * </p>
+     * <p>
+     *     You may be asking "how would the first {@code text_field.raw:foo} query work if the
+     *     value we're searching for is very long? In that case we never use this query at all.


In that case we never use this query at all

Nit: I wonder if this should be a bigger warning at the top of the javadoc. I could imagine somebody (one of us) trying to use this for something and introducing a bug because of it 👀 (Not perfect anyway)

ivancea · 2025-04-21T15:20:47Z

x-pack/plugin/esql/qa/testFixtures/src/main/resources/string.csv-spec

 // end::rlikeEscapingTripleQuotes-result[]
 ;
+
+mvStringEquals


Do we have a test for a literal over ignore_above chars + MV?

Literals don't have ignore_above.

ivancea · 2025-04-21T15:21:59Z

x-pack/plugin/esql/qa/testFixtures/src/main/resources/data/mv_text.csv

+@timestamp:date         ,message:text
+2023-10-23T13:55:01.543Z,[Connected to 10.1.0.1, Banana]
+2023-10-23T13:55:01.544Z,Connected to 10.1.0.1
+2023-10-23T13:55:01.545Z,[Connected to 10.1.0.1, More than one hundred characters long so it isn't indexed by the sub keyword field with ignore_above:100]


What about also adding a single-value row over ignore_above? Then we'd have all the cases here.

Yeah. I should do that.

ivancea · 2025-04-21T15:32:03Z

...src/main/java/org/elasticsearch/xpack/esql/querydsl/query/EqualsSyntheticSourceDelegate.java

+
+        @Override
+        public TransportVersion getMinimalSupportedVersion() {
+            throw new UnsupportedOperationException();


Is this not serialized because it's always translated in the local node?

Right! I'll leave a comment.

costin

LGTM. It looks to me like LucenePushdownPredicates.DEFAULT is always used - how about using that instance directly in code instead of passing it around through the TranslatorAware interface?

nik9000 · 2025-04-21T17:01:23Z

The serverless failure looks real. Digging into that.

nik9000 · 2025-04-21T21:50:08Z

At the cost of basically an entire day I've discovered that the serverless test failure had nothing to do with serverless. It's actually a bug with the rewrite mechanism of our SingleValueMatchQuery - we think that the query is match_all when it shouldn't be.

I'm able to reproduce with an index with two documents - one that contains "foo" and the other that contains ["foo", "bar"]. To hit this you have to have the same number of distinct terms as documents. If each doc has a distinct term we'd *correctly* rewrite this to match_none. But if there are duplicates we will *still* rewrite it, this time incorrectly. I'll open a separate PR with this and backport it.

RE the DEFAULT pushdown - it's not used in one critical place - during the last layer of rewrites.
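The two-document reproduction of the rewrite bug can be modelled without Lucene at all. The sketch below is a toy model of the symptom only, under the assumption that the shortcut boils down to comparing the number of distinct terms with the number of documents; it is not the actual SingleValueMatchQuery code, and every name in it is hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the symptom; not the real SingleValueMatchQuery logic. */
final class SingleValueRewriteSketch {
    /** Assumed shortcut: "distinct terms == docs" is taken as proof every doc is single-valued. */
    static boolean shortcutSaysAllSingleValued(List<List<String>> docs) {
        Set<String> distinctTerms = new HashSet<>();
        for (List<String> values : docs) {
            distinctTerms.addAll(values);
        }
        return distinctTerms.size() == docs.size();
    }

    /** Ground truth: inspect every document's value count. */
    static boolean reallyAllSingleValued(List<List<String>> docs) {
        for (List<String> values : docs) {
            if (values.size() != 1) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The reproduction from the comment: "foo" and ["foo", "bar"].
        List<List<String>> repro = List.of(List.of("foo"), List.of("foo", "bar"));
        System.out.println(shortcutSaysAllSingleValued(repro)); // true, because 2 terms == 2 docs
        System.out.println(reallyAllSingleValued(repro));       // false, the second doc is multi-valued
    }
}
```

The duplicate "foo" is what lets the counts line up: the multi-valued document contributes only one new term, so the shortcut cannot tell it apart from a single-valued one.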

nik9000 · 2025-04-21T21:57:53Z

Also! It has to be on a comparison with `keyword` fields. Not number fields.

nik9000

Here's the fix for the serverless issue: #127146

nik9000 · 2025-04-22T21:40:53Z

While looking to extend this to != I've discovered a bug where this PR as it stands changes the behavior of != so I'll back it out. I'll re-add this when I have a solution to both. Back out PR incoming.

nik9000 · 2025-05-19T18:36:23Z

Backporting with #128156

nik9000 added >enhancement :Analytics/ES|QL AKA ESQL v9.1.0 labels Apr 10, 2025

Update docs/changelog/126641.yaml

c6bb228

idegtiarenko reviewed Apr 11, 2025

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/stats/SearchContextStats.java

idegtiarenko approved these changes Apr 11, 2025

craigtaverner reviewed Apr 11, 2025

server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java

nik9000 added 4 commits April 11, 2025 15:30

Merge branch 'main' into esql_push_text_sub

523321a

Fix off by one

a5e2206

Merge remote-tracking branch 'nik9000/esql_push_text_sub' into esql_p…

db25092

Merge branch 'main' into esql_push_text_sub

f5ce702

Proper delegate

9bd4b89

nik9000 marked this pull request as ready for review April 14, 2025 17:41

elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Apr 14, 2025

nik9000 commented Apr 14, 2025

nik9000 and others added 5 commits April 14, 2025 14:37

Fix delegate for ignored fields

42b7a20

fmt

a2455e9

[CI] Auto commit changes from spotless

267632b

test!

254158b

Merge remote-tracking branch 'nik9000/esql_push_text_sub' into esql_p…

f872334

elasticsearchmachine and others added 5 commits April 14, 2025 19:04

[CI] Auto commit changes from spotless

1b34eb2

Explain

600257a

Merge remote-tracking branch 'nik9000/esql_push_text_sub' into esql_p…

3910ae9

Fix test

885598f

Merge branch 'main' into esql_push_text_sub

a6c7596

nik9000 added 3 commits April 18, 2025 11:05

Move test

4c082e9

Merge branch 'main' into esql_push_text_sub

6f4e5eb

Merge branch 'main' into esql_push_text_sub

deab4be

nik9000 requested a review from luigidellaquila April 18, 2025 18:03

nik9000 added 2 commits April 21, 2025 09:12

Merge branch 'main' into esql_push_text_sub

faa245d

Merge remote-tracking branch 'nik9000/esql_push_text_sub' into esql_p…

4a02fe1

Revert things we don't need

0f9b54f

ivancea approved these changes Apr 21, 2025

Merge branch 'main' into esql_push_text_sub

4125263

costin approved these changes Apr 21, 2025

Updates

ac558f3

nik9000 commented Apr 22, 2025

nik9000 mentioned this pull request Apr 22, 2025

keyword search in ESQL is too slow compared to KQL/DSL #104517

Closed

Merge branch 'main' into esql_push_text_sub

5e8430b

nik9000 enabled auto-merge (squash) April 22, 2025 18:24

nik9000 merged commit b527e4b into elastic:main Apr 22, 2025
16 of 17 checks passed

nik9000 mentioned this pull request Apr 22, 2025

ESQL: In main <text_field> != will incorrectly match long strings #127197

Closed

nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Apr 22, 2025

ESQL: Disable a bugged commit

7b77c9a

The PR elastic#126641 has a bug with `!=`.

nik9000 mentioned this pull request Apr 22, 2025

ESQL: Disable a bugged commit #127199

Merged

nik9000 added a commit that referenced this pull request Apr 23, 2025

ESQL: Disable a bugged commit (#127199)

ef0a177

The PR #126641 has a bug with `!=`.

nik9000 added a commit to nik9000/elasticsearch that referenced this pull request May 19, 2025

ESQL: Disable a bugged commit (elastic#127199)

40000c7

The PR elastic#126641 has a bug with `!=`.

nik9000 added the v8.19.0 label May 19, 2025

nik9000 merged 33 commits into elastic:main from nik9000:esql_push_text_sub

6 participants
