ESQL: Make field fusion generic #137382
Speeds up queries like

```
FROM foo
| STATS SUM(LENGTH(field))
```

by fusing the `LENGTH` into the loading of the `field` if it has doc values. Running a fairly simple test: https://gist.github.com/nik9000/9dac067f8ce29875a4fb0f0359a75091 I'm seeing that query drop from 48ms to 28ms. So, like, 40% faster.

More importantly, this makes the mechanism for fusing functions into field loading generic. All you have to do is implement `BlockLoaderExpression` on your expression and return non-null from `tryFuse`.
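To make the "implement the interface and return non-null" idea concrete, here is a minimal, self-contained sketch. Everything in it (`FusionSketch`, `Stats`, `Fused`, `Fusible`, `Length`, `plan`) is a hypothetical, simplified stand-in invented for illustration; it is not the real `BlockLoaderExpression` or `SearchStats` API, and it assumes that "has doc values" is the only condition for fusing.

```java
import java.util.Map;

// Hypothetical, simplified stand-ins for illustration; not the real Elasticsearch types.
final class FusionSketch {
    /** Stand-in for index statistics: records which fields have doc values. */
    record Stats(Map<String, Boolean> docValues) {
        boolean hasDocValues(String field) {
            return docValues.getOrDefault(field, false);
        }
    }

    /** Describes a function that has been fused into the load of one field. */
    record Fused(String field, String function) {}

    /** An expression that may be able to fuse itself into field loading. */
    interface Fusible {
        /** Returns the fusion, or null if fusion isn't possible. */
        Fused tryFuse(Stats stats);
    }

    /** LENGTH(field): in this sketch, fusible only when the field has doc values. */
    record Length(String field) implements Fusible {
        @Override
        public Fused tryFuse(Stats stats) {
            return stats.hasDocValues(field) ? new Fused(field, "LENGTH") : null;
        }
    }

    /** Generic rule: it never needs to know which function it is looking at. */
    static String plan(Fusible expression, Stats stats) {
        Fused fused = expression.tryFuse(stats);
        return fused == null
            ? "load field, then evaluate expression"
            : "load " + fused.function() + "(" + fused.field() + ") directly";
    }

    public static void main(String[] args) {
        Stats stats = new Stats(Map.of("kwd", true, "txt", false));
        System.out.println(plan(new Length("kwd"), stats));
        System.out.println(plan(new Length("txt"), stats));
    }
}
```

The point of the design is that the planning rule only calls `tryFuse`; each expression decides for itself whether it can be fused, so adding a new fusible function should not require touching the rule.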
Hi @nik9000, I've created a changelog YAML for you.
 * "fusing" the expression into the load. Or null if the fusion isn't possible.
 */
@Nullable
Fuse tryFuse(SearchStats stats);
Let's try to find another name - we already have Fuse as a command. ExpressionFieldLoader?
Is FusedExpression ok? Or still too indicative?
Naming... 😅
I come from staring at FUSE enough that it carries a lot of weight.
For me, this feature involves BlockLoaders. And Expressions that are applied to them. I understand that fuse means getting together those two, but it's not something I would think of immediately without more context.
I'd prefer to be overly explicit here, and call this BlockLoaderExpression or something similar that helps me bridge those two concepts together. But, naming...
    BlockLoaderExpression.Fuse fuse
) {
    // Only replace if exactly one side is a literal and the other a field attribute
    if ((similarityFunction.left() instanceof Literal ^ similarityFunction.right() instanceof Literal) == false) {
Nice! It's much better to let the Expression deal with the details and make this generic 👍
 */
public boolean pushable() {
    return true;
}
This bothers me. I needed this because without it we'd try to push this:
FROM foo
| WHERE LENGTH(kwd) < 10
to the index. Now, we might be able to do that with a specialized Lucene query. But we don't have one of those. Without this change, what happens instead is:
- `LENGTH(kwd)` becomes `$$kwd$length$hash$`.
- We identify `$$kwd$length$hash$ < 10` as pushable.

This tells us we can't push it. But it's kind of picky. If `SearchStats` took `EsField` it could check this easily enough. That might be a good solution to this.
The MultiTypeEsField is created with aggregatable=false, so that predicates on it don't get pushed down incorrectly.
Adding pushable should also work.
> Adding `pushable` should also work.

I'm going to see if I can do aggregatable=false
Just setting aggregatable to false doesn't do it. But I can return false from getExactInfo which seems to do the trick. I'm not entirely sure it's the best solution, but it doesn't invent a new thing.
But! I'm not sure that's right either. exact seems to be a concept we use at type resolution time - but I'm not sure why. It's a left-over from old QL that had a more useful meaning there.
I wonder if it'd be better to keep pushable and maybe rename to existsInEsIndex or something.
I've flipped this to using exact and that does seem to work. Not sure if I like it more.
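For readers following the `pushable` / `exact` back-and-forth, the underlying decision can be sketched roughly like this. The names here (`PushdownSketch`, `Field`, `LessThan`, `planFilter`) are hypothetical and exist only for illustration; the real check lives in the ESQL optimizer and, per the thread, ended up being expressed through the field's exact info rather than a dedicated flag.

```java
// Hypothetical illustration of the pushdown decision discussed in this thread.
final class PushdownSketch {
    /** A field descriptor; `pushable` is false for synthetic fused fields in this sketch. */
    record Field(String name, boolean pushable) {}

    /** A `field < limit` predicate. */
    record LessThan(Field field, int limit) {}

    /** Decides where the predicate is evaluated. */
    static String planFilter(LessThan predicate) {
        return predicate.field().pushable()
            ? "push to index"
            : "evaluate in compute engine";
    }

    public static void main(String[] args) {
        Field real = new Field("kwd", true);
        // The synthetic field only exists in the loader, not in the index,
        // so there is no index query that could answer a predicate on it.
        Field fused = new Field("$$kwd$length$hash$", false);
        System.out.println(planFilter(new LessThan(real, 10)));
        System.out.println(planFilter(new LessThan(fused, 10)));
    }
}
```

Whatever the flag ends up being called, the property it needs to capture is "this attribute does not exist in the index", which is why a name like `existsInEsIndex` was floated above.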
...sql/src/test/java/org/elasticsearch/xpack/esql/optimizer/LocalLogicalPlanOptimizerTests.java
Adds special purpose `BlockLoader` implementations for the `MV_MIN` and `MV_MAX` functions for `keyword` fields with doc values. These are a noop for single valued keywords but should be *much* faster for multivalued keywords. These aren't plugged in yet. We can plug them in and performance test them in elastic#137382. And they give us two more functions we can use to demonstrate elastic#137382.
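One way to see why a dedicated loader could be much faster for multivalued keywords: if the per-document values are already available in sorted order (as Lucene's sorted-set doc values provide them), the minimum is simply the first value and the maximum the last, so nothing else has to be loaded. The sketch below illustrates that idea with plain lists. It is an assumption about the approach, not a description of the actual `BlockLoader` code, and `MvMinMaxSketch` and its methods are hypothetical names.

```java
import java.util.List;

// Hypothetical illustration; plain lists stand in for per-document doc values.
final class MvMinMaxSketch {
    /** Generic path: every value is loaded, then scanned for the minimum. */
    static String minByScan(List<String> values) {
        String min = null;
        for (String v : values) {
            if (min == null || v.compareTo(min) < 0) {
                min = v;
            }
        }
        return min;
    }

    /** Specialized path: assumes `sorted` is ascending, so only the first value is read. */
    static String minFromSorted(List<String> sorted) {
        return sorted.isEmpty() ? null : sorted.get(0);
    }

    /** Specialized path: assumes `sorted` is ascending, so only the last value is read. */
    static String maxFromSorted(List<String> sorted) {
        return sorted.isEmpty() ? null : sorted.get(sorted.size() - 1);
    }

    public static void main(String[] args) {
        List<String> sorted = List.of("apple", "kiwi", "pear");
        System.out.println(minFromSorted(sorted) + " " + maxFromSorted(sorted));
    }
}
```

For a single-valued field both specialized methods return the one value, which matches the "noop for single valued keywords" remark above.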
}

public void testLengthInWhereAndEval() {
    assumeFalse("fix me", true);
QL friends: This one looks fun!
The reason that we get duplicated reference attributes here is that when PushExpressionsToFieldLoad creates a new FunctionEsField in EsRelation, it does so within the context of a specific command and doesn't look at the whole query plan. So when the same LENGTH(last_name) is referenced in multiple commands in the query, duplicate FunctionEsFields are added to EsRelation.
ResolveUnionTypes has a very similar workflow. It iterates through the entire query plan to prepare the attributes added into EsRelation
++, I'm rewriting this to look more like ResolveUnionTypes in #137392
> ++, I'm rewriting this to look more like ResolveUnionTypes in #137392

Should I wait for you to do that rewrite before merging this PR? Or should I merge first and then you'll fix it?
Up to you! I'm addressing in #137564, but it still has to be reviewed. Feel free to merge this and I'll deal with integrating it.
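The duplicate-attribute problem described in this thread, and the "look at the whole plan first" fix, can be sketched as follows. This is only an illustration of the deduplication idea under simplifying assumptions: `FusedFieldDedupSketch`, `FunctionCall`, `collect`, and the synthetic naming scheme are all hypothetical, and the real rewrite modeled on ResolveUnionTypes will differ.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration of collecting fused fields once per query plan.
final class FusedFieldDedupSketch {
    /** A function applied to a field, e.g. LENGTH(last_name). */
    record FunctionCall(String function, String field) {}

    /**
     * Walks every command in the plan and assigns one synthetic field name per
     * distinct call, so a call referenced by several commands is only added once.
     */
    static Map<FunctionCall, String> collect(List<List<FunctionCall>> commands) {
        Map<FunctionCall, String> fused = new LinkedHashMap<>();
        for (List<FunctionCall> command : commands) {
            for (FunctionCall call : command) {
                // The naming scheme is made up for this sketch.
                fused.computeIfAbsent(
                    call,
                    c -> "$$" + c.field() + "$" + c.function().toLowerCase(Locale.ROOT)
                );
            }
        }
        return fused;
    }

    public static void main(String[] args) {
        FunctionCall length = new FunctionCall("LENGTH", "last_name");
        // The same call appears in a WHERE and in an EVAL.
        System.out.println(collect(List.of(List.of(length), List.of(length))));
    }
}
```

Because records compare by value, the `LENGTH(last_name)` in the `WHERE` and the one in the `EVAL` map to the same key, which is the behavior a per-command rule cannot provide.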
server/src/main/java/org/elasticsearch/index/mapper/blockloader/BlockLoaderFunctionConfig.java
I am done with my first round of code review, overall looks pretty good!
I believe we already have these tests, but I'll double check and add a few more out of paranoia.
We have it for vectors. I'll look at it for string length. We'll want it for |
BASE=d657f7bef51da69d79134325ab5c3c5352ddf264 HEAD=05af8536e27b1e0c2d03d418fa19dc43f13b01e6 Branch=main
Hey folks. This is ready for another round. I'm going to add some more csv-spec tests this afternoon. I decided to go with @julian-elastic's first approach using the |
BASE=f08e7317360562458eec6fc609df81184ae53a9a HEAD=8504ed04897b23fa6781f37dd80f059965c6cd14 Branch=main
In elastic/elasticsearch#137382 we're pushing functions into field loading and using LENGTH as an example. This adds a rally track to demonstrate the performance difference:
```
| 90th | esql-avg-message-length | 11078 | 6670 | -4407.95 | ms | -39.79% |
```
Filter filter = as(eval2.child(), Filter.class);
And and = as(filter.condition(), And.class);
GreaterThan left = as(and.left(), GreaterThan.class);
I think it should be 2? I made one of the pushdowns on first name :) Just letting you know so you don't spend extra time debugging when working on #137679. You don't need to change it to 2 for this PR.
    );
}

public void testLengthNotPushedToText() throws IOException {
Why can't this optimization work with Text?
I'll push a comment to explain it, but the short version is that we haven't written the code yet. Text fields are loaded from `_source` and we've only implemented this optimization for loading from doc values. Worse, we've only implemented it for the particular kind of doc values that `keyword` uses. `wildcard` fields don't use the same encoding. We'd have to write another push down implementation for those.
julian-elastic left a comment
Looks good! Thank you for addressing my concerns! I left a few more small comments, but they can be addressed in the next PR.
I'll merge this now and open a follow up with some instructions and explanations based on @julian-elastic's last comments.
Implements most remaining block loaders for MV_MIN and MV_MAX. Once #137382 is in we can push MV_MIN and MV_MAX into the block loaders for most field types. This is compelling because it significantly reduces the amount of data loaded when using MV_MIN and MV_MAX.