ESQL: remove incorrect inline stats pruning#141056
ESQL: remove incorrect inline stats pruning#141056elasticsearchmachine merged 14 commits intoelastic:mainfrom
Conversation
3739cc4 to
ab355ef
Compare
|
Hi @astefan, I've created a changelog YAML for you. |
…139359_fix_take2
…search into 139359_fix_take2
|
Pinging @elastic/es-analytical-engine (Team:Analytics) |
| 347.99 | ||
| ; | ||
|
|
||
| inlineStatsGroupByNull_PruneDifferentInlineStats |
There was a problem hiding this comment.
This test covers both the bug fix here but also it's extended following #140027 merge
|
|
||
| public class PruneColumnsTests extends AbstractLogicalPlanOptimizerTests { | ||
|
|
||
| public void testPruneUnusedEval() { |
There was a problem hiding this comment.
The start of this file contains already existent tests from LogicalPlanOptimizerTests. There is another comment down below with the new tests pertaining to this PR fix only.
| } | ||
| } | ||
|
|
||
| /* |
There was a problem hiding this comment.
Tests relevant for this PR's fix.
|
|
||
| // Left branch: Project with avg, decades, idecades | ||
| var leftProject = as(inlineJoin.left(), Project.class); | ||
| assertThat(Expressions.names(leftProject.projections()), is(List.of("avg", "decades"))); |
There was a problem hiding this comment.
Here idecades has been projected away from the original version of this test method.
There was a problem hiding this comment.
The comment could be updated too.
alex-spies
left a comment
There was a problem hiding this comment.
Thanks a lot @astefan !
As discussed before, I can't currently give this a deep review, but the approach looks good to me and I'm happy about the added spec tests :)
| var right = pruneColumns(ij.right(), used, true); | ||
| if (right.output().isEmpty() || isLocalEmptyRelation(right)) { | ||
|
|
||
| if (right.outputSet().subtract(ij.references()).isEmpty() || isLocalEmptyRelation(right)) { |
There was a problem hiding this comment.
nit: I'd add a comment: "ij.references() are the join keys - the inline join doesn't add new columns. Since it preserves rows, it doesn't do anything and can be pruned."
There was a problem hiding this comment.
Would this only happen when we are shadowing variables in the inline stats outside?
Example:
FROM employees
| KEEP emp_no, salary, birth_date
| EVAL orig_emp_no = emp_no
| INLINE STATS emp_no = MAX(birth_date) BY salary
| EVAL emp_no = birth_date
Or are there more cases?
There was a problem hiding this comment.
Shadowing and DROP'ping. Maybe @astefan is aware of more cases.
There was a problem hiding this comment.
This is about all use cases where what inline stats creates (meaning the aggs themselves) are "missing" after the inline stats command. This can be drop, keep [other columns and not the ones created by inline stats], eval, another stats (maybe others).
The other scenario (|| isLocalEmptyRelation(right)) comes from scenarios where the output of the right hand side is completely empty, meaning we don't need anything from what inline stats is using. In your specific query, for example, if I would do drop emp_no, salary then the inline stats itself can be pruned entirely. And I think that PruneEmptyPlans does that by creating a LocalRelation with an EMPTY supplier and the inline stats command itself gets to be pruned here.
There was a problem hiding this comment.
The only subtlety is that even a pruned inline stats affects the column order if there is a BY clause and the grouping does not get dropped:
FROM employees
| KEEP emp_no, languages, salary
| INLINE STATS s = max(salary) BY languages
| DROP s
-> output:
emp_no | salary | languages
The INLINE STATS moves the groupings to the right hand side, compared to just
FROM employees
| KEEP emp_no, languages, salary
| * \_Filter[salary{f}#16 < 10000[INTEGER] AND salary{f}#16 > 10000[INTEGER]] | ||
| * \_EsRelation[employees][_meta_field{f}#17, emp_no{f}#11, first_name{f}#12, ..] | ||
| */ | ||
| public void testPushDown_ImpossibleFilter_PastInlineJoin() { |
There was a problem hiding this comment.
Btw, @GalLalouche 's golden tests can be used now - removing the toil required for writing/updating all the assertions below. It's a great time saver :)
There was a problem hiding this comment.
I am an adept of seeing and checking what I see in the same file :-).
There was a problem hiding this comment.
With #141082, the query that is being checked also gets put into the dir with the expected outcome, so you can just zip through the files during review. Worked great for me.
| | INLINE STATS salary = AVG(salary) BY salary | ||
| | WHERE salary > 10000 |
There was a problem hiding this comment.
That's a sneaky one. Nice test!
There was a problem hiding this comment.
Adding here the reproducer from the duplicated PR, as it comes from a different machanism (constant false WHERE in agg, ReplaceStatsFilteredOrNullAggWithEval):
FROM employees
| INLINE STATS x = MAX(salary) WHERE false, c = COUNT(*) BY emp_no
| KEEP x, c
| SORT x, c
| LIMIT 3
;
x:integer | c:long
null | 1
null | 1
null | 1
;
bpintea
left a comment
There was a problem hiding this comment.
What a nice simplification! Thanks Andrei.
ncordon
left a comment
There was a problem hiding this comment.
I left some questions, nothing major, it's mainly for my own understanding.
The change makes sense from my viewpoint because of two reasons: we are having into account variables that get used inside the right part of inline join that we didn't before and we are being less aggressive when pruning.
| } | ||
|
|
||
| /* | ||
| * https://github.com/elastic/elasticsearch/issues/138283 |
There was a problem hiding this comment.
Should we delete this line?
| var right = pruneColumns(ij.right(), used, true); | ||
| if (right.output().isEmpty() || isLocalEmptyRelation(right)) { | ||
|
|
||
| if (right.outputSet().subtract(ij.references()).isEmpty() || isLocalEmptyRelation(right)) { |
There was a problem hiding this comment.
Would this only happen when we are shadowing variables in the inline stats outside?
Example:
FROM employees
| KEEP emp_no, salary, birth_date
| EVAL orig_emp_no = emp_no
| INLINE STATS emp_no = MAX(birth_date) BY salary
| EVAL emp_no = birth_date
Or are there more cases?
| * | \_EsRelation[union_types_index*][!first_name, id{f}#11, !languages, !last_name, !sal..] | ||
| * \_EsRelation[languages_lookup][LOOKUP][language_code{f}#13] | ||
| */ | ||
| public void testPruneColumnsInProject_WithLookupJoin() { |
There was a problem hiding this comment.
Should we add some csv specs for these as well?
There was a problem hiding this comment.
The csv spec tests are not relevant for the functionality of this PR change when it comes to Project pruning of union types related fields (see more here). Those fields were already not present in the output of these types of queries. What this PR does is something "invisible" in the output, meaning fields like !first_name, !languages, !last_name are not present in the Project sitting atop of EsRelation on the right-hand side of the LookupJoin node. Before this PR that Project had wrongly projecting those fields as well.
There was a problem hiding this comment.
If you look closely at these tests, they are a bit different from the rest of tests regarding LogicalPlanOptimizer: they are directly calling PruneColumns rule (vs. running the full list of optimization rules batches).
| var eval = as(extract.child(), EvalExec.class); | ||
| var source = as(eval.child(), EsQueryExec.class); | ||
| var source = as(extract.child(), EsQueryExec.class); |
There was a problem hiding this comment.
Specifically I don't understand why the INLINE STATS prunning or the Project prunning affect this test
There was a problem hiding this comment.
Because _EvalExec[[null[INTEGER] AS languages#21]] is not needed anymore, since it is not used afterwards. languages#21 is projected with \_ProjectExec[[emp_no{f}#18, languages{r}#21 AS language_code#7, first_name{f}#19]] but afterwards, languages#21 and/or language_code#7 is not re-used anywhere else. It is overiden by language_code{r}#12 that grok creates.
💔 Backport failed
You can use sqren/backport to manually backport by running |
Removes incorrect pruning in `inline stats` that lacked taking into consideration the fact that groupings can be multi-valued. Fixes elastic#139359 (comment) (grouping by an mv fields can lead to inconsistencies in certain queries) Fixes elastic#139359 Fixes elastic#140757 This also extracts a bunch of tests methods from the huge `LogicalPlanOptimizerTests` and moves them to a new test class - `PruneColumnsTests` which should include most of the tests that have in the spotlight the `PruneColumns` optimization rule. (cherry picked from commit 244538b)
* ESQL: remove incorrect inline stats pruning (#141056) Removes incorrect pruning in `inline stats` that lacked taking into consideration the fact that groupings can be multi-valued. Fixes #139359 (comment) (grouping by an mv fields can lead to inconsistencies in certain queries) Fixes #139359 Fixes #140757 This also extracts a bunch of tests methods from the huge `LogicalPlanOptimizerTests` and moves them to a new test class - `PruneColumnsTests` which should include most of the tests that have in the spotlight the `PruneColumns` optimization rule. (cherry picked from commit 244538b) * Update * Adapt the fix to 9.3.0
Removes incorrect pruning in
inline statsthat lacked taking into consideration the fact that groupings can be multi-valued.Fixes #139359 (comment) (grouping by an mv fields can lead to inconsistencies in certain queries)
Fixes #139359
Fixes #140757
This also extracts a bunch of tests methods from the huge
LogicalPlanOptimizerTestsand moves them to a new test class -PruneColumnsTestswhich should include most of the tests that have in the spotlight thePruneColumnsoptimization rule.