ES|QL: Fix wrong pruning of plans with no output columns#133405
ES|QL: Fix wrong pruning of plans with no output columns#133405luigidellaquila merged 36 commits intoelastic:mainfrom
Conversation
|
Hi @luigidellaquila, I've created a changelog YAML for you. |
| new PruneRedundantSortClauses(), | ||
| new PruneLeftJoinOnNullMatchingField() | ||
| new PruneLeftJoinOnNullMatchingField(), | ||
| new PruneEmptyAggregates() |
There was a problem hiding this comment.
We need this because our aggs don't know how to deal with no grouping and no aggs at the same time.
| // there cannot be an empty list of fields, we'll ask the simplest and lightest one instead: _index | ||
| return new PreAnalysisResult(enrichResolution, IndexResolver.INDEX_METADATA_FIELD, wildcardJoinIndices); | ||
| } else { | ||
| fieldNames.add(MetadataAttribute.INDEX); |
There was a problem hiding this comment.
This is the actual fix.
There was a problem hiding this comment.
Here I think it would be more correct to add this field after FieldNameUtils::withSubfields is called on the next line. _index.* doesn't actually make sense, as we know there is only on _index metadata field.
|
Pinging @elastic/es-analytical-engine (Team:Analytics) |
ivancea
left a comment
There was a problem hiding this comment.
LGTM!
However, I would wait for Andrei/Alex to check it, as the _index fix was initially discarded in favor of a more complete approach when the issue was opened.
This PR covers more areas than the initial approach though, and initially this wasn't a common issue
x-pack/plugin/esql/qa/testFixtures/src/main/resources/drop.csv-spec
Outdated
Show resolved
Hide resolved
...c/internalClusterTest/java/org/elasticsearch/xpack/esql/action/CrossClusterLookupJoinIT.java
Show resolved
Hide resolved
…to esql/fix_no_columns
There was a problem hiding this comment.
I think it looks good. Some comments:
from employees | keep salary | eval c = 1 | drop c, salary
returns
{
"took": 29,
"is_partial": false,
"documents_found": 100,
"values_loaded": 0,
"columns": [],
"values": [
[],
[],
[],
[],
[],
[],
[],
[],
[],
....
This, to me at least (as a regular user of ESQL), seems surprising. This represents a bunch of "empty" virtual rows so to speak. This seems like an implementation detail that leaks into UX, I would have expected the "no columns and Xrows" use case to be handled in a more user friendly way. This also saves some KBs of comm data if potentially this is performed on a large data set.
Conceptually speaking, we are introducing the notion of "nothing/no columns + X rows", but the impact of this concept in the UX is a bit too intrusive imho.
- looking at the logs of this query
ProjectExec[[<all-fields-projected>{r$}#204]]
\_EvalExec[[null[NULL] AS <all-fields-projected>#204]]
\_EsQueryExec[employees], indexMode[standard], [_doc...
This "no columns + X rows" is represented as "" and it is a query that reaches ES in a particular way. I am wondering if this couldn't be, more performant, be implemented as a count, meaning we would ask from ES a count of rows and, potentially, on the coordinator node to "expand" this result in a "count" number of "empty" rows.
But this can regarded as food for thought for the future.
- I would add the following test, as well. This is, essentially, an empty result set but done with, also, dropping all columns:
from employees | keep salary | eval c = 1 | drop c, salary | where false
- another test(s) to add are those related to
inline statsbecause it is also using aggregates (which were changed as part of this PR).
A simple query, which adds stuff to rows, not reduces them (like stats does) is
from employees | keep salary | drop salary | eval x = 1 | inline stats count(), max(x) | limit 3
Another one with by as well: from employees | keep salary | drop salary | eval x = 1, y = 1 | inline stats count(), max(x) by y | limit 3
-
there is also the story around
drop *which is not allowed atm.
Should we allow it now, that we can handle a "drop all" scenario? Again, something to be put in a separate issue and have a discussion maybe about it. -
I would also add something that goes to more than one index:
from employees* | keep salary | drop salary | eval x = 1 | stats count()
- [LATER EDIT] Some tests that use
forkwould help:
from employees* | keep emp_no | drop emp_no | fork (stats c=count()) (stats c=count())
from employees* | keep emp_no | drop emp_no | fork (stats c=count()) (stats c=count()) | drop _fork*, c
from employees* | keep emp_no | drop emp_no | fork (stats c=count()) (stats c=count()) | drop _fork*, c | stats c=count()
| blocks[i++] = column.values(); | ||
| } | ||
| LocalSupplier supplier = LocalSupplier.of(blocks); | ||
| LocalSupplier supplier = LocalSupplier.of(blocks.length > 0 ? new Page(blocks) : new Page(0)); |
There was a problem hiding this comment.
I am wondering if we could integrate the logic "if the blocks size is 0 then new Page(0) otherwise new Page(blocks)" somehow with LocalSupplier.
There was a problem hiding this comment.
We don't want this in general, in some cases we want no blocks but new Page(N)
| fields.forEach(f -> values.add(f.child().fold(context.foldCtx()))); | ||
| var blocks = BlockUtils.fromListRow(PlannerUtils.NON_BREAKING_BLOCK_FACTORY, values); | ||
| return new LocalRelation(row.source(), row.output(), new CopyingLocalSupplier(blocks)); | ||
| return new LocalRelation(row.source(), row.output(), new CopyingLocalSupplier(blocks.length == 0 ? new Page(0) : new Page(blocks))); |
There was a problem hiding this comment.
Same here about this common logic where a Page is built this way.
|
|
||
| public CopyingLocalSupplier(Block[] blocks) { | ||
| delegate = new ImmediateLocalSupplier(blocks); | ||
| public CopyingLocalSupplier(Page page) { |
There was a problem hiding this comment.
The javadoc of this class needs an update given the change of the constructor?
|
|
||
| import java.util.List; | ||
|
|
||
| public final class PruneEmptyAggregates extends OptimizerRules.OptimizerRule<Aggregate> { |
There was a problem hiding this comment.
It would help to have a javadoc on this rule. I mean, it is obvious what it does, but it would be helpful to read about the situations that lead to an Aggregate with no aggregates and no groupings.
| // for the moment support pushing count just for one field | ||
| List<EsStatsQueryExec.Stat> stats = tuple.v2(); | ||
| if (stats.size() > 1) { | ||
| if (stats.size() != 1) { |
There was a problem hiding this comment.
It's because now all the stats can be pruned, so also the case with size == 0 has to be taken into consideration
| // there cannot be an empty list of fields, we'll ask the simplest and lightest one instead: _index | ||
| return new PreAnalysisResult(enrichResolution, IndexResolver.INDEX_METADATA_FIELD, wildcardJoinIndices); | ||
| } else { | ||
| fieldNames.add(MetadataAttribute.INDEX); |
There was a problem hiding this comment.
Here I think it would be more correct to add this field after FieldNameUtils::withSubfields is called on the next line. _index.* doesn't actually make sense, as we know there is only on _index metadata field.
| ChangePointGenerator.INSTANCE, | ||
| DissectGenerator.INSTANCE, | ||
| DropGenerator.INSTANCE, | ||
| DropAllGenerator.INSTANCE, |
|
Thanks for the review @astefan, I just pushed a commit that incorporate your suggestions. Please let me know if you have further remarks or if you think it's worth discussing it in a wider audience. When it's done, I'll open two follow-up issues, one to allow |
astefan
left a comment
There was a problem hiding this comment.
LGTM. Thanks @luigidellaquila
Fixes: #120272
Fixing queries where all the columns from the original index are projected out.
The expected outcome is that the number of rows is preserved
This means that a query like
that now returns nothing
will start returning values:
This will also apply to aggregations, eg.
that now returns
0, will start returning the actual count of the rows (3in this case)I don't know if it's breaking (IMHO it's a bug, but it still changes the behavior).
TODO