ESQL: Pull OrderBy followed by InlineJoin on top of it#137648
ESQL: Pull OrderBy followed by InlineJoin on top of it#137648bpintea merged 10 commits intoelastic:mainfrom
OrderBy followed by InlineJoin on top of it#137648Conversation
Add optimisation rule to pull OrderBy above InlineJoin. This is required since otherwise the SORT won't be moved out of the left hand-side of the join, ending up as a stand-alone node that can't be turned into an executable -- it'll be rejected by verification already as an "unbounded sort". Since the InlineJoin is sort agnostic, the OrderBy can be moved upwards past it, ending up as a top TopN.
|
Hi @bpintea, I've created a changelog YAML for you. |
|
Pinging @elastic/es-analytical-engine (Team:Analytics) |
astefan
left a comment
There was a problem hiding this comment.
I like the tests, and I always like more tests :-).
The fact that this PR pulls up OrderBy passed InlineJoin without "using" or be made aware by the existence of SortAgnostic (as opposed to the original PR which considered SortAgnostic for something related to OrderBy) shows that the presence of SortAgnostic is not well enforced in code (for the developer's visibility and awareness), something that (reading between lines) has been made clear here. In other words, SortAgnostic defies half of its purpose - to be a tool developers are aware of (or be made aware of reliably) when they develop new commands but also (my own addition here) when developers (new or seasoned ones) touch any planning code (not only related to the new command).
The other half of SortAgnostics purpose is well defined: TRYING to reveal issues with using a command that doesn't fully considers all the subtleties and implication of a sort before that command that might silently give wrong results. But, this holds true considering the following:
- the new command PR has good and, as much as possible, enough tests that reveal what
SortAgnosticis trying to guard - the new command PR is being reviewed by people who are aware of
sorttweaks andSortAgnosticpresence and can point out thatsortcan be something to look into further, something that is not always true these days
IMHO, just like my previous review to the original PR, is that this PR brings a much needed improvement to the intricacies of inline stats + sort usage: an irrelevant sort should be pulled up passed inline stats because inline stats right-hand side should use the "original" relevant set of data (and an unbounded sorted set of data is irrelevant) and should be considered for adoption.
I still need to dig through the surgical take from rewriteMidProjections method and see why it's so surgical :-), but this is for the second round of review.
| * <p> | ||
| * See also {@link PruneRedundantOrderBy}. | ||
| */ | ||
| public final class PullUpOrderByBeforeInlineJoin extends OptimizerRules.OptimizerRule<LogicalPlan> { |
There was a problem hiding this comment.
I think it makes sense for this rule to be OptimizerRules.CoordinatorOnly.
There was a problem hiding this comment.
Yes, I can't find a scenario where this wouldn't need to run on coordinator, but could run on the data node. Will update.
| */ | ||
| public void testInlineJoinPrunedAfterSortAndLookupJoin() { | ||
| var query = """ | ||
| FROM airports |
There was a problem hiding this comment.
This one is interesting in that if I add count to the keep I start getting Unbounded SORT not supported yet [SORT abbrev DESC] please add a LIMIT\nline 31:15: INLINE STATS [INLINE STATS count=COUNT(*) BY scalerank] cannot yet have an unbounded SORT [SORT abbrev DESC] before it : either move the SORT after it, or add a LIMIT before the SORT which might be correct, but it is confusing to the user. SORT abbrev was there before when count was not kept.
We eliminate the inline stats because we don't really need (and the sort is irrelevant from the point of view of inline stats) it but the outcome is definitely confusing to the user.
| 10029 |4 |null |74999 | ||
| ; | ||
|
|
||
| mixedShadowingInlineStatsAfterSort |
There was a problem hiding this comment.
I would add a variant of this test, as well:
FROM employees
| KEEP salary, emp_no, first_name, gender
| EVAL old_salary = salary
| SORT old_salary, emp_no
| DROP old_salary
| INLINE STATS salary = COUNT(*) BY gender
| LIMIT 5
There was a problem hiding this comment.
And another one:
FROM employees
| KEEP salary, emp_no, first_name, gender
| EVAL old_salary = salary
| SORT old_salary, emp_no
| DROP old_salary
| EVAL salary = salary / 1000
| RENAME salary AS s
| SORT s, first_name
| INLINE STATS s = COUNT(*) BY gender
| LIMIT 5
| protected LogicalPlan rule(LogicalPlan plan) { | ||
| return plan.transformUp(LogicalPlan.class, PullUpOrderByBeforeInlineJoin::pullUpOrderByBeforeInlineJoin); | ||
| } | ||
|
|
||
| private static LogicalPlan pullUpOrderByBeforeInlineJoin(LogicalPlan plan) { | ||
| if (plan instanceof InlineJoin inlineJoin) { | ||
| Holder<OrderBy> orderByHolder = new Holder<>(); | ||
| inlineJoin.forEachDownMayReturnEarly((node, breakEarly) -> { | ||
| if (node instanceof OrderBy orderBy) { | ||
| orderByHolder.set(orderBy); | ||
| breakEarly.set(true); | ||
| } else { | ||
| breakEarly.set(isSortBreaker(node)); | ||
| } | ||
| }); | ||
|
|
||
| OrderBy orderBy = orderByHolder.get(); | ||
| plan = orderBy == null ? plan : pullUpOrderByBeforeInlineJoin(inlineJoin, orderBy); | ||
| } | ||
| return plan; | ||
| } |
There was a problem hiding this comment.
| protected LogicalPlan rule(LogicalPlan plan) { | |
| return plan.transformUp(LogicalPlan.class, PullUpOrderByBeforeInlineJoin::pullUpOrderByBeforeInlineJoin); | |
| } | |
| private static LogicalPlan pullUpOrderByBeforeInlineJoin(LogicalPlan plan) { | |
| if (plan instanceof InlineJoin inlineJoin) { | |
| Holder<OrderBy> orderByHolder = new Holder<>(); | |
| inlineJoin.forEachDownMayReturnEarly((node, breakEarly) -> { | |
| if (node instanceof OrderBy orderBy) { | |
| orderByHolder.set(orderBy); | |
| breakEarly.set(true); | |
| } else { | |
| breakEarly.set(isSortBreaker(node)); | |
| } | |
| }); | |
| OrderBy orderBy = orderByHolder.get(); | |
| plan = orderBy == null ? plan : pullUpOrderByBeforeInlineJoin(inlineJoin, orderBy); | |
| } | |
| return plan; | |
| } | |
| protected LogicalPlan rule(LogicalPlan plan) { | |
| return plan.transformUp(InlineJoin.class, ij -> pullUpOrderByBeforeInlineJoin(ij)); | |
| } | |
| private static LogicalPlan pullUpOrderByBeforeInlineJoin(InlineJoin inlineJoin) { | |
| Holder<OrderBy> orderByHolder = new Holder<>(); | |
| inlineJoin.forEachDownMayReturnEarly((node, breakEarly) -> { | |
| if (node instanceof OrderBy orderBy) { | |
| orderByHolder.set(orderBy); | |
| breakEarly.set(true); | |
| } else { | |
| breakEarly.set(isSortBreaker(node)); | |
| } | |
| }); | |
| OrderBy orderBy = orderByHolder.get(); | |
| return orderBy == null ? inlineJoin : pullUpOrderByBeforeInlineJoin(inlineJoin, orderBy); | |
| } |
There was a problem hiding this comment.
Oh yeh, fixed, thanks.
| return plan.transformUp(lp -> { | ||
| if (lp == eval) { | ||
| evalVisited.set(true); | ||
| } else if (lp instanceof Project project && evalVisited.get()) { |
There was a problem hiding this comment.
I didn't find a problem, but I wanted to check the aggregation (since it's considered to be a variant of projection) and I would like to propose unit test/csv test with (I know you have one unit test with sort ... stats ... inline stats):
FROM employees
| KEEP salary, emp_no, first_name, gender
| SORT salary
| STATS salary = MAX(salary) BY gender
| SORT salary DESC
| INLINE STATS s = COUNT(*) BY gender
| LIMIT 5
and a slightly different variant:
FROM employees
| KEEP salary, emp_no, first_name, gender
| SORT salary
| STATS salary = MAX(salary) BY gender
| INLINE STATS s = COUNT(*) BY gender
| LIMIT 5
There was a problem hiding this comment.
Sure, added them (though the last one needs a trailing sort for stability in CSV tests (only)).
There was a problem hiding this comment.
Slightly superficial review, but I think the approach makes sense. There's very likely a hidden problem with projects that's worth looking into, even though it may not affect current production queries at this time. Update: There was no problem, my mistake.
And we need to align on opt-in vs. opt-out optimizer rules and the use of interfaces. For now, I'd really like us to use interfaces to opt-in to optimizer rules so that correctness is guaranteed rather than finding out later, when more commands get added to ESQL, that we ran queries that always returned wrong results to the user because a new command silently opted into an optimization (like hoisting a sort before inline stats while the new command is in between) that it doesn't actually work with.
At the very least, let's be consistent: we've already started using opt-in interfaces and have 3 (or more) rules that use them; having an opt-out list of commands as part of the optimizer rule just for 1 rule would be confusing and breaking the pattern.
| * <p> | ||
| * See also {@link PruneRedundantOrderBy}. | ||
| */ | ||
| public final class PullUpOrderByBeforeInlineJoin extends OptimizerRules.OptimizerRule<LogicalPlan> { |
There was a problem hiding this comment.
nit: we've started calling "push up" "hoist" in other rules.
There was a problem hiding this comment.
Noticed; not a fan. :)
We have preexisting "push down", not "lower" or "drop" and when listing the rules one would thus look up and grasp quicker "pull up" than "hoist" (which is also not "plain English").
But OK, since we have committed "hoist" already, I'm aligning this to it, but would still campaign on changing to "pull up" (and thus renaming the existing "hoist" rules, plus this one).
There was a problem hiding this comment.
Nothing against renaming "hoist" to "pullUp" or so. As long as we're consistent :)
| /** | ||
| * Returns `true` if the {@code plan}'s position cannot be swapped with a SORT, `false` otherwise. | ||
| */ | ||
| private static boolean isSortBreaker(LogicalPlan plan) { |
There was a problem hiding this comment.
I think this needs to become an interface; a list of instanceofs is too easy to go stale. It's also inconsistent that multiple optimizer rules already use interfaces that allow logical plan nodes to opt into the optimizations, but here we have an instanceof list instead.
E.g. Aggregate will become a sort breaker most likely once we support window functions; when someone who's unaware of this rule works on window functions, they are almost guaranteed to miss this and the problem will only surface through sufficient test coverage by humans - so the reviewer would have to keep this in mind and say "hey, but what about inline stats with window functions when there is an agg before the inline stats and before that there's a sort"? That makes reviews pretty hard.
I also think the interface should be one that opts into this optimizer rule, not one that opts out of it.
If we're worried about the zoo of interfaces being too large to manage by non-expert devs, we can have an abstract super class StandardLogicalPlan (probably with a much better name - RowStreamingLogicalPlan or so, for commands like Eval, Grok, Enrich etc.?) that extends all the opt-in interfaces that most plans would normally work with; and then new commands would generally just extend this class unless they're special enough.
There was a problem hiding this comment.
I think this needs to become an interface
I agree. :) This is why the discussion around SortAgnostic (which I was hoping to be able to use in this logic's 1st attempt and then in a reshape of that) and I suppose Andrei's comment above.
If we're worried about the zoo of interfaces being too large to manage by non-expert devs
I just hope we don't end up introducing interfaces that are only relevant for just one rule. I think that's not a great design and it'll have unpleasant consequences for most folks. (There's more to touch on here, we can do that in #137543.)
I'm adding a new interface here, but I hope we can fuse this somehow with SortAgnostic.
I also think the interface should be one that opts into this optimizer rule, not one that opts out of it.
These interfaces are basically boolean-stand-ins, so they're considered either way, but you probably suggest to make it "omission" safe (i.e. fail execution rather than wrong result). I've thus added a SortPreserving (in case one sort breaking new command wouldn't implement it, the rule wouldn't advance past such a command to fetch an OrderBy to pull up above InlineJoin and thus leave the query unexecutable, so that'd be safe w.r.t. correctness).
But I can also imagine the usages of these interfaces might end up having "conflicting" safety in different rules (i.e. one rule could be safe as an opt-in in one rule, but opt-out-safe in another). So we might need to put some additional guard-rails in place (possibly à la https://github.com/elastic/elasticsearch/pull/137543/files#diff-92b2b661404e647b0d14884daff5d4b666e17c994a02f3cae08f44303ed0dd7f).
There was a problem hiding this comment.
I just hope we don't end up introducing interfaces that are only relevant for just one rule.
I don't think there's an issue with interfaces that work with just one rule. But sure, if we can simplify by merging two interfaces together if they serve similar purposes, let's go :)
"omission" safe (i.e. fail execution rather than wrong result).
Yes, exactly! Thank you :)
i.e. one rule could be safe as an opt-in in one rule, but opt-out-safe in another
That's fair. I guess the general guidance should be "the optimizer rule takes a valid plan and produces a valid plan, even when ESQL evolves and we don't know all plan nodes in the query". Silently wrong results are just too dangerous, and we've seen with e.g. remote enrich planning how we can accidentally provide wrong results for queries that couldn't actually be executed correctly.
...java/org/elasticsearch/xpack/esql/optimizer/rules/logical/PullUpOrderByBeforeInlineJoin.java
Outdated
Show resolved
Hide resolved
...java/org/elasticsearch/xpack/esql/optimizer/rules/logical/PullUpOrderByBeforeInlineJoin.java
Outdated
Show resolved
Hide resolved
alex-spies
left a comment
There was a problem hiding this comment.
Thanks for the iteration, Bogdan! Looking even nicer now!
|
Thanks Andrei and Alex! |
Add optimisation rule to pull
OrderByaboveInlineJoin. This is required since otherwise theOrderBywon't be moved out of the left hand-side of the join, ending up as a stand-alone node that can't be turned into an executable -- it'll just be rejected by the verification as an "unbounded sort". Since theInlineJoinis sort agnostic, theOrderBycan be moved upwards past it, ending up as a topTopN.This doesn't entirely solve the issue of unbounded sorts, however: some nodes, such as
MV_EXPANDorLOOKUP JOINbreak the inbound sort, so pulling aSORTon top of them isn't possible. In this cases the query will still fail. This is notINLINE JOINspecific, though.This is a revival of #132417, but with a new approach on how to deal with the attributes used by
SORT, but dropped or overshadowed by the timeINLINE JOINis reached. In this case, we'll add temporary internal attributes kept until the INLINE JOIN and dropped afterwards.Related #124715 (#113727)
Related #124721 (#133120)