ESQL: Consider inlinestats when having field_caps check for field names #127564
astefan merged 16 commits into elastic:main
…field_names_for_inlinestats_fix
Pinging @elastic/es-analytical-engine (Team:Analytics)
Hi @astefan, I've created a changelog YAML for you.
| LIMIT 3
;

abbrev:keyword | city:keyword | region:text | "COUNT(*)":long
Unrelated to this PR, but to previous work.
assumeFalse("INLINESTATS not yet supported in CCS", testCase.requiredCapabilities.contains(INLINESTATS_V2.capabilityName()));
assumeFalse("INLINESTATS not yet supported in CCS", testCase.requiredCapabilities.contains(JOIN_PLANNING_V1.capabilityName()));
assumeFalse("INLINESTATS not yet supported in CCS", testCase.requiredCapabilities.contains(INLINESTATS_V5.capabilityName()));
assumeFalse("INLINESTATS not yet supported in CCS", testCase.requiredCapabilities.contains(INLINESTATS_V7.capabilityName()));
V7 because I am trying to work on multiple separate issues. V6 should come from #127383
Hi @astefan, I've updated the changelog YAML for you.
…astefan/elasticsearch into field_names_for_inlinestats_fix
…field_names_for_inlinestats_fix
Hi @astefan, I've updated the changelog YAML for you.
…astefan/elasticsearch into field_names_for_inlinestats_fix
alex-spies left a comment
Thanks @astefan ! The fix works and the added tests are nice. I found 2 buggy queries, but they are likely unrelated to this PR's work.
I think this solution is okay, but I'd prefer to avoid adding more complexity to the fieldNames method by special-casing for INLINESTATS. This PR is required because we parse INLINESTATS as an InlineStats node containing an Aggregate child (which, in turn, contains the previous commands as its descendants). Therefore, I'd like to suggest another approach which changes how we represent a parsed INLINESTATS - see below.
Heya, I tried some queries, trying to break things. I noticed 2 bugs which may or may not be related to this PR:
FROM hosts METADATA _index | eval x = ip1 | INLINESTATS ip1 = COUNT(*) BY host_group, card | SORT ip1 | LIMIT 1
gives an empty result, but removing the eval x = ip1 makes it work.
FROM hosts METADATA _index | INLINESTATS card = COUNT(*) BY card | SORT card | LIMIT 1
description | host | host_group | ip0 | ip1 | _index | card
---------------+---------------+---------------+---------------+---------------+---------------+---------------
alpha db server|alpha |DB servers |127.0.0.1 |127.0.0.1 |hosts |eth0
The card column has the wrong type, it should be a long - seems like we get the original index field here, instead.
I've added this test to the suite. Data types are OK in my tests, but there are other things wrong with that query. I've added details about the failure to the csv test suite.
List<LogicalPlan> inlinestats = parsed.collect(InlineStats.class::isInstance);
Set<Aggregate> inlinestatsAggs = new HashSet<>();
for (var i : inlinestats) {
    inlinestatsAggs.add(((InlineStats) i).aggregate());
}
The required solution here looks correct but confusing; this is because we parse INLINESTATS as an InlineStats node containing an Aggregate node as child, so we don't know for any given Aggregate if it's a STATS or an INLINESTATS, and the two have very different semantics.
I think we should rather just parse INLINESTATS as a single plan node - this would prevent this complexity.
Maybe consider refactoring the InlineStats node to avoid adding complexity here, as the fieldNames method is already hard to work with. A low effort fix would be to still have the InlineStats wrap an Aggregate, but not as its child - the actual child would be the preceding command.
More generally, I wonder if there's an abstraction just around the corner that would do away with more special-casing inside this method.
In terms of the sets of attributes before and after INLINESTATS, it behaves similarly to EVAL, DISSECT, GROK, ENRICH and COMPLETION: some attributes are required because they are being referred to, some attributes are newly added and they shadow previous attributes. In the optimizer, we leverage this fact in the push down rules; for this, the plan nodes just need to implement the GeneratingPlan interface.
I think it'd be nice to move this method in a direction that would rely more on this general pattern.
That's out of scope for this PR, of course, but it'd also benefit from parsing INLINESTATS simply as 1 node rather than a combination of 2 nodes.
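The shadowing pattern described in this comment can be sketched roughly as follows. This is a toy model and not the GeneratingPlan interface itself: Step, references, generates and requiredFromIndex are hypothetical names, and each command is reduced to the set of names it reads and the set of names it adds.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class GeneratingSketch {
    // Hypothetical stand-in for a plan node: the names it reads and the
    // names it newly adds (which shadow any earlier attribute of that name).
    record Step(String command, Set<String> references, Set<String> generates) {}

    // Walk from the last command back to the source. A generated name no
    // longer has to come from the index; a referenced name does.
    static Set<String> requiredFromIndex(List<Step> steps, Set<String> wantedAtEnd) {
        Set<String> needed = new TreeSet<>(wantedAtEnd);
        for (int i = steps.size() - 1; i >= 0; i--) {
            Step step = steps.get(i);
            needed.removeAll(step.generates());
            needed.addAll(step.references());
        }
        return needed;
    }

    public static void main(String[] args) {
        List<Step> steps = List.of(
            new Step("EVAL x = ip1", Set.of("ip1"), Set.of("x")),
            new Step("INLINESTATS c = COUNT(*) BY host_group", Set.of("host_group"), Set.of("c"))
        );
        // prints [host_group, ip1]
        System.out.println(requiredFromIndex(steps, Set.of("x", "c")));
    }
}
```

In this model INLINESTATS needs no special case: it is handled by the same two set operations as EVAL.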
Those are some good points (the use of GeneratingPlan and refactoring InlineStats), but I need more time to dig through these to prove these are valid changes to make. IMHO, the argument for simplifying what fieldNames is doing (looking at the aggregate inside an inlinestats) is not strong enough to warrant the refactoring. This change needs to be conceptually sound to make sense, ignoring the EsqlSession stuff.
Meaning, the conceptually sound argument needs to drive the refactoring and not the fact that fieldNames becomes more complex.
| inlinestats max(salary) by l
| stats min = min(salary) by l
| eval x = min + 1
| stats ca = count(*), cx = count(x) by l
I think the same behavior is expected when this stats is replaced by a keep x, l (no wildcard), right?
Maybe let's add such tests, and also some where the STATS or KEEP (no wildcard) comes before the INLINESTATS, for good measure.
…field_names_for_inlinestats_fix
bpintea left a comment
I agree with Alex's observation in general, but I think the fix as-is is fine and contained. We can consider redesigning INLINESTATS as a follow-up (maybe considering the join it actually is).
plan -> plan instanceof Project
    || (plan instanceof Aggregate agg && (inlinestatsAggs.isEmpty() || inlinestatsAggs.contains(agg) == false))
- plan -> plan instanceof Project
-     || (plan instanceof Aggregate agg && (inlinestatsAggs.isEmpty() || inlinestatsAggs.contains(agg) == false))
+ plan -> plan instanceof Project
+     || plan instanceof Aggregate agg && inlinestatsAggs.contains(agg) == false
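The suggestion drops the isEmpty() check, which is safe because an empty set never contains anything, so contains(agg) == false already covers that case. A small sketch of the equivalence, using plain strings as hypothetical stand-ins for Aggregate nodes (the class and method names below are illustrative only):

```java
import java.util.List;
import java.util.Set;

class PredicateEquivalence {
    // Shape of the condition before the suggestion.
    static boolean withEmptyCheck(Set<String> inlinestatsAggs, String agg) {
        return inlinestatsAggs.isEmpty() || inlinestatsAggs.contains(agg) == false;
    }

    // Shape of the condition after the suggestion.
    static boolean withoutEmptyCheck(Set<String> inlinestatsAggs, String agg) {
        return inlinestatsAggs.contains(agg) == false;
    }

    // Compare both forms on an empty set, a singleton and a two-element set.
    static boolean agreeOnSamples() {
        List<Set<String>> sets = List.of(Set.<String>of(), Set.of("a"), Set.of("a", "b"));
        for (Set<String> set : sets) {
            for (String agg : List.of("a", "b", "c")) {
                if (withEmptyCheck(set, agg) != withoutEmptyCheck(set, agg)) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(agreeOnSamples()); // prints true
    }
}
```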
…field_names_for_inlinestats_fix
@elasticmachine run elasticsearch-ci/part-3
@elasticmachine run elasticsearch-ci/part-4
@elasticmachine run elasticsearch-ci/bwc-snapshots
@elasticmachine run elasticsearch-ci/part-4
💔 Backport failed
You can use sqren/backport to manually backport by running
…es (elastic#127564) * Make inlinestats "transparent" to EsqlSession.fieldNames (cherry picked from commit 28b10c3)
The aggregate inside an inlinestats is "interfering" with the way field names are collected for field_caps requests. This made simple queries like
from test | inlinestats max(whatever) by group
not return all fields from test, but limit the resulting columns to whatever and group.
inlinestats' purpose is to add columns to an already existing set of columns, which implies that this command has to be "transparent" to any wider collection of field names.
Fixes #127236
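The effect described above can be illustrated with a toy model. This is a sketch under simplifying assumptions and not the actual EsqlSession.fieldNames code: Kind, Command, ALL_FIELDS and the inlinestatsNarrows flag are hypothetical, and a query is reduced to a flat list of commands with the field names each one mentions.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class FieldNamesSketch {
    enum Kind { FROM, EVAL, KEEP, STATS, INLINESTATS }

    // Hypothetical command model: its kind plus the field names it mentions.
    record Command(Kind kind, Set<String> referencedFields) {}

    static final Set<String> ALL_FIELDS = Set.of("*");

    // If no command narrows the output columns, ask for every field.
    // inlinestatsNarrows = true mimics the reported bug, where the aggregate
    // inside inlinestats was treated like a regular STATS.
    static Set<String> fieldNames(List<Command> query, boolean inlinestatsNarrows) {
        boolean narrowed = query.stream().anyMatch(c ->
            c.kind() == Kind.KEEP
                || c.kind() == Kind.STATS
                || (inlinestatsNarrows && c.kind() == Kind.INLINESTATS));
        if (narrowed == false) {
            return ALL_FIELDS;
        }
        Set<String> names = new TreeSet<>();
        for (Command c : query) {
            names.addAll(c.referencedFields());
        }
        return names;
    }

    public static void main(String[] args) {
        // from test | inlinestats max(whatever) by group
        List<Command> query = List.of(
            new Command(Kind.FROM, Set.of()),
            new Command(Kind.INLINESTATS, Set.of("whatever", "group"))
        );
        System.out.println(fieldNames(query, true));  // prints [group, whatever]
        System.out.println(fieldNames(query, false)); // prints [*]
    }
}
```

With the flag off, inlinestats is "transparent": the query still requests every field from test, and inlinestats only adds columns on top.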