ES|QL: Add initial grammar and planning for RRF (snapshot)#123396
ES|QL: Add initial grammar and planning for RRF (snapshot)#123396ioanatia merged 16 commits intoelastic:mainfrom
Conversation
|
Pinging @elastic/es-analytical-engine (Team:Analytics) |
|
Hi @ioanatia, I've created a changelog YAML for you. |
ChrisHegarty
left a comment
There was a problem hiding this comment.
Overall this change look very concise and clean. I left a few small comments.
...ugin/esql/compute/src/main/java/org/elasticsearch/compute/operator/RrfScoreEvalOperator.java
Outdated
Show resolved
Hide resolved
x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/parser/StatementParserTests.java
Show resolved
Hide resolved
|
|
||
| int rank = counters.getOrDefault(fork, 1); | ||
| counters.put(fork, rank + 1); | ||
| scores.appendDouble(1.0 / (60 + rank)); |
There was a problem hiding this comment.
minor: this is currently configurable in _search, so we probably need to expose it as an option in the future here too
There was a problem hiding this comment.
You are right, we need to make the rank constant configurable.
This is added as a separate feature in the meta issue #123391
It will require a syntax change for RRF, so I'd like to keep it separate for now.
...ck/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/planner/LocalExecutionPlanner.java
Outdated
Show resolved
Hide resolved
carlosdelest
left a comment
There was a problem hiding this comment.
Looks really good - I like the separation into individual pieces (dedup, operator, order).
I have some minor questions, and some error messages I think could be better
...ugin/esql/compute/src/main/java/org/elasticsearch/compute/operator/RrfScoreEvalOperator.java
Outdated
Show resolved
Hide resolved
| from test | ||
| | rrf | ||
| """)); | ||
| assertThat(e.getMessage(), containsString("Unknown column [_score]")); |
There was a problem hiding this comment.
I think we should provide a better error message for missing metadata attrs- something like "_score is needed for using RRF. Please add METADATA _score to your FROM command".
There was a problem hiding this comment.
I looked into this - looks simple enough at a first glance. We can just modify this to have a custom error message when MetadataAttribute.isSupported(name) is true:
However we would return an error message like "Please add METADATA _score to your FROM command" even if you use ROW:
ROW a = 1, b = "two", c = null
| WHERE _score > 1
I know this is a very narrow corner case, but it would be an unintended behaviour.
It's not straighforward to get the context when we call UnresolvedAttribute.errorMessage whether the source command supports metadata attributes or not. So I think at most, we can look into this separately and not make the change here.
There was a problem hiding this comment.
I think it would be OK to error with "_score is needed for using RRF. Use FROM ... METADATA _score".
We can assume that full text search needs FROM, as FTFs need an index attribute to operate on?
We can refine this in a follow up, but it will be very confusing for users to receive "unknown column _score" - being a metadata attribute means users won't understand where's that coming from without referring to docs
There was a problem hiding this comment.
agreed - added as a follow up in #123391
| public void testRrfError() { | ||
| assumeTrue("requires RRF capability", EsqlCapabilities.Cap.FORK.isEnabled()); | ||
|
|
||
| var e = expectThrows(VerificationException.class, () -> analyze(""" |
There was a problem hiding this comment.
Shouldn't a explicit message like "FORK is needed before RRF" be added here so users have a clear understanding of the RRF usage?
There was a problem hiding this comment.
Added that in RrfScoreEval by implementing PostAnalysisVerificationAware.
However the check for unresolved attributes is done before the PostAnalysisVerificationAware checks.
I don't want to add a check just for RRF in the Verifier before we do the unresolved attributes check:
(planCheckers is looking for plans that implement PostAnalysisVerificationAware).
This deserves a bit more thought, so I am adding it as a follow up #123391
There was a problem hiding this comment.
That's reasonable. Thanks for looking into this.
x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/analysis/AnalyzerTests.java
Show resolved
Hide resolved
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/parser/LogicalPlanBuilder.java
Show resolved
Hide resolved
fang-xing-esql
left a comment
There was a problem hiding this comment.
Thank you @ioanatia, I added some questions about the usage of RRF that I can think of.
x-pack/plugin/esql/qa/testFixtures/src/main/resources/rrf.csv-spec
Outdated
Show resolved
Hide resolved
|
|
||
| List<Order> order = List.of( | ||
| new Order(source, scoreAttr, Order.OrderDirection.DESC, Order.NullsPosition.LAST), | ||
| new Order(source, idAttr, Order.OrderDirection.ASC, Order.NullsPosition.LAST), |
There was a problem hiding this comment.
I wonder if there is a specific reason that we decide to sort on _id before _index? Or the order of these two fields doesn't matter?
There was a problem hiding this comment.
I don't think it matters - we just need a tiebreaker.
There was a problem hiding this comment.
I was wondering if the two extra sort keys(_id and _index) are necessary, as longer sort key length may affect performance. ES|QL does not guarantee the order of the results unless an explicit sort is coded in the query, it is similar as SQL. This could be a potential performance related follow up, in case we see performance issue with RRF.
| required_capability: match_operator_colon | ||
|
|
||
| FROM books METADATA _id, _index, _score | ||
| | FORK ( WHERE title:"Tolkien" | SORT _score DESC | LIMIT 3 ) |
There was a problem hiding this comment.
If we can have some queries with disjunctions in the where clause of each fork leg that will be great, just to add a bit more complexity to make sure it works as expected. There are some queries with disjunctions in the match function and operator's csvtests, that can be used as a reference.
| assertThat(e.getMessage(), containsString("Unknown column [_score]")); | ||
| assertThat(e.getMessage(), containsString("Unknown column [_fork]")); | ||
|
|
||
| e = expectThrows(VerificationException.class, () -> analyze(""" |
There was a problem hiding this comment.
I wonder if the sequence between FORK and RRF matters? For example if the sequence of fork and RRF is reversed, do we recognized it as a valid query?
| RRF
| FORK (WHERE a:"x")
(WHERE a:"y")
Do we allow multiple fork or RRF, like below? Do they make sense? ES|QL does not prevent multiple occurrence of the same processing commands, commands like where, eval etc. can be used multiple times in the same query, is this also true for RRF and fork?
| FORK (WHERE a:"x")
(WHERE a:"y")
| RRF
| RRF
or
| FORK (WHERE a:"x")
(WHERE a:"y")
| RRF
| FORK (WHERE b:"x")
(WHERE b:"y")
| RRF
There was a problem hiding this comment.
I have put a validation for RrfScoreEval such that we only allow RRF after a FORK command.
It might seem a bit extreme, but it makes sense in practice because while we might be able to execute the following queries, they don't make a lot of sense:
| RRF
| FORK (WHERE a:"x")
(WHERE a:"y")
or
| FORK (WHERE a:"x")
(WHERE a:"y")
| RRF
| RRF
Another thing to note is that we currently have a restriction for FORK where it's possible to only have a single FORK command in a query, so the following is not something we can do atm:
| FORK (WHERE a:"x")
(WHERE a:"y")
| RRF
| FORK (WHERE b:"x")
(WHERE b:"y")
| RRF
There was a problem hiding this comment.
I did see one thing that was concerning when I tried to do:
| FORK (WHERE a:"x")
(WHERE a:"y")
| RRF
| RRF
this would lead to an unexecutable query because when do the RRF planning this expands to:
| FORK (WHERE a:"x")
(WHERE a:"y")
| RrfScoreEval
| Dedup
| Sort
| RrfScoreEval
| Dedup
| Sort
The first SORT does not have a LIMIT so it cannot be translated to a TOP N.
I need to think more about this, not about supporting the case where we do RRF after RRF, but how to avoid this case of having unexecutable queries - I added it as a follow in #123391
| | FORK ( WHERE emp_no:10001 ) | ||
| ( WHERE emp_no:10002 ) | ||
| | RRF | ||
| | EVAL _score = round(_score, 4) |
|
|
||
| List<Order> order = List.of( | ||
| new Order(source, scoreAttr, Order.OrderDirection.DESC, Order.NullsPosition.LAST), | ||
| new Order(source, idAttr, Order.OrderDirection.ASC, Order.NullsPosition.LAST), |
There was a problem hiding this comment.
I was wondering if the two extra sort keys(_id and _index) are necessary, as longer sort key length may affect performance. ES|QL does not guarantee the order of the results unless an explicit sort is coded in the query, it is similar as SQL. This could be a potential performance related follow up, in case we see performance issue with RRF.
Part of #123391 where we will keep track of any follow ups.
RRFis split into 3 parts:RrfScoreEvalreceives a discriminator column. It will assign a score for each row based on the position in the subset.Dedupis aSurrogateLogicalPlanthat expands intoSTATS _score =SUM(_score), field1 = VALUES(field1), field2=VALUES(field2), ... BY _id, _index, where:_score =SUM(_score)gives us the final RRF score_idand_indexfield1,field2... are the rest of the available columns that are not_score,_id,_indexand that we want to carry overSORT BY _score, _id, _index DESC- so that we return the sorted results; we use_idand_indexas a way to ensure the result order is deterministic (similar to what we do for_search).The
Dedupstep is the one that needs more consideration - at this stage I grouped by_idand_indexsince it was the easiest ATM, but ideally we might want to use an internal search ID (that's composed by the shard ID + doc ID).The other annoying aspect of grouping by
_idand_indexin the current implementation is that we require havingMETADATA _id, _index. It would be nice to evolve RRF to a place where this is not needed.