Support querying multiple indices with the simplified linear retriever#133720
Support querying multiple indices with the simplified linear retriever#133720ioanatia merged 10 commits intoelastic:mainfrom
Conversation
|
Hi @ioanatia, I've created a changelog YAML for you. |
| } | ||
| return innerRetrievers; | ||
| // there are no lexical fields that need to be queried, no need to create a retriever | ||
| if (lexicalQueryBuilders.isEmpty()) { |
There was a problem hiding this comment.
we could be in this situation when:
- No query fields are provided and we need to use the ones from
index.query.default_fieldindex setting and not all indices have the same value for this setting, resulting in different list of fields that need to be queried per index. This case is uncommon. - One of the fields from the list that is provided to the linear retriever is a semantic_text field that is not present in all indices. This will be more common. It could be that a field with the same name does not even exist in all the other indices, or that it was mapped to another mapping type. We don't really know since we don't have access to the mappings.
From what I tested, the current approach where compose a boolean query and we filter by index names, seems to mitigate both cases.
If we find it too complicated, I can change this to raise an exception for both of these cases, so that we always use a single multi match query.
There was a problem hiding this comment.
One of the fields from the list that is provided to the linear retriever is a semantic_text field that is not present in all indices. This will be more common. It could be that a field with the same name does not even exist in all the other indices, or that it was mapped to another mapping type. We don't really know since we don't have access to the mappings.
Agreed this will be more common, enough so to the point that IMO we need to handle it. I think it's also worth pointing out that we can be in this case when differentNonInferenceFields == true and lexicalQueryBuilders is not empty.
Mikep86
left a comment
There was a problem hiding this comment.
Nice work! I reviewed just the core logic for now, not the tests. I have some suggestions about how we may be able to simplify and handle more edge cases at the same time.
...ugin/rank-rrf/src/main/java/org/elasticsearch/xpack/rank/MultiFieldsInnerRetrieverUtils.java
Outdated
Show resolved
Hide resolved
...ugin/rank-rrf/src/main/java/org/elasticsearch/xpack/rank/MultiFieldsInnerRetrieverUtils.java
Show resolved
Hide resolved
| } | ||
| return innerRetrievers; | ||
| // there are no lexical fields that need to be queried, no need to create a retriever | ||
| if (lexicalQueryBuilders.isEmpty()) { |
There was a problem hiding this comment.
One of the fields from the list that is provided to the linear retriever is a semantic_text field that is not present in all indices. This will be more common. It could be that a field with the same name does not even exist in all the other indices, or that it was mapped to another mapping type. We don't really know since we don't have access to the mappings.
Agreed this will be more common, enough so to the point that IMO we need to handle it. I think it's also worth pointing out that we can be in this case when differentNonInferenceFields == true and lexicalQueryBuilders is not empty.
...ugin/rank-rrf/src/main/java/org/elasticsearch/xpack/rank/MultiFieldsInnerRetrieverUtils.java
Show resolved
Hide resolved
|
Pinging @elastic/search-relevance (Team:Search - Relevance) |
| assertFalse("the lexical retriever is only asserted once", assertedLexical); | ||
| assertFalse(expectedNonInferenceFields.isEmpty()); | ||
|
|
||
| QueryBuilder topDocsQueryBuilder = standardRetrieverBuilder.topDocsQuery(); |
There was a problem hiding this comment.
the reason why I went through these hoops for the lexical query is because we are comparing boolean queries.
and AFAICS I can't guarantee the order of the should clauses which are held in a List<QueryBuilder>.
when we compare two boolean queries, we check that we have the same list of should clauses:
so here I couldn't just use a simple assertEquals check as we had before
Mikep86
left a comment
There was a problem hiding this comment.
Looks good, awesome work 🚀 ! I left some minor comments about tests, they should be very easy to address.
| queryBuilder = new BoolQueryBuilder().must(queryBuilder).filter(new TermsQueryBuilder("_index", indices)); | ||
| } | ||
| return new InnerRetriever(new StandardRetrieverBuilder(queryBuilder), weight, expectedNormalizer); | ||
| }).collect(Collectors.toSet()); |
There was a problem hiding this comment.
Can we check that rankWindowSize is propagated to the rewritten retriever as expected?
| MinMaxScoreNormalizer.INSTANCE, | ||
| DEFAULT_RANK_WINDOW_SIZE, | ||
| new float[0], | ||
| new ScoreNormalizer[0] |
There was a problem hiding this comment.
Nit: Use indexName instead of a magic string here? Repeat where "test-index" is referenced below.
| new float[0], | ||
| new ScoreNormalizer[0] | ||
| ); | ||
| assertMultiIndexMultiFieldsParamsRewrite( |
There was a problem hiding this comment.
Nit: Use anotherIndexName instead of a magic string here? Repeat where "test-another-index" is referenced below.
This only handles the linear retriever, but the same approach can be used for the RRF retriever.
I just haven't removed the restriction for RRF yet, we can do it in a follow up.
It should be as simple as removing the check we have for RRF and adding tests.
And once we do this change for both retrievers, we can adjust the docs too.
When we rewrite the linear retriever, we form two sub-retrievers:
The lexical query uses multi match at the moment.
For the semantic part, we group by (inferenceId, field_name) and issue a match query for each group.
For each group, if the semantic_text field does not exist in all queried indices, we add an additional filter on the index names.
Some examples:
Let's assume we have index A and index B and both have the same semantic text fields using the same inference IDs:
semantic_field_1andsemantic_field_2.And that we issue a query to query the
lexical_field_*,semantic_field_1,semantic_field_2.The linear retriever will be rewritten to:
Let's now assume that indexB has an extra
semantic_field_3and that we query:["lexical_field_*^3", "semantic_field_1^10", "semantic_field_2^20", "semantic_field_3^30"].The linear retriever will be rewritten to:
If only
semantic_textfield is queried, we don't need to have 2 levels of normalization, so we rewrite to: