Skip to content

Fix for from parameter when using sub_searches and rank#106253

Merged
pmpailis merged 24 commits intoelastic:mainfrom
pmpailis:fix/99011_from_size_values_fix_for_subsearches_rank
Apr 25, 2024
Merged

Fix for from parameter when using sub_searches and rank#106253
pmpailis merged 24 commits intoelastic:mainfrom
pmpailis:fix/99011_from_size_values_fix_for_subsearches_rank

Conversation

@pmpailis
Copy link
Contributor

It seems that pagination does not work as expected for rrf searches, so in this PR we update the RRFRankCoordinatorContext to simply skip the first from results from the combined top ranked docs prior to generating the final list to return. This has the following implications:

  • We must make sure that window_size >= (from + size) to make sure that we have enough docs to rank and do the necessary pagination. (maybe we should also enforce an upper limit on window_size ? 🤔 )
  • All pagination will be handled by the ranker on the coordinator node prior to exiting the reducedQueryPhase . Once we're done with RRFRankCoordinatorContext#rank we reset from to be 0 and proceed as usual.

Closes #99011

@pmpailis pmpailis added >bug :Search Relevance/Ranking Scoring, rescoring, rank evaluation. Team:Search Meta label for search team v8.14.0 labels Mar 12, 2024
@pmpailis pmpailis requested a review from jdconrad March 12, 2024 16:30
@github-actions
Copy link
Contributor

Documentation preview:

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

@elasticsearchmachine
Copy link
Collaborator

Hi @pmpailis, I've created a changelog YAML for you.

Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem right to me.

When we fetch data, we fetch from + size, it seems to me we should be fetching from + max(size + window_size). This is what we do with other things that use window_size (e.g. rescore).

@pmpailis
Copy link
Contributor Author

When we fetch data, we fetch from + size, it seems to me we should be fetching from + max(size + window_size). This is what we do with other things that use window_size (e.g. rescore).

FWICT when reaching out to shards, we currently fetch from + window_size docs from each one, as we override size in RankSearchContext, and use from from the global SearchContext. IIUC and given the current restriction of window_size >= size, I think that we should be in line with your suggestion.

query. A higher value will improve result relevance at the cost of performance. The final
ranked result set is pruned down to the search request's <<search-size-param, size>>.
`window_size` must be greater than or equal to `size` and greater than or equal to `1`.
`window_size` must be greater than or equal to `from + size` and greater than or equal to `1`.
Copy link
Contributor Author

@pmpailis pmpailis Mar 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussing Ben's comment and taking another look at this, I'm not sure if we need to update docs here (and the check in SearchRequest). If we are to operate on (from + window_size) then window_size would just be an override for size, i.e. fetch the window_size docs after already consuming from docs.

So in order to support pagination only through the from param, window_size should probably just be compared against size as is now. (i.e. one shouldn't have to change both from and window_size to do pagination).

Will see to update this accordingly.

Copy link
Contributor Author

@pmpailis pmpailis Mar 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the contrary though, for LTR we mention that:

Scores returned by LTR models are usually not comparable with the scores issued by the first pass query and can be lower than the non-rescored score. This can cause the non-rescored result document to be ranked higher than the rescored document. To prevent this, the window_size parameter is mandatory for LTR rescorers and should be greater than or equal to from + size.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this isn't probably such an issue for rrf I wonder if we should have a more "standardized" and generic approach here with other potential rankers.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The key for rescorers in general is that the window_size doesn't change as you paginate. We don't enforce anything directly in rescorers and I don't think we should.

Similarly, I don't think we should enforce anything with retrievers. We can document the behavior & situations where RRF might paginate things strangely and signal that window_size should be the entire pageable set (e.g. from+size). But I am against making that mandatory.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

signal that window_size should be the entire pageable set

I agree with that as well. So essentially, we document that window_size should be at least from + size and that pagination happens only within window_size results, but we could for example return no results if window_size is not configured properly for pagination (e.g. window_size: 10, size:5, from: 100)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benwtrent So for re-rankers are you saying that window_size should be independent of from in this case? IE pagination is from + size always and window_size needs to accommodate this by being >=. I think I like this approach because it if I'm understanding correctly, pagination within a specific window_size will lead to consistent results. And if window_size changes we can document that pagination does not necessarily remain the same. The only concern I would have is users may have very large window_size to support this when they may not need them for the initial pages.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, based on QueryRescorer and QueryPhaseCollectorManager, what we do now is that:

  • we iterate over all Math.max(from + size, window_size) results
  • rescore the top window_size results in place (don't care if window_size is less than from + size
  • reorder all results, with expected potential failures in pagination if window_size < from + size

We can be lenient in not enforcing window_size > from + size, but if this doesn't hold, we would need to do one of the following:

  • return no results
  • return potentially strangely ordered results (as in rescore)

So, taking this example window_size: 10, size:5, from: 100, we could either say that:

  • we're reordering just the top 10 results, so a from value of 100 will return no results, or
  • say that we're computing updated scores for just the top 10 results out of 105 (from + size), and reorder all results 105 based on that.

(in either case, we'd need to be careful on how many documents we sent out to shards in the feature reranking phase)

@jdconrad
Copy link
Contributor

@pmpailis This looks great so far! Thanks for fixing this.

@pmpailis
Copy link
Contributor Author

@elasticmachine update branch

@pmpailis
Copy link
Contributor Author

Added another take on this in 6fcdaf9

  • renaming window_size to rank_window_size for RRF (and all rankers) so that we can differentiate from rescore's window_size (to avoid confusion if they end up behaving differently)
  • enforcing rank_window_size to be the whole result set upon which we perform pagination for RRF. Will provide additional examples to justify this change

For other potential rankers, we can be a bit more "lenient" and perform pagination / compute ranks for [from: rank_window_size] ranges, ensuring that pagination is consistent, iff rank_window_size remains the same.

@benwtrent
Copy link
Member

@pmpailis this change is pretty intense. Could you write up in the docs some examples about why the change takes place and why its necessary due to the nature of rrf?

Since the document sets are independent, different documents may appear at different pages for each individual doc set, meaning the document could appear in later pages even if it appeared in earlier pages. Walking users through this would be useful.

@pmpailis
Copy link
Contributor Author

Sure thing @benwtrent ; I will follow up with some doc changes. As a first step I wanted to see if you're ok with the current proposal, and if we agree on the approach, I was also thinking of maybe adding a blog post (aside from docs) to do a clear walkthrough on what the expectations are.

@benwtrent
Copy link
Member

@pmpailis my main concern is that we are making the correct assumptions and that we have tests and examples walking through paging RRF in adverse scenarios.

@kovyrin
Copy link

kovyrin commented Apr 16, 2024

Very straightforward change, thank you!

@pmpailis
Copy link
Contributor Author

@elasticmachine update branch

@pmpailis
Copy link
Contributor Author

Added some documentation & tests to illustrate pagination for rrf in 9e87c0c.

@jdconrad @benwtrent could you please take a look ?

@elasticmachine
Copy link
Collaborator

merge conflict between base and head

Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love the new tests and documentation.

I added a small request for clarification around resetting the from=0, but this is looking so good.

Comment on lines 644 to 646
sortedTopDocs = new SortedTopDocs(rankedDocs, false, null, null, null, 0);
size = sortedTopDocs.scoreDocs.length;
from = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we are resetting from = 0 here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is necessary, we need to add a comment as to why.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is needed because we use the from from the ReducedQueryPhase in SearchPhaseController#getHits in order to trim down the results. Since pagination has already taken place within the ranker, we don't need any additional operations in the later stages.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment added in fba961b



---
"Pagination within interleaved results":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

original RRF search as expected.


==== Pagination in RRF
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, this is so much better with the tests & documentation.

One thing we should be clear about, is that since the individual query results are still gathered via from+size like normal, the additional overhead of a large RRF window is actually pretty minimal. The cost of gathering extra results is already their with regular pagination.

Regardless, thank you so much for iterating on this with me, this sort of thing is really tricky to get right. I think we are there now (🤞)

@pmpailis
Copy link
Contributor Author

One thing we should be clear about, is that since the individual query results are still gathered via from+size like normal, the additional overhead of a large RRF window is actually pretty minimal. The cost of gathering extra results is already their with regular pagination.

Well... 😅 As things stand, for each individual query on RRF, we make use of a RankSearchContext which overrides size with window_size (https://github.com/elastic/elasticsearch/blob/main/server/src/main/java/org/elasticsearch/search/rank/RankSearchContext.java#L185). So, I think that (unfortunately) the performance impact will still be there if we keep this as-is. Changing this will mean that we will gather from+size docs (as usual) and will apply window_size as essentially an upper bound on what can be reranked. But if we do that, I think that we might face consistency issues as the result set from each query will dynamically change as from changes 🤔 Also, for window_size we currently mention in the docs that:

window_size
(Optional, integer) This value determines the size of the individual result sets per query. A higher value will improve result relevance at the cost of performance.

Apologies for going back and forth with this, but it can indeed turn out really tricky :/

@benwtrent
Copy link
Member

So, I think that (unfortunately) the performance impact will still be there if we keep this as-is.

I thought this PR removed this override? Maybe I missed where that override occurs.

Regardless, I guess there still is a perf impact as a high window size would end up gathering more results.

@pmpailis
Copy link
Contributor Author

Tbh things got a bit confusing after so many iterations 😅 ! Thanks for the review and for all the brainstorming @benwtrent !

@pmpailis
Copy link
Contributor Author

@elasticmachine update branch

@pmpailis
Copy link
Contributor Author

@elasticmachine update branch

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>bug :Search Relevance/Ranking Scoring, rescoring, rank evaluation. Team:Search Meta label for search team v8.15.0

6 participants