Fix for from parameter when using sub_searches and rank#106253
Fix for from parameter when using sub_searches and rank#106253pmpailis merged 24 commits intoelastic:mainfrom
Conversation
|
Documentation preview: |
|
Pinging @elastic/es-search (Team:Search) |
|
Hi @pmpailis, I've created a changelog YAML for you. |
benwtrent
left a comment
There was a problem hiding this comment.
This doesn't seem right to me.
When we fetch data, we fetch from + size, it seems to me we should be fetching from + max(size + window_size). This is what we do with other things that use window_size (e.g. rescore).
FWICT when reaching out to shards, we currently fetch |
| query. A higher value will improve result relevance at the cost of performance. The final | ||
| ranked result set is pruned down to the search request's <<search-size-param, size>>. | ||
| `window_size` must be greater than or equal to `size` and greater than or equal to `1`. | ||
| `window_size` must be greater than or equal to `from + size` and greater than or equal to `1`. |
There was a problem hiding this comment.
Discussing Ben's comment and taking another look at this, I'm not sure if we need to update docs here (and the check in SearchRequest). If we are to operate on (from + window_size) then window_size would just be an override for size, i.e. fetch the window_size docs after already consuming from docs.
So in order to support pagination only through the from param, window_size should probably just be compared against size as is now. (i.e. one shouldn't have to change both from and window_size to do pagination).
Will see to update this accordingly.
There was a problem hiding this comment.
On the contrary though, for LTR we mention that:
Scores returned by LTR models are usually not comparable with the scores issued by the first pass query and can be lower than the non-rescored score. This can cause the non-rescored result document to be ranked higher than the rescored document. To prevent this, the
window_sizeparameter is mandatory for LTR rescorers and should be greater than or equal tofrom + size.
There was a problem hiding this comment.
While this isn't probably such an issue for rrf I wonder if we should have a more "standardized" and generic approach here with other potential rankers.
There was a problem hiding this comment.
The key for rescorers in general is that the window_size doesn't change as you paginate. We don't enforce anything directly in rescorers and I don't think we should.
Similarly, I don't think we should enforce anything with retrievers. We can document the behavior & situations where RRF might paginate things strangely and signal that window_size should be the entire pageable set (e.g. from+size). But I am against making that mandatory.
There was a problem hiding this comment.
signal that window_size should be the entire pageable set
I agree with that as well. So essentially, we document that window_size should be at least from + size and that pagination happens only within window_size results, but we could for example return no results if window_size is not configured properly for pagination (e.g. window_size: 10, size:5, from: 100)
There was a problem hiding this comment.
@benwtrent So for re-rankers are you saying that window_size should be independent of from in this case? IE pagination is from + size always and window_size needs to accommodate this by being >=. I think I like this approach because it if I'm understanding correctly, pagination within a specific window_size will lead to consistent results. And if window_size changes we can document that pagination does not necessarily remain the same. The only concern I would have is users may have very large window_size to support this when they may not need them for the initial pages.
There was a problem hiding this comment.
IIUC, based on QueryRescorer and QueryPhaseCollectorManager, what we do now is that:
- we iterate over all
Math.max(from + size, window_size)results - rescore the top
window_sizeresults in place (don't care ifwindow_sizeis less thanfrom + size - reorder all results, with expected potential failures in pagination if
window_size<from + size
We can be lenient in not enforcing window_size > from + size, but if this doesn't hold, we would need to do one of the following:
- return no results
- return potentially strangely ordered results (as in rescore)
So, taking this example window_size: 10, size:5, from: 100, we could either say that:
- we're reordering just the top 10 results, so a
fromvalue of 100 will return no results, or - say that we're computing updated scores for just the top 10 results out of 105 (
from + size), and reorder all results105based on that.
(in either case, we'd need to be careful on how many documents we sent out to shards in the feature reranking phase)
|
@pmpailis This looks great so far! Thanks for fixing this. |
|
@elasticmachine update branch |
|
Added another take on this in 6fcdaf9
For other potential rankers, we can be a bit more "lenient" and perform pagination / compute ranks for |
|
@pmpailis this change is pretty intense. Could you write up in the docs some examples about why the change takes place and why its necessary due to the nature of rrf? Since the document sets are independent, different documents may appear at different pages for each individual doc set, meaning the document could appear in later pages even if it appeared in earlier pages. Walking users through this would be useful. |
|
Sure thing @benwtrent ; I will follow up with some doc changes. As a first step I wanted to see if you're ok with the current proposal, and if we agree on the approach, I was also thinking of maybe adding a blog post (aside from docs) to do a clear walkthrough on what the expectations are. |
|
@pmpailis my main concern is that we are making the correct assumptions and that we have tests and examples walking through paging RRF in adverse scenarios. |
|
Very straightforward change, thank you! |
|
@elasticmachine update branch |
|
Added some documentation & tests to illustrate pagination for @jdconrad @benwtrent could you please take a look ? |
|
merge conflict between base and head |
benwtrent
left a comment
There was a problem hiding this comment.
Love the new tests and documentation.
I added a small request for clarification around resetting the from=0, but this is looking so good.
| sortedTopDocs = new SortedTopDocs(rankedDocs, false, null, null, null, 0); | ||
| size = sortedTopDocs.scoreDocs.length; | ||
| from = 0; |
There was a problem hiding this comment.
I don't understand why we are resetting from = 0 here.
There was a problem hiding this comment.
If this is necessary, we need to add a comment as to why.
There was a problem hiding this comment.
This one is needed because we use the from from the ReducedQueryPhase in SearchPhaseController#getHits in order to trim down the results. Since pagination has already taken place within the ranker, we don't need any additional operations in the later stages.
|
|
||
|
|
||
| --- | ||
| "Pagination within interleaved results": |
| original RRF search as expected. | ||
|
|
||
|
|
||
| ==== Pagination in RRF |
benwtrent
left a comment
There was a problem hiding this comment.
Ah, this is so much better with the tests & documentation.
One thing we should be clear about, is that since the individual query results are still gathered via from+size like normal, the additional overhead of a large RRF window is actually pretty minimal. The cost of gathering extra results is already their with regular pagination.
Regardless, thank you so much for iterating on this with me, this sort of thing is really tricky to get right. I think we are there now (🤞)
Well... 😅 As things stand, for each individual query on RRF, we make use of a
Apologies for going back and forth with this, but it can indeed turn out really tricky :/ |
I thought this PR removed this override? Maybe I missed where that override occurs. Regardless, I guess there still is a perf impact as a high window size would end up gathering more results. |
|
Tbh things got a bit confusing after so many iterations 😅 ! Thanks for the review and for all the brainstorming @benwtrent ! |
|
@elasticmachine update branch |
|
@elasticmachine update branch |
It seems that pagination does not work as expected for
rrfsearches, so in this PR we update theRRFRankCoordinatorContextto simply skip the firstfromresults from the combined top ranked docs prior to generating the final list to return. This has the following implications:window_size >= (from + size)to make sure that we have enough docs to rank and do the necessary pagination. (maybe we should also enforce an upper limit onwindow_size? 🤔 )reducedQueryPhase. Once we're done withRRFRankCoordinatorContext#rankwe resetfromto be 0 and proceed as usual.Closes #99011