[Inference API] add service and task type aware rate limiting#125880
Closed
brendan-jugan-elastic wants to merge 10 commits into elastic:main
Conversation
TreeMap<TaskType, MaxNodesPerGroupingStrategy> alibabaCloudSearchConfigs = new TreeMap<>();
var alibabaCloudSearchService = serviceRegistry.getService(AlibabaCloudSearchService.NAME);
if (alibabaCloudSearchService.isPresent()) {
    var alibabaCloudSearchTaskTypes = alibabaCloudSearchService.get().supportedTaskTypes();
Contributor
I think eventually we'll want something like this, but at the moment we don't support cross-node streaming, so we'll definitely need to exclude the chat_completion task type.
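One way the exclusion could look, as a minimal sketch rather than the PR's actual code: the `TaskType` enum below is a hypothetical stand-in for the real one, and `eligibleForNodeLocalRateLimiting` is an invented helper name. It copies the supported task types and drops `CHAT_COMPLETION` before any per-task-type config is built.

```java
import java.util.EnumSet;
import java.util.Set;

final class RateLimitedTaskTypes {
    // Hypothetical stand-in for the real TaskType enum; the values are illustrative.
    enum TaskType { TEXT_EMBEDDING, SPARSE_EMBEDDING, RERANK, COMPLETION, CHAT_COMPLETION }

    /**
     * Returns the supported task types minus CHAT_COMPLETION, which is
     * excluded because it streams and cross-node streaming is not supported.
     */
    static Set<TaskType> eligibleForNodeLocalRateLimiting(Set<TaskType> supported) {
        EnumSet<TaskType> eligible = EnumSet.noneOf(TaskType.class);
        eligible.addAll(supported);
        eligible.remove(TaskType.CHAT_COMPLETION);
        return eligible;
    }

    public static void main(String[] args) {
        Set<TaskType> supported = EnumSet.of(TaskType.TEXT_EMBEDDING, TaskType.CHAT_COMPLETION);
        System.out.println(eligibleForNodeLocalRateLimiting(supported));
    }
}
```

The loop that fills the per-service config map would then iterate over the filtered set instead of `supportedTaskTypes()` directly.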
        alibabaCloudSearchConfigs.put(taskType, defaultStrategy);
    }
}
serviceNodeLocalRateLimitConfigs.put(AlibabaCloudSearchService.NAME, alibabaCloudSearchConfigs);
Contributor
I doubt this will ever happen, but if the individual service is not present (isPresent() == false), do we still want to add the configs to the tree map?
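If the answer is no, the registration could move inside the presence check. The following is a sketch under assumptions: `registerIfPresent` and `ServiceConfigRegistration` are hypothetical names, and the generic parameters stand in for the PR's task type and strategy classes. An absent service then leaves no entry at all, rather than an empty map.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.TreeMap;

final class ServiceConfigRegistration {
    /**
     * Adds a per-task-type config map for the service only when its
     * supported task types are available; an absent service adds nothing.
     */
    static <T extends Comparable<T>, S> void registerIfPresent(
        Map<String, Map<T, S>> configsByService,
        String serviceName,
        Optional<Set<T>> supportedTaskTypes,
        S strategy
    ) {
        supportedTaskTypes.ifPresent(taskTypes -> {
            Map<T, S> perTaskType = new TreeMap<>();
            for (T taskType : taskTypes) {
                perTaskType.put(taskType, strategy);
            }
            configsByService.put(serviceName, perTaskType);
        });
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> configs = new TreeMap<>();
        registerIfPresent(configs, "missing-service", Optional.<Set<String>>empty(), 3);
        registerIfPresent(configs, "some-service", Optional.of(Set.of("rerank")), 3);
        System.out.println(configs.keySet());
    }
}
```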
public static DeepSeekRequestManager.RateLimitGrouping of(DeepSeekChatCompletionModel model) {
    Objects.requireNonNull(model);

    return new DeepSeekRequestManager.RateLimitGrouping(model.apiKey().hashCode());
Contributor
I believe it was intentional to set the rate limit to the maximum allowed, so I think we should revert the changes around the API key here.
cc: @prwhelan
Member
This is correct: there is effectively no rate limiting for DeepSeek.
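For context on what the API-key change would do, here is a small sketch of how a grouping key partitions requests. It is an illustration only, not the actual `DeepSeekRequestManager`: `GroupingSketch`, `countRequest`, and the string API keys are hypothetical. Requests whose groupings are equal share one bucket, so hashing the API key splits traffic per key. With no provider-side limit to enforce, that split adds nothing, which is the argument for reverting it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

final class GroupingSketch {
    // Hypothetical grouping key: equal keys share one rate-limit bucket.
    record RateLimitGrouping(int keyHash) {
        static RateLimitGrouping of(String apiKey) {
            Objects.requireNonNull(apiKey);
            return new RateLimitGrouping(apiKey.hashCode());
        }
    }

    private final Map<RateLimitGrouping, Integer> requestsPerGrouping = new HashMap<>();

    /** Counts a request against its grouping and returns the running total. */
    int countRequest(RateLimitGrouping grouping) {
        return requestsPerGrouping.merge(grouping, 1, Integer::sum);
    }

    public static void main(String[] args) {
        GroupingSketch sketch = new GroupingSketch();
        sketch.countRequest(RateLimitGrouping.of("key-a"));
        sketch.countRequest(RateLimitGrouping.of("key-a"));
        sketch.countRequest(RateLimitGrouping.of("key-b"));
        System.out.println(sketch.requestsPerGrouping);
    }
}
```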
The goal of this PR is to address the rate-limiting follow-up TODOs introduced by this PR and tracked by this issue in order to support service and task type aware rate-limiting.