[Inference API] Add node-local rate limiting for the inference API #120400
Merged
demjened merged 72 commits into elastic:main on Jan 29, 2025
…of InferencePlugin and adjust formatting.
timgrein commented Jan 21, 2025
    List<DiscoveryNode> assignedNodes = new ArrayList<>();

    // TODO: here we can probably be smarter: if |num nodes in cluster| > |num nodes per task types|
Contributor
Author
I've kept this out of scope for this PR for now, as we only need it once we support multiple services and/or task types.
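For readers following the TODO, here is a minimal sketch of what assigning nodes to a task type and splitting a cluster-wide limit across them could look like. It is not the code from this PR: `NodeAssignmentSketch`, `assignNodes`, and `perNodeLimit` are hypothetical names, plain string IDs stand in for `DiscoveryNode`, and rounding the per-node share up is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch; names and rounding behaviour are assumptions, not the PR's implementation.
final class NodeAssignmentSketch {

    private NodeAssignmentSketch() {}

    // Pick up to maxNodes distinct nodes at random to be responsible for one task type.
    static List<String> assignNodes(List<String> clusterNodes, int maxNodes, Random random) {
        if (maxNodes < 1) {
            throw new IllegalArgumentException("maxNodes must be at least 1");
        }
        List<String> shuffled = new ArrayList<>(clusterNodes);
        Collections.shuffle(shuffled, random);
        int count = Math.min(maxNodes, shuffled.size());
        return new ArrayList<>(shuffled.subList(0, count));
    }

    // Split a cluster-wide limit evenly across the assigned nodes.
    // Rounds up, so the sum of the per-node limits can slightly exceed the cluster-wide limit.
    static long perNodeLimit(long clusterWideLimit, int assignedNodeCount) {
        if (assignedNodeCount < 1) {
            throw new IllegalArgumentException("assignedNodeCount must be at least 1");
        }
        return (clusterWideLimit + assignedNodeCount - 1) / assignedNodeCount;
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("node-a", "node-b", "node-c", "node-d");
        List<String> assigned = assignNodes(nodes, 3, new Random(42));
        long limit = perNodeLimit(1000, assigned.size());
        System.out.println(assigned.size() + " nodes assigned, per-node limit " + limit);
    }
}
```

The "smarter" variant the TODO hints at would presumably spread different task types over disjoint nodes when the cluster is large enough; this sketch does not attempt that.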
Collaborator
Hi @timgrein, I've created a changelog YAML for you.
…ting' into inference-api-adaptive-rate-limiting
timgrein commented Jan 29, 2025
    }

    private NodeRoutingDecision determineRouting(String serviceName, Request request, UnparsedModel unparsedModel) {
        if (INFERENCE_API_CLUSTER_AWARE_RATE_LIMITING_FEATURE_FLAG.isEnabled() == false) {
Contributor
Author
Not strictly necessary, but we can keep it for now and remove it after FF
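To make the role of this check concrete, here is a rough sketch of a routing decision guarded by a feature flag. It is an illustration under assumptions, not the PR's `determineRouting`: `RoutingSketch`, `pickTargetNode`, the boolean flag parameter, and string node IDs are all hypothetical, and choosing uniformly at random among the responsible nodes is assumed.

```java
import java.util.List;
import java.util.Random;

// Hypothetical sketch of a feature-flag-guarded routing decision; not the PR's actual code.
final class RoutingSketch {

    private RoutingSketch() {}

    // Returns the ID of the node that should handle the request.
    // If the result equals localNodeId, the request is handled locally; otherwise it is rerouted.
    static String pickTargetNode(boolean featureFlagEnabled, String localNodeId, List<String> responsibleNodes, Random random) {
        // With the flag disabled (or no assignment known), fall back to local handling.
        if (featureFlagEnabled == false || responsibleNodes.isEmpty()) {
            return localNodeId;
        }
        // Assumption: pick uniformly at random among the nodes responsible for this request.
        return responsibleNodes.get(random.nextInt(responsibleNodes.size()));
    }

    public static void main(String[] args) {
        List<String> responsible = List.of("node-b", "node-c");
        String target = pickTargetNode(true, "node-a", responsible, new Random());
        System.out.println(target.equals("node-a") ? "handle locally" : "reroute to " + target);
    }
}
```

Keeping the early return means callers never have to know whether the flag is on, which is one reason a seemingly redundant check can be worth leaving in until the flag is removed.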
Collaborator
💚 Backport successful
elasticsearchmachine pushed a commit that referenced this pull request Jan 30, 2025
…API (#120400) (#121251)

* [Inference API] Add node-local rate limiting for the inference API (#120400)
* Add node-local rate limiting for the inference API
* Fix integration tests by using new LocalStateInferencePlugin instead of InferencePlugin and adjust formatting.
* Correct feature flag name
* Add more docs, reorganize methods and make some methods package private
* Clarify comment in BaseInferenceActionRequest
* Fix wrong merge
* Fix checkstyle
* Fix checkstyle in tests
* Check that the service we want to the read the rate limit config for actually exists
* [CI] Auto commit changes from spotless
* checkStyle apply
* Update docs/changelog/120400.yaml
* Move rate limit division logic to RequestExecutorService
* Spotless apply
* Remove debug sout
* Adding a few suggestions
* Adam feedback
* Fix compilation error
* [CI] Auto commit changes from spotless
* Add BWC test case to InferenceActionRequestTests
* Add BWC test case to UnifiedCompletionActionRequestTests
* Update x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/common/InferenceServiceNodeLocalRateLimitCalculator.java
  Co-authored-by: Adam Demjen <demjened@gmail.com>
* Update x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/common/InferenceServiceNodeLocalRateLimitCalculator.java
  Co-authored-by: Adam Demjen <demjened@gmail.com>
* Remove addressed TODO
* Spotless apply
* Only use new rate limit specific feature flag
* Use ThreadLocalRandom
* [CI] Auto commit changes from spotless
* Use Randomness.get()
* [CI] Auto commit changes from spotless
* Fix import
* Use ConcurrentHashMap in InferenceServiceNodeLocalRateLimitCalculator
* Check for null value in getRateLimitAssignment and remove AtomicReference
* Remove newAssignments
* Up the default rate limit for completions
* Put deprecated feature flag back in
* Check feature flag in BaseTransportInferenceAction
* spotlessApply
* Export inference.common
* Do not export inference.common
* Provide noop rate limit calculator, if feature flag is disabled
* Add proper dependency injection

---------

Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
Co-authored-by: Jonathan Buttner <jonathan.buttner@elastic.co>
Co-authored-by: Adam Demjen <demjened@gmail.com>

* Use .get(0) as getFirst() doesn't exist in 8.18 (probably JDK difference?)

---------

Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
Co-authored-by: Jonathan Buttner <jonathan.buttner@elastic.co>
Co-authored-by: Adam Demjen <demjened@gmail.com>
This was referenced Jan 31, 2025
This was referenced Mar 24, 2025
This PR combines the approaches described in (I've described each idea in isolation in each PR):
Some important notes:
inference_cluster_aware_rate_limiting
elastic inference provider in combination with the sparse_embedding task type

The combined high-level overview looks like the following: