[ML] Implementing latency improvements for EIS integration #133861
jonathan-buttner merged 12 commits into elastic:main
Conversation
Hi @jonathan-buttner, I've created a changelog YAML for you.

Pinging @elastic/ml-core (Team:ML)
Review threads (resolved):
- ...c/main/java/org/elasticsearch/xpack/core/inference/action/GetInferenceDiagnosticsAction.java
- ...lasticsearch/xpack/core/inference/action/GetInferenceDiagnosticsActionNodeResponseTests.java
- ...ugin/inference/src/main/java/org/elasticsearch/xpack/inference/external/http/HttpClient.java (two threads)
```diff
 public static final Setting<Integer> MAX_ROUTE_CONNECTIONS = Setting.intSetting(
     "xpack.inference.http.max_route_connections",
-    20, // default
+    200, // default
```
10x the default value is quite a step. Can we explore changing this with overrides in the environments where EIS is available?

After doing some more research: allowing more connections will result in more memory and file descriptors being used.

> Can we explore changing this with overrides in the environments where EIS is available?

I suspect that would mean putting in a lot of manual overrides. Maybe we leave these defaults as is for now and add metrics to get a better idea of what typical usage looks like.

When the cluster is located in the same region and provider as EIS, I typically saw ~20 connections in use once connections already existed in the pool. So the first spike of traffic will likely be limited by the 20-connection cap here, and usage should hopefully go down after that.
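The per-environment override idea from this thread could look like the following `elasticsearch.yml` fragment. The setting names come from this PR; applying them only where EIS is co-located (rather than raising global defaults) is the hypothetical alternative being discussed, not what the PR ended up doing:

```yaml
# Hypothetical per-environment override, applied only in deployments
# where EIS is available, instead of changing the shipped defaults:
xpack.inference.http.max_total_connections: 500
xpack.inference.http.max_route_connections: 200
```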
```diff
 static final Setting<TimeValue> RETRY_INITIAL_DELAY_SETTING = Setting.timeSetting(
     "xpack.inference.http.retry.initial_delay",
-    TimeValue.timeValueSeconds(1),
+    TimeValue.timeValueMillis(20),
```
1 second is too slow and a bad default value, but I don't know what a good default is. 20ms is a very short delay; perhaps 100ms? My understanding is that the latency was due to the connection pool configuration and retries weren't really happening, so it would be good to limit the scope of the changes in this PR if possible.

Yeah, I can switch to 100.
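To see why the initial delay matters, here is a minimal sketch of a capped exponential backoff where the first retry waits `initial_delay` and each subsequent retry doubles it. The helper name `delayForAttempt`, the doubling policy, and the 30s cap are assumptions for illustration; only the setting `xpack.inference.http.retry.initial_delay` and the 100ms value come from this discussion:

```java
import java.time.Duration;

public class RetryDelaySketch {
    // Hypothetical backoff: delay(n) = initialDelay * 2^n, capped at maxDelay.
    static Duration delayForAttempt(Duration initialDelay, Duration maxDelay, int attempt) {
        long candidate = initialDelay.toMillis() << attempt; // double per retry
        return Duration.ofMillis(Math.min(candidate, maxDelay.toMillis()));
    }

    public static void main(String[] args) {
        Duration initial = Duration.ofMillis(100); // the 100ms suggested in review
        Duration max = Duration.ofSeconds(30);     // assumed cap for the sketch
        for (int attempt = 0; attempt < 5; attempt++) {
            System.out.println("attempt " + attempt + " waits "
                + delayForAttempt(initial, max, attempt).toMillis() + "ms");
        }
    }
}
```

With a 1s initial delay the first retry alone adds a full second of latency; at 100ms the first few retries stay well under a second combined.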
```java
        timeToWait = TimeValue.min(endpoint.executeEnqueuedTask(), timeToWait);
    }
    // if we execute a task the timeToWait will be 0 so we'll immediately look for more work
} while (timeToWait.compareTo(TimeValue.ZERO) <= 0);
```
Nice, that was a lot easier than we thought it would be.
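The drain-loop pattern above can be sketched in plain Java. `executeEnqueuedTask` here is a stand-in for the real `RequestExecutorService` method: it returns 0 when a task was run (so the loop immediately looks for more work) and a positive wait time when the queue is empty, avoiding the old approach of rescheduling a fresh thread with a 0ms delay:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class DrainLoopSketch {
    // Hypothetical stand-in for endpoint.executeEnqueuedTask(): runs one queued
    // task and returns 0, or returns the default wait when the queue is empty.
    static long executeEnqueuedTask(Deque<Runnable> queue, long defaultWaitMillis) {
        Runnable task = queue.poll();
        if (task == null) {
            return defaultWaitMillis;
        }
        task.run();
        return 0L;
    }

    public static void main(String[] args) {
        Deque<Runnable> queue = new ArrayDeque<>();
        for (int i = 0; i < 3; i++) {
            int n = i;
            queue.add(() -> System.out.println("ran task " + n));
        }
        long timeToWait;
        do {
            // Keep draining as long as work was done (timeToWait == 0),
            // instead of scheduling a new thread for 0 ms.
            timeToWait = executeEnqueuedTask(queue, 50L);
        } while (timeToWait <= 0);
        System.out.println("sleeping " + timeToWait + "ms before next poll");
    }
}
```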
…33861)
* Adding latency improvements
* Update docs/changelog/133861.yaml
* [CI] Auto commit changes from spotless
* Renaming test executor getter and adding response executor
* [CI] Auto commit changes from spotless
* Address feedback

Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
This PR implements some of the improvements from here: #133263
Notably:
- Added an `inference_response_thread_pool`
- `clientBuilder.disableConnectionState();`
- `EntityUtils.consumeQuietly(response.getEntity());`
- Increased `max_total_connections` to 500 and `max_route_connections` to 200
- Reduced `xpack.inference.http.retry.initial_delay` to 20ms from 1 second
- Looping in `RequestExecutorService` when work was accomplished, instead of scheduling a new thread for 0 ms
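The dedicated response thread pool can be illustrated with a small stdlib-only sketch: response parsing is handed off to a named pool so the HTTP client's IO threads are freed immediately. Everything here (`ResponseThreadPoolSketch`, `parseOnResponsePool`, the pool size of 4) is hypothetical; only the pool name `inference_response_thread_pool` comes from the PR description:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ResponseThreadPoolSketch {
    // Hypothetical stand-in for the new dedicated response pool: parsing runs
    // here instead of on the HTTP client's IO reactor threads.
    static String parseOnResponsePool(String rawBody) throws Exception {
        ExecutorService responsePool = Executors.newFixedThreadPool(4, r -> {
            Thread t = new Thread(r, "inference_response_thread_pool");
            t.setDaemon(true);
            return t;
        });
        try {
            return CompletableFuture
                .supplyAsync(() -> rawBody.trim() + " (parsed on " + Thread.currentThread().getName() + ")", responsePool)
                .get();
        } finally {
            responsePool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parseOnResponsePool(" {\"result\": \"ok\"} "));
    }
}
```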