Optimize AsyncSearchErrorTraceIT to avoid failures#137716
Optimize AsyncSearchErrorTraceIT to avoid failures#137716drempapis merged 17 commits intoelastic:mainfrom
Conversation
|
Pinging @elastic/es-search-foundations (Team:Search Foundations) |
| return; | ||
| } | ||
|
|
||
| // Make sure the .async-search system index is green before deleting it |
There was a problem hiding this comment.
By the time we reach the cleanup phase, the .async-search shard may still be relocating or recovering, which is when shard-lock timeouts are most likely to occur during test teardown. To prevent this, we ensure that the .async-search system index is fully ready and stable before deleting the async search result.
| // check that the stack trace was not sent from the data node to the coordinating node | ||
| ErrorTraceHelper.assertStackTraceCleared(internalCluster()); | ||
| } finally { | ||
| deleteAsyncSearchIfPresent(createAsyncResponseEntity); |
There was a problem hiding this comment.
This makes sense to me, I see a similar thing is done in CCSDuelIT for async searches.
There was a problem hiding this comment.
Thank you, Ben, for the review. Yes, this is actually the most important part of this PR, ensuring that the entry (with id) in the index is deleted before the test reaches the “after test cleanup,” where the exception is thrown.
| value = "org.elasticsearch.xpack.search.MutableSearchResponse:DEBUG,org.elasticsearch.xpack.search.AsyncSearchTask:DEBUG" | ||
| ) | ||
| public class AsyncSearchErrorTraceIT extends ESIntegTestCase { | ||
| public class AsyncSearchErrorTraceIT extends AsyncSearchIntegTestCase { |
There was a problem hiding this comment.
Is this needed for the fix or is it an unrelated improvement?
There was a problem hiding this comment.
This isn’t necessary for this test; I added it to keep it consistent with the other tests in the same package.
| XContentType entityContentType = XContentType.fromMediaType(response.getEntity().getContentType().getValue()); | ||
| return XContentHelper.convertToMap(entityContentType.xContent(), response.getEntity().getContent(), false); | ||
|
|
||
| HttpEntity entity = response.getEntity(); |
There was a problem hiding this comment.
Can you explain what this entity stuff has to do with the test errors?
There was a problem hiding this comment.
I guess nothing! :) I got the idea from ESRestTestCase, to ensure that the connection is released back to the pool regardless of what happens. Upon re-examining this, the ShardLockObtainFailedException is a server-side issue related to the shard lifecycle, and consuming the entity doesn’t affect shard locks, it only impacts client connection reuse. I’ll revert this part.
However, I’m keeping it in the deleteAsyncSearchIfPresent to ensure that the Http connection is fully consumed and returned to the pool before teardown.
…rempapis/elasticsearch into fix/AsyncSearchErrorTraceIT_shard_lock
benchaplin
left a comment
There was a problem hiding this comment.
LGTM, thanks for explaining!
💚 Backport successful
|
…-json * upstream/main: (158 commits) Cleanup files from repo root folder (elastic#138030) Implement OpenShift AI integration for chat completion, embeddings, and reranking (elastic#136624) Optimize AsyncSearchErrorTraceIT to avoid failures (elastic#137716) Removes support for null TransportService in RemoteClusterService (elastic#137939) Mute org.elasticsearch.index.mapper.DateFieldMapperTests testSortShortcuts elastic#138018 rest-api-spec: fix type of enums (elastic#137521) Update Gradle wrapper to 9.2.0 (elastic#136155) Add RCS Strong Verification Documentation (elastic#137822) Use docvalue skippers on dimension fields (elastic#137029) Introduce INDEX_SHARD_COUNT_FORMAT (elastic#137210) Mute org.elasticsearch.xpack.inference.integration.AuthorizationTaskExecutorIT testCreatesChatCompletion_AndThenCreatesTextEmbedding elastic#138012 Fix ES|QL search context creation to use correct results type (elastic#137994) Improve Snapshot Logging (elastic#137470) Support extra output field in TOP function (elastic#135434) Remove NumericDoubleValues class (elastic#137884) [ML] Fix ML calendar event update scalability issues (elastic#136886) Task may be unregistered outside of the trace context in exceptional cases. (elastic#137865) Refine workaround for S3 repo analysis known issue (elastic#138000) Additional DEBUG logging on authc failures (elastic#137941) Cleanup index resolution (elastic#137867) ...
In some previous work (#137078), we introduced changes that mitigated the
.async-searchshard lock issue. While that significantly reduced the frequency of the problem, a few cases still persisted.The error can occur when
HTTPResponse objects are not fully closed, or when persisted.async-searchresults (keep_on_completion=true) are not deleted. Either situation can leave file handles open long enough that the .async-search shard remains locked during cluster teardown, resulting in:This change further addresses the issue by preventing unreleased resources that cause
.async-searchshard lock failures:HttpEntityto release connections..async-searchresults created withkeep_on_completion=trueare now deleted in a finally block after each test.Closes #137150