[ML] Wait for all shards to be active when creating the ML stats index #108202

Merged
maxhniebergall merged 9 commits into elastic:main from davidkyle:do-not-write-stats on Oct 30, 2024

@davidkyle (Member) commented May 2, 2024

Several tests have recently started failing with the dreaded no_shard_available_action_exception when querying the .ml-stats index. The error occurs when an index is accessed immediately after creation, before any of its shards have become active. See #65846 for more details.
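As a rough illustration for anyone reproducing this from outside the test suite, the REST API has two documented parameters that address this window: `wait_for_active_shards` on index creation and `wait_for_status` on cluster health. The sketch below only builds the two requests with the JDK HTTP client and does not send them. The class name, method names, base URL, index name and the 30s timeout are assumptions made for the example, not code from this PR.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Illustrative only: builds REST requests that wait for shards to become active. */
final class ShardWaitRequests {

    /** PUT /<index>?wait_for_active_shards=all asks index creation to wait for every shard copy. */
    static HttpRequest createIndexWaitingForAllShards(String baseUrl, String index) {
        return HttpRequest.newBuilder()
            .uri(URI.create(baseUrl + "/" + index + "?wait_for_active_shards=all"))
            .PUT(HttpRequest.BodyPublishers.noBody())
            .build();
    }

    /** GET /_cluster/health/<index>?wait_for_status=yellow blocks until primaries are active or the timeout expires. */
    static HttpRequest waitForYellow(String baseUrl, String index) {
        return HttpRequest.newBuilder()
            .uri(URI.create(baseUrl + "/_cluster/health/" + index + "?wait_for_status=yellow&timeout=30s"))
            .GET()
            .build();
    }
}
```

Either request would be passed to `HttpClient.send` against a running cluster. The health request is the one that helps when the index was created by another component and the caller has no control over how it was created.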

The recent failures stem from the change to invalidate the model cache in #106988. The cache is invalidated on PUT model, which evicts the cached models, and on eviction each model persists its stats to the .ml-stats index. The failing tests create a model, causing the cache eviction and triggering a write to .ml-stats. Immediately afterwards the test cleanup kicks in, and during model deletion the .ml-stats index is queried, throwing the no_shard_available_action_exception.
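The sequence above can be modelled with a toy example. This is not Elasticsearch code: every class and method name is hypothetical, and the exception is a plain `IllegalStateException` standing in for the real one. It shows only the ordering problem, where an asynchronous write has started creating the index but a query arrives before the shards are active.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

/** Toy model of the race; all names are hypothetical. */
final class StatsRaceSketch {
    private final AtomicBoolean shardsActive = new AtomicBoolean(false);
    private final CountDownLatch allowActivation = new CountDownLatch(1);

    /** Models cache eviction: starts an async stats write that creates the index. */
    CompletableFuture<Void> evictModel() {
        return CompletableFuture.runAsync(() -> {
            try {
                allowActivation.await(); // shards stay inactive until released
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            shardsActive.set(true);
        });
    }

    /** Lets the simulated shards become active. */
    void activateShards() {
        allowActivation.countDown();
    }

    /** Models the stats query issued during model deletion in test cleanup. */
    String queryStats() {
        if (!shardsActive.get()) {
            throw new IllegalStateException("no_shard_available_action_exception");
        }
        return "ok";
    }
}
```

Calling `queryStats()` straight after `evictModel()` throws, while calling it after `activateShards()` and `join()` on the returned future succeeds. That difference is the gap the cleanup has to wait out.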

The first part of the fix is to wait for all shards to become active when creating the index, but that isn't sufficient in itself because the writes to .ml-stats are asynchronous. Creating or deleting a model kicks off the write, but it may not have completed, so the post-test cleanup needs to wait for the .ml-stats index to initialise.
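A minimal sketch of the "wait in cleanup" idea, assuming the cleanup can obtain the index health from some source. The helper below is hypothetical and is not the code added by this PR. It polls a supplied health check until the index is no longer red, meaning at least the primary shards are active, or a timeout elapses.

```java
import java.time.Duration;
import java.util.function.Supplier;

/** Hypothetical polling helper; names and thresholds are assumptions for illustration. */
final class StatsIndexWaiter {

    /** Health colours as reported by the cluster health API. */
    enum Health { RED, YELLOW, GREEN }

    /**
     * Polls healthSupplier until it reports YELLOW or GREEN, or until the timeout elapses.
     * Returns true if the index became available in time, false on timeout or interruption.
     */
    static boolean awaitActive(Supplier<Health> healthSupplier, Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (healthSupplier.get() != Health.RED) {
                return true; // primaries are active, so queries can be served
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // gave up waiting
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }
}
```

A cleanup routine would call `awaitActive` before querying or deleting anything that touches the stats index. In a real test the supplier would wrap a cluster health request, and the cluster health API can also do this waiting on the server side.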

Closes #106652
Closes #107815
Closes #107777
Closes #107505
Closes #80703

@davidkyle added the >test, :ml, v8.14.0 and v8.15.0 labels May 2, 2024
@elasticsearchmachine added the Team:ML label May 2, 2024
@elasticsearchmachine (Collaborator) commented

Pinging @elastic/ml-core (Team:ML)

@davidkyle davidkyle changed the title Wait for all shards to be active when creating the ML stats index May 2, 2024
@przemekwitek przemekwitek self-requested a review May 6, 2024 13:32

@przemekwitek przemekwitek left a comment

LGTM

maxhniebergall and others added 3 commits July 25, 2024 16:44
Co-authored-by: Pat Whelan <pat.whelan@elastic.co>
…idkyle/do-not-write-stats

# Conflicts:
#	x-pack/plugin/core/src/test/java/org/elasticsearch/xpack/core/ml/utils/MlIndexAndAliasTests.java
#	x-pack/plugin/src/yamlRestTest/resources/rest-api-spec/test/ml/inference_crud.yml
#	x-pack/plugin/src/yamlRestTest/resources/rest-api-spec/test/ml/inference_processor.yml
@maxhniebergall maxhniebergall requested a review from prwhelan July 25, 2024 21:14
@maxhniebergall (Contributor) commented

This fix seemed super close to being done, so I just merged main into it and added Pat's change. I was able to reproduce the test failures before this change, and not after, so everything LGTM

@maxhniebergall (Contributor) commented

@elasticmachine merge upstream

@maxhniebergall added the v8.17.0 and auto-backport labels and removed the v8.14.0 label Oct 29, 2024
@maxhniebergall maxhniebergall merged commit 6b5f6fb into elastic:main Oct 30, 2024
davidkyle added a commit to davidkyle/elasticsearch that referenced this pull request Oct 30, 2024
elastic#108202)

* Wait for all shards to be active when creating the ML stats index

* Unmute tests

* Wait for the stats index in cleanup

* more waiting for the stats index

* Add adminclient to ensureHealth

Co-authored-by: Pat Whelan <pat.whelan@elastic.co>

* fix errors causing build failures

---------

Co-authored-by: Max Hniebergall <137079448+maxhniebergall@users.noreply.github.com>
Co-authored-by: Pat Whelan <pat.whelan@elastic.co>
Co-authored-by: Max Hniebergall <max.hniebergall@elastic.co>
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
@elasticsearchmachine (Collaborator) commented

💚 Backport successful

Status Branch Result
8.x
jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request Nov 4, 2024
elastic#108202)


Labels

auto-backport, :ml, Team:ML, >test, v8.17.0, v9.0.0

7 participants