Trigger merges after recovery #113102
Conversation
We may have shut a shard down while merges were still pending (or adjusted the merge policy while the shard was down) meaning that after recovery its segments do not reflect the desired state according to the merge policy. With this commit we invoke `IndexWriter#maybeMerge()` at the end of recovery to check for, and execute, any such lost merges.
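The mechanism can be sketched with a toy model. None of these names are the actual Lucene/Elasticsearch APIs — `MergePolicy`, `Shard`, and the `maybeMerge` below are illustrative stand-ins: at the end of recovery the shard asks the policy whether its on-disk segments still match the desired state, and executes any merges that were lost at shutdown.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the recovery flow described above; the real code
// calls Lucene's IndexWriter#maybeMerge() at the end of recovery.
class PostRecoveryMergeSketch {

    // Stand-in for a merge policy that may have changed while the shard was down.
    interface MergePolicy {
        boolean wantsMerge(List<Integer> segmentSizes);
    }

    static class Shard {
        final List<Integer> segmentSizes = new ArrayList<>();
        int mergesExecuted = 0;

        // Stand-in for IndexWriter#maybeMerge(): ask the policy whether the
        // current segments need merging, and if so execute those merges.
        void maybeMerge(MergePolicy policy) {
            while (policy.wantsMerge(segmentSizes) && segmentSizes.size() > 1) {
                int merged = segmentSizes.remove(0) + segmentSizes.remove(0);
                segmentSizes.add(merged);
                mergesExecuted++;
            }
        }

        // The change in this PR: at the end of recovery, check for merges
        // that were still pending when the shard shut down.
        void finishRecovery(MergePolicy policy) {
            maybeMerge(policy);
        }
    }

    public static void main(String[] args) {
        // Policy: merge whenever there are more than two segments.
        MergePolicy policy = segs -> segs.size() > 2;

        Shard shard = new Shard();
        // Segments as they were on disk when the shard shut down mid-merge.
        shard.segmentSizes.addAll(List.of(10, 10, 10, 10));

        shard.finishRecovery(policy);

        // After recovery the segments again reflect the policy's desired state.
        System.out.println(shard.segmentSizes.size()); // prints "2"
        System.out.println(shard.mergesExecuted);      // prints "2"
    }
}
```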
Pinging @elastic/es-distributed (Team:Distributed)

Hi @DaveCTurner, I've created a changelog YAML for you.
        return recoveryListener;
    }

    logger.trace(Strings.format("wrapping listener for post-recovery merge of [%s]", shardId));

Suggested change:

    logger.trace(() -> Strings.format("wrapping listener for post-recovery merge of [%s]", shardId));

(same remark for other log traces)
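The reason for the suggested lambda form can be shown with a minimal sketch (a hypothetical two-overload logger, not the actual Elasticsearch `Logger`/`Strings` API): with the eager overload the message string is built even when TRACE is disabled, while the `Supplier` overload defers formatting until the level check has passed.

```java
import java.util.function.Supplier;

// Minimal sketch, assuming a logger with both an eager and a lazy overload;
// these are illustrative stand-ins, not the real logging API.
class LazyTraceSketch {
    static int formatCalls = 0;
    static boolean traceEnabled = false;

    // Counts invocations so we can observe when formatting actually happens.
    static String format(String pattern, Object arg) {
        formatCalls++;
        return pattern.replace("[%s]", "[" + arg + "]");
    }

    // Eager overload: the message argument is evaluated before the call.
    static void trace(String message) {
        if (traceEnabled) System.out.println(message);
    }

    // Lazy overload: the supplier runs only if TRACE is enabled.
    static void trace(Supplier<String> message) {
        if (traceEnabled) System.out.println(message.get());
    }

    public static void main(String[] args) {
        // Formats the string even though TRACE is off.
        trace(format("wrapping listener for post-recovery merge of [%s]", "shard-0"));
        // Skips formatting entirely.
        trace(() -> format("wrapping listener for post-recovery merge of [%s]", "shard-0"));
        System.out.println(formatCalls); // prints "1": only the eager call paid the cost
    }
}
```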
server/src/main/java/org/elasticsearch/indices/IndicesService.java
henningandersen left a comment:
Looks good, one question on how to ensure a flush happens.
    ensureGreen(indexName);
    assertBusy(() -> {
        refresh(indexName); // pick up the result of any merges
I wonder if we need to let afterMerge set active to true to ensure that we always get a flush after a merge? I think otherwise we risk the flush happening too early, before the merge completes (especially if processing is lagging due to running on one thread), with no further flush after the merge.
And if we do that, we should probably change this to call flushOnIdle instead?
> ensure that we get a flush after a merge always?
That sounds sensible to me but I'm not sure if it could have any bad consequences. Why don't we do it already?
Note that we already flush just before snapshotting, and before relocating the shard in stateless ES. Not saying that we shouldn't try and get this to flush in flushOnIdle too, but maybe we can think about that in a follow-up?
Thanks, that helps. I also now realize that the check here is more than likely to kick in if the active->inactive flush already occurred, making this more of a benign race condition than something that would happen repeatedly.
I wonder if it would be better to set indices.memory.shard_inactive_time=0 in this test and then avoid the refresh here, demonstrating that it will indeed flush after the merge?
Sure, yes, that makes sense.
ywangd left a comment:
I have a question about the throttling behaviour. I'd appreciate it if you could help me understand it better. Thanks!
        return;
    }

    indexShard.triggerPendingMerges();
This ultimately calls IndexWriter#maybeMerge, which kicks off the actual merges on the "Lucene Merge Thread". It does not wait for the merges to complete. So my understanding is that we can kick off merges for multiple shards even with a throttled task runner of a single thread? Did I miss something, or is this intended?
That's intended, see comment on ES-9313. I copied this info into a code comment in e70887c.
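That behaviour can be sketched with plain `ExecutorService`s standing in for the throttled task runner and the Lucene merge threads (illustrative only, not the real Elasticsearch classes): because the runner's task merely hands the merge to another thread and returns, a single-threaded runner can still have several merges in flight at once.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch: a single-threaded "throttled runner" that only *kicks off* each
// merge, so merges for multiple shards overlap despite the one-thread limit.
class ThrottledKickoffSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService runner = Executors.newSingleThreadExecutor();   // the throttled runner
        ExecutorService mergeThreads = Executors.newCachedThreadPool(); // stand-in merge threads

        int shards = 3;
        // Reaches zero only when all three merges are running at the same time.
        CountDownLatch allMergesRunning = new CountDownLatch(shards);

        for (int i = 0; i < shards; i++) {
            runner.execute(() -> {
                // Like IndexWriter#maybeMerge: does not wait for the merge,
                // just hands it to a merge thread and returns immediately.
                mergeThreads.execute(() -> {
                    allMergesRunning.countDown();
                    try {
                        // Block until all three merges overlap.
                        allMergesRunning.await();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            });
        }

        // True: three merges ran concurrently despite the single-threaded runner.
        System.out.println(allMergesRunning.await(5, TimeUnit.SECONDS));
        runner.shutdown();
        mergeThreads.shutdown();
    }
}
```

If the runner instead ran each merge to completion itself, the latch could never reach zero, since each blocked merge would occupy the runner's only thread.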
    public void triggerPendingMerges() throws IOException {
        switch (state /* single volatile read */) {
            case STARTED, POST_RECOVERY -> getEngine().forceMerge(false, ForceMergeRequest.Defaults.MAX_NUM_SEGMENTS, false, null);
Nit: can we add a comment about passing flush=false and null for the UUID? My understanding is that they do not make sense when the downstream code calls IndexWriter#maybeMerge, but that is not obvious from here.
Also for my knowledge: Since we don't flush actively, I guess it relies on flushes triggered by other mechanisms, such as scheduled refresh and indexing disk and memory controllers and maybe other things?
More comments in c07131b.
Henning and I are still discussing how to ensure we flush the result of the merge.
...erTest/java/org/elasticsearch/cluster/routing/allocation/decider/DiskThresholdDeciderIT.java
    protected BroadcastResponse forceMerge() {
        waitForRelocation();
    -   BroadcastResponse actionGet = indicesAdmin().prepareForceMerge().setMaxNumSegments(1).get();
    +   BroadcastResponse actionGet = indicesAdmin().prepareForceMerge().setMaxNumSegments(1).setFlush(true).get();
I think flush=true is the default anyway; not sure why this change is necessary?
Ah sorry, that was left over from debugging the DiskThresholdDeciderIT failure; not necessary indeed. I'll remove it.