Fix enrich caches outdated value after policy run#133680
Fix enrich caches outdated value after policy run#133680joegallo merged 9 commits intoelastic:mainfrom
Conversation
|
Pinging @elastic/es-data-management (Team:Data Management) |
|
Hi @joegallo, I've created a changelog YAML for you. |
jbaiera
left a comment
There was a problem hiding this comment.
Mostly a question on how the search runner updates after a policy execution
| private SearchRunner createSearchRunner(final ProjectMetadata project, final String indexAlias) { | ||
| final Client originClient = new OriginSettingClient(client, ENRICH_ORIGIN); | ||
| return (value, maxMatches, reqSupplier, handler) -> { | ||
| final String concreteEnrichIndex = getEnrichIndexKey(project, indexAlias); |
There was a problem hiding this comment.
We capture this enrich index name here when we create the processor, but how will this get updated if we execute an enrich policy and a new index is created?
There was a problem hiding this comment.
It's all lunatic higher order functions (I say with the highest compliments) -- we capture the client at createSearchRunner invocation time, that is, when a processor is created. But the search runner itself doesn't run until it's invoked for some document in AbstractEnrichProcessor#execute and that's when the concreteEnrichIndex is finally set (and of course the handler is another functional argument so this is all clear as mud when one reads the code).
There was a problem hiding this comment.
This search runner captures the project metadata to extract that concrete enrich index name, but it only does this at processor creation time. Is there a possibility that when we execute an enrich policy, thus updating the concrete index to use, but this processor isn't recreated when that happens? It looks like we only create a processor if it's configuration changes after a cluster state update.
There was a problem hiding this comment.
Yeahhhhhhhhhhhh... to be clear, that problem existed before this, though, right? I mean, it seems like it does work... but I must admit I'm not sure I see how it does work just yet.
| protected Settings nodeSettings() { | ||
| return Settings.builder() | ||
| // TODO Change this to run with security enabled | ||
| // https://github.com/elastic/elasticsearch/issues/75940 |
There was a problem hiding this comment.
I've already added this new test to #75940.
💔 Backport failed
You can use sqren/backport to manually backport by running |
|
Backports are up, so I'm dropping the backport pending label. |
There's a (likely very rare) race condition in the way we update the enrich alias that can result in outdated entries being stored in the cache. As a workaround, if you execute the policy a second time (without having changed the data in the source index), the cache will certainly contain the correct information.
We update the cache key based on the concrete index name associated with the enrich alias based on the cluster state that we see in a cluster state applier. However, we execute searches (and cache the found values) based on the concrete index name associated with the enrich alias based on the current cluster state (after listeners and appliers have run -- I'm a little squishy on exactly this detail). Suffice it to say, there's a race where if there's a slow enough cluster state applier, then it's possible that we run the search against the old enrich index, but we cache the value as if it was for the new enrich index. That is, it results in an outdated entry being stored in the cache (😱).
The fix is to stop searching through the alias. Instead, we use the same dereferencing logic to ensure that the index that we record the cache entry for is also the concrete index that we searched against.