Retrieve routing hash from synthetic id for translog operations#140221
Retrieve routing hash from synthetic id for translog operations#140221tlrx merged 7 commits intoelastic:mainfrom
Conversation
In a TSDB datastream shard that uses synthetic id, the operations that are read from the translog have no routing set. They should have a routing computed from the routing hash stored as a suffix in the synthetic id. This change fix the routing hash to be retrieved correctly and also adds a test. Relates ES-13606
|
Pinging @elastic/es-storage-engine (Team:StorageEngine) |
|
Hi @tlrx, I've created a changelog YAML for you. |
| assertThat(luceneSnapshot.totalOperations(), equalTo(docsIdsBySeqNo.size())); | ||
| // TODO Once ES-13603 is implemented, change this to also check operations (and maybe tombstone doc too?) | ||
| if (docsIdsBySeqNo.isEmpty() == false) { | ||
| expectThrows(NullPointerException.class, luceneSnapshot::next); |
There was a problem hiding this comment.
This test will also be useful for ES-13603 , shown here as an expected issue: the stored _id field is not materialized yet and an NPE is thrown.
fcofdez
left a comment
There was a problem hiding this comment.
LGTM. Subtle stuff... thanks for fixing this!
...a-streams/src/internalClusterTest/java/org/elasticsearch/datastreams/TSDBSyntheticIdsIT.java
Show resolved
Hide resolved
| ); | ||
| if (context.indexSettings().useTimeSeriesSyntheticId()) { | ||
| int hash = TsidExtractingIdFieldMapper.extractRoutingHashFromSyntheticId(Uid.encodeId(context.sourceToParse().id())); | ||
| routingHash = encode(hash); |
There was a problem hiding this comment.
I understand that this got really complex, but I find it a bit confusing that we're using two encode methods for the routing hash (synthetic and the regular one using the Uid.encode), not much that we can do... :(
There was a problem hiding this comment.
Just to be sure we share the same understanding:
- Base64 encoding is used to encode/decode the routing hash integer as a routing
Stringkey in several places (mostly when routing operations or parsing documents from source like here) - Base64 encoding is also used to encode the (regular or synthetic) id from
byte[]array toStringand vice-versa Uid.encodeandUid.decodeare used to encode the (regular or synthetic)Stringid into the binary data stored in Lucene (asbyte[]array too inBytesRef)
Here we need to extract the routing hash integer value back from the (regular or synthetic) String id. In order to have only one "extract routing hash from id" method I reused TsidExtractingIdFieldMapper.extractRoutingHashFromSyntheticId but it means that I have to Uid.encode the String back to its Lucene represention.
Maybe it would be more readable if I only read the last 4 bytes of the id (like regular id do)?
There was a problem hiding this comment.
That's my understanding too, but I would prefer to keep it as it's so we try to centralize the knowledge on how we encode the _id.
| "Lucene document [" + expectedDocId + "] has wrong value for _ts_routing_hash field", | ||
| luceneDocument.getField(TimeSeriesRoutingHashFieldMapper.NAME).binaryValue(), | ||
| equalTo( | ||
| Uid.encodeId( |
There was a problem hiding this comment.
Maybe we can extract this to a helper method, feels like inception ⏳ haha
|
|
||
| Translog.Operation operation; | ||
| while ((operation = translogSnapshot.next()) != null) { | ||
| if (operation instanceof Translog.Index index) { |
There was a problem hiding this comment.
nit: we could use the Translog.Operation#opType and a switch instead?
| Arrays.copyOf(Base64.getUrlDecoder().decode(context.sourceToParse().id()), 4) | ||
| ); | ||
| if (context.indexSettings().useTimeSeriesSyntheticId()) { | ||
| int hash = TsidExtractingIdFieldMapper.extractRoutingHashFromSyntheticId(Uid.encodeId(context.sourceToParse().id())); |
There was a problem hiding this comment.
shouldn't we decode the base64 here?
There was a problem hiding this comment.
I don't think so. The String id is already encoded as Base64 and we encoded it using Uid.encodeId to have the binary data that would be indexed in Lucene and that the extractRoutingHashFromSyntheticId expects.
See my other comment here, I get how it can be confusing.
There was a problem hiding this comment.
🤦 I was fooled by the encodeId name, which looking into the implementation does decode the base64.
| @@ -107,7 +107,7 @@ public static ParsedDocument deleteTombstone( | |||
| // _id terms over tombstones also work as if a regular _id field was present. | |||
There was a problem hiding this comment.
I've noticed that we could avoid the Uid.encode in:
document.add(IdFieldMapper.syntheticIdField(id));
Since we have the uid already?
…est' into 2026/01/06/ES-13603-test
|
Thanks Francisco! |
In a TSDB datastream shard that uses synthetic id, the operations that are read from the translog have no routing set. They should have a routing computed from the routing hash stored as a suffix in the synthetic id.
This change fix the routing hash to be retrieved correctly and also adds a test.
Relates ES-13606