Use sub keyword block loader with ignore_above for text fields #140622
dnhatn merged 6 commits into elastic:main
    }

    private BlockLoader nonDelegateBlockLoader(BlockLoaderContext blContext) {
        // 2. check if we can load from a parent field
Nit: move this above the function.
Well, the numbering is off. We can just remove the numbers from all comments.
kkrik-es left a comment:
Looks good. Martijn has a better view of the work in text fields, so I'll let him approve.
martijnvg left a comment:
Thanks Nhat, I like this solution!
In a follow-up, we can look into using a block loader that uses the ignored values stored in binary doc values (in main only) as a fallback, instead of falling back to source.
 * (under the limit) and doc-2 has the value "bcd..." (exceeds the limit), we can load doc-1 from the doc_values
 * of keyword field and doc-2 from the slower stored fields.
 */
abstract class ConditionalBlockLoader implements BlockLoader {

        return fallbackLoader;
    }

    private BlockLoader nonDelegateBlockLoader(BlockLoaderContext blContext) {

    public boolean supportsOrdinals() {
        return false;
    }

    @Override
    public SortedSetDocValues ordinals(LeafReaderContext context) throws IOException {
        return null;
    }
Maybe in a follow-up PR, these methods can be removed? I don't see them being used anymore.
Yes, I will do that. I also think we need to apply this change to other text-family types.
@martijnvg @kkrik-es Thanks for reviewing!
Backports: #140787, #140789 (cherry picked from commit f15069a)
Today, we do not use the block loader of the sub-keyword field when loading the text field if ignore_above is set. When ignore_above is configured, values exceeding the threshold are not stored for the keyword field, which is why we cannot load from the sub-keyword field alone. However, if all documents in a segment have values below the threshold, we can safely load values from doc_values instead of stored fields. If some documents exceed the threshold, we should load values from doc_values for those below the threshold and from stored fields for those above.

This PR leverages the terms dictionary from the _ignored field to prefer loading values from doc_values of the sub-keyword field when possible. For any document where the sub-keyword field appears in the _ignored dictionary, we load from stored fields or _source; otherwise, we use doc_values. This improves performance when loading text fields, especially for logsdb.

There is a bug with FLS where we blindly delegate to the sub-keyword field, but it may be hidden by FLS. I will address this in a follow-up.

Marking this as a bug fix for performance issues.
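For readers who want the per-document choice spelled out, here is a minimal sketch of the idea described above. It is not the code from this PR: ConditionalLoadSketch, Segment, and load are hypothetical names, and a segment is modeled with plain maps instead of Lucene doc_values, stored fields, and the _ignored terms dictionary. It only assumes that we can tell, per document, whether the sub-keyword field was ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical toy model; none of these names exist in Elasticsearch. */
final class ConditionalLoadSketch {

    /** Stand-in for one segment (assumption: plain maps replace the Lucene structures). */
    static final class Segment {
        final Map<Integer, String> keywordDocValues; // docId -> value kept by the sub-keyword field
        final Map<Integer, String> storedFields;     // docId -> full text value (slower path)
        final Set<Integer> ignoredDocs;              // docs whose sub-keyword value exceeded ignore_above

        Segment(Map<Integer, String> keywordDocValues, Map<Integer, String> storedFields, Set<Integer> ignoredDocs) {
            this.keywordDocValues = keywordDocValues;
            this.storedFields = storedFields;
            this.ignoredDocs = ignoredDocs;
        }
    }

    /** Loads one value per requested doc, preferring the fast doc_values path. */
    static List<String> load(Segment segment, int[] docIds) {
        List<String> out = new ArrayList<>(docIds.length);
        // Segment-level shortcut: if nothing was ignored, every doc can use doc_values.
        boolean anyIgnored = !segment.ignoredDocs.isEmpty();
        for (int docId : docIds) {
            if (anyIgnored && segment.ignoredDocs.contains(docId)) {
                // The keyword field dropped this value, so fall back to the slower source.
                out.add(segment.storedFields.get(docId));
            } else {
                out.add(segment.keywordDocValues.get(docId));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // doc-1 is under the limit, doc-2 exceeds it and is therefore marked as ignored.
        Segment segment = new Segment(
            Map.of(1, "abc"),
            Map.of(1, "abc", 2, "bcdefgh"),
            Set.of(2)
        );
        System.out.println(load(segment, new int[] { 1, 2 }));
    }
}
```

In this sketch doc-1 is served from the keyword map and doc-2 from the stored-fields map, mirroring the doc-1/doc-2 example in the ConditionalBlockLoader Javadoc quoted in the review.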