Skip to content

Use sub keyword block loader with ignore_above for text fields#140622

Merged
dnhatn merged 6 commits intoelastic:mainfrom
dnhatn:fallback-reader
Jan 15, 2026
Merged

Use sub keyword block loader with ignore_above for text fields#140622
dnhatn merged 6 commits intoelastic:mainfrom
dnhatn:fallback-reader

Conversation

@dnhatn
Copy link
Member

@dnhatn dnhatn commented Jan 13, 2026

Today, we do not use the block loader of the sub-keyword field when loading the text field if ignore_above is set. When ignore_above is configured, values exceeding the threshold are not stored for the keyword field, which is why we cannot load from the sub-keyword field alone. However, if all documents in a segment have values below the threshold, we can safely load values from doc_values instead of stored fields. If some documents exceed the threshold, we should load values from doc_values for those below the threshold and from stored fields for those above.

This PR leverages the terms dictionary from the _ignored field to prefer loading values from doc_values of the sub-keyword field when possible. For any document where the sub-keyword field appears in the _ignored dictionary, we load from stored fields or _source; otherwise, we use doc_values. This improves performance when loading text fields, especially for logsdb.

There is a bug with FLS where we blindly delegate the sub-keyword field, but it may be hidden by FLS. I will address this in a follow-up.

Marking this as a bug fix for performance issues.

@dnhatn dnhatn closed this Jan 13, 2026
@dnhatn dnhatn reopened this Jan 14, 2026
@dnhatn dnhatn force-pushed the fallback-reader branch 10 times, most recently from 6e706dd to a6af71c Compare January 15, 2026 00:56
@dnhatn dnhatn changed the title WIP - loader Jan 15, 2026
@dnhatn dnhatn changed the title Use sub keyword block loader ignore_above for text fields Jan 15, 2026
@dnhatn dnhatn added :StorageEngine/ES|QL Timeseries / metrics / PromQL / logsdb capabilities in ES|QL >bug v9.3.1 v9.2.5 auto-backport Automatically create backport pull requests when merged labels Jan 15, 2026
@dnhatn dnhatn requested a review from martijnvg January 15, 2026 02:56
@elasticsearchmachine
Copy link
Collaborator

Hi @dnhatn, I've created a changelog YAML for you.

@dnhatn dnhatn requested a review from kkrik-es January 15, 2026 04:26
@dnhatn dnhatn marked this pull request as ready for review January 15, 2026 04:27
@dnhatn dnhatn requested a review from nik9000 January 15, 2026 04:27
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

}

private BlockLoader nonDelegateBlockLoader(BlockLoaderContext blContext) {
// 2. check if we can load from a parent field
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: move this above the function.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well numbering is off.. we can just remove them from all comments.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in 8fb63c3

Copy link
Contributor

@kkrik-es kkrik-es left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, Martijn has a better view of the work in text fields so I'll let him approve.

Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Nhat, I like this solution!

In a followup, we can look into use a block loader that uses the ignored values that are stored in binary doc values (in main only) as fallback, instead of falling back to source.

* (under the limit) and doc-2 has the value "bcd..." (exceeds the limit), we can load doc-1 from the doc_values
* of keyword field and doc-2 from the slower stored fields.
*/
abstract class ConditionalBlockLoader implements BlockLoader {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this new construct!

return fallbackLoader;
}

private BlockLoader nonDelegateBlockLoader(BlockLoaderContext blContext) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, we can still fallback to using source here. So I think the #140687 backport pr still makes sense?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the backport of #140687 is still very important.

Comment on lines +392 to +399
public boolean supportsOrdinals() {
return false;
}

@Override
public SortedSetDocValues ordinals(LeafReaderContext context) throws IOException {
return null;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe in a follow up pr, these methods can be removed? I don't see this being used any more?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ will do.

@dnhatn
Copy link
Member Author

dnhatn commented Jan 15, 2026

In a followup, we can look into use a block loader that uses the ignored values that are stored in binary doc values (in main only) as fallback, instead of falling back to source.

Yes, I will do that. I also think we need to apply this change to other text-family types.

@dnhatn
Copy link
Member Author

dnhatn commented Jan 15, 2026

@martijnvg @kkrik-es Thanks for reviewing!

@dnhatn dnhatn merged commit f15069a into elastic:main Jan 15, 2026
34 of 35 checks passed
@dnhatn dnhatn deleted the fallback-reader branch January 15, 2026 21:05
@elasticsearchmachine
Copy link
Collaborator

💔 Backport failed

Status Branch Result
9.3 Commit could not be cherrypicked due to conflicts
9.2 Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 140622

dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Jan 15, 2026
…ic#140622)

Today, we do not use the block loader of the sub-keyword field when
loading the text field if ignore_above is set. When ignore_above is
configured, values exceeding the threshold are not stored for the
keyword field, which is why we cannot load from the sub-keyword field
alone. However, if all documents in a segment have values below the
threshold, we can safely load values from doc_values instead of stored
fields. If some documents exceed the threshold, we should load values
from doc_values for those below the threshold and from stored fields for
those above.

This PR leverages the terms dictionary from the _ignored field to prefer
loading values from doc_values of the sub-keyword field when possible.
For any document where the sub-keyword field appears in the _ignored
dictionary, we load from stored fields or _source; otherwise, we use
doc_values. This improves performance when loading text fields,
especially for logsdb.

There is a bug with FLS where we blindly delegate the sub-keyword field,
but it may be hidden by FLS. I will address this in a follow-up.

Marking this as a bug fix for performance issues.

(cherry picked from commit f15069a)
@dnhatn
Copy link
Member Author

dnhatn commented Jan 15, 2026

💚 All backports created successfully

Status Branch Result
9.3
9.2

Questions ?

Please refer to the Backport tool documentation

dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Jan 15, 2026
…ic#140622)

Today, we do not use the block loader of the sub-keyword field when
loading the text field if ignore_above is set. When ignore_above is
configured, values exceeding the threshold are not stored for the
keyword field, which is why we cannot load from the sub-keyword field
alone. However, if all documents in a segment have values below the
threshold, we can safely load values from doc_values instead of stored
fields. If some documents exceed the threshold, we should load values
from doc_values for those below the threshold and from stored fields for
those above.

This PR leverages the terms dictionary from the _ignored field to prefer
loading values from doc_values of the sub-keyword field when possible.
For any document where the sub-keyword field appears in the _ignored
dictionary, we load from stored fields or _source; otherwise, we use
doc_values. This improves performance when loading text fields,
especially for logsdb.

There is a bug with FLS where we blindly delegate the sub-keyword field,
but it may be hidden by FLS. I will address this in a follow-up.

Marking this as a bug fix for performance issues.

(cherry picked from commit f15069a)
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Jan 16, 2026
…ic#140622)

Today, we do not use the block loader of the sub-keyword field when
loading the text field if ignore_above is set. When ignore_above is
configured, values exceeding the threshold are not stored for the
keyword field, which is why we cannot load from the sub-keyword field
alone. However, if all documents in a segment have values below the
threshold, we can safely load values from doc_values instead of stored
fields. If some documents exceed the threshold, we should load values
from doc_values for those below the threshold and from stored fields for
those above.

This PR leverages the terms dictionary from the _ignored field to prefer
loading values from doc_values of the sub-keyword field when possible.
For any document where the sub-keyword field appears in the _ignored
dictionary, we load from stored fields or _source; otherwise, we use
doc_values. This improves performance when loading text fields,
especially for logsdb.

There is a bug with FLS where we blindly delegate the sub-keyword field,
but it may be hidden by FLS. I will address this in a follow-up.

Marking this as a bug fix for performance issues.

(cherry picked from commit f15069a)
elasticsearchmachine pushed a commit that referenced this pull request Jan 16, 2026
…) (#140787)

Today, we do not use the block loader of the sub-keyword field when
loading the text field if ignore_above is set. When ignore_above is
configured, values exceeding the threshold are not stored for the
keyword field, which is why we cannot load from the sub-keyword field
alone. However, if all documents in a segment have values below the
threshold, we can safely load values from doc_values instead of stored
fields. If some documents exceed the threshold, we should load values
from doc_values for those below the threshold and from stored fields for
those above.

This PR leverages the terms dictionary from the _ignored field to prefer
loading values from doc_values of the sub-keyword field when possible.
For any document where the sub-keyword field appears in the _ignored
dictionary, we load from stored fields or _source; otherwise, we use
doc_values. This improves performance when loading text fields,
especially for logsdb.

There is a bug with FLS where we blindly delegate the sub-keyword field,
but it may be hidden by FLS. I will address this in a follow-up.

Marking this as a bug fix for performance issues.

(cherry picked from commit f15069a)
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Jan 16, 2026
…ic#140622)

Today, we do not use the block loader of the sub-keyword field when
loading the text field if ignore_above is set. When ignore_above is
configured, values exceeding the threshold are not stored for the
keyword field, which is why we cannot load from the sub-keyword field
alone. However, if all documents in a segment have values below the
threshold, we can safely load values from doc_values instead of stored
fields. If some documents exceed the threshold, we should load values
from doc_values for those below the threshold and from stored fields for
those above.

This PR leverages the terms dictionary from the _ignored field to prefer
loading values from doc_values of the sub-keyword field when possible.
For any document where the sub-keyword field appears in the _ignored
dictionary, we load from stored fields or _source; otherwise, we use
doc_values. This improves performance when loading text fields,
especially for logsdb.

There is a bug with FLS where we blindly delegate the sub-keyword field,
but it may be hidden by FLS. I will address this in a follow-up.

Marking this as a bug fix for performance issues.

(cherry picked from commit f15069a)
elasticsearchmachine pushed a commit that referenced this pull request Jan 16, 2026
…) (#140789)

Today, we do not use the block loader of the sub-keyword field when
loading the text field if ignore_above is set. When ignore_above is
configured, values exceeding the threshold are not stored for the
keyword field, which is why we cannot load from the sub-keyword field
alone. However, if all documents in a segment have values below the
threshold, we can safely load values from doc_values instead of stored
fields. If some documents exceed the threshold, we should load values
from doc_values for those below the threshold and from stored fields for
those above.

This PR leverages the terms dictionary from the _ignored field to prefer
loading values from doc_values of the sub-keyword field when possible.
For any document where the sub-keyword field appears in the _ignored
dictionary, we load from stored fields or _source; otherwise, we use
doc_values. This improves performance when loading text fields,
especially for logsdb.

There is a bug with FLS where we blindly delegate the sub-keyword field,
but it may be hidden by FLS. I will address this in a follow-up.

Marking this as a bug fix for performance issues.

(cherry picked from commit f15069a)
spinscale pushed a commit to spinscale/elasticsearch that referenced this pull request Jan 21, 2026
…ic#140622)

Today, we do not use the block loader of the sub-keyword field when 
loading the text field if ignore_above is set. When ignore_above is
configured, values exceeding the threshold are not stored for the
keyword field, which is why we cannot load from the sub-keyword field
alone. However, if all documents in a segment have values below the
threshold, we can safely load values from doc_values instead of stored
fields. If some documents exceed the threshold, we should load values
from doc_values for those below the threshold and from stored fields for
those above.

This PR leverages the terms dictionary from the _ignored field to prefer 
loading values from doc_values of the sub-keyword field when possible.
For any document where the sub-keyword field appears in the _ignored
dictionary, we load from stored fields or _source; otherwise, we use
doc_values. This improves performance when loading text fields,
especially for logsdb.

There is a bug with FLS where we blindly delegate the sub-keyword field, 
but it may be hidden by FLS. I will address this in a follow-up.

Marking this as a bug fix for performance issues.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-backport Automatically create backport pull requests when merged >bug :StorageEngine/ES|QL Timeseries / metrics / PromQL / logsdb capabilities in ES|QL Team:StorageEngine v9.2.5 v9.3.1 v9.4.0

4 participants