Skip to content

Create new block when filter OrdinalBytesRefBlock#136444

Merged
dnhatn merged 5 commits intoelastic:mainfrom
dnhatn:fix-filter-ordinals
Oct 15, 2025
Merged

Create new block when filter OrdinalBytesRefBlock#136444
dnhatn merged 5 commits intoelastic:mainfrom
dnhatn:fix-filter-ordinals

Conversation

@dnhatn
Copy link
Member

@dnhatn dnhatn commented Oct 11, 2025

Currently, when filtering OrdinalBytesRefBlock and OrdinalBytesRefVector, we reuse the existing dictionary and filter only the ordinals. Although this is semantically correct, it can break HashAggregationOperator when ordinal optimizations are enabled in BlockHash.

Example: given an incoming OrdinalBytesRefBlock after filtering:

  • dict: [apple, banana, orange]
  • ordinals: [0, 0, 0, 0, 1, 1]

Here, orange is not referenced. During grouping and hashing, all dictionary entries [apple, banana, orange] are hashed with group IDs [1, 2, 3] (0 reserved for null), and the block produces groupings [1, 1, 1, 1, 2, 2]. The output is correct, but when evaluating partial/final results, we cannot exclude orange (groupId=3) unless we enter slow mode and track every seen group. This can cause an IndexOutOfBoundsException if orange is the largest group and the over-allocation is not enough to cover it, or orange may be included in the result even though it was excluded previously.

This change enforces the creation of a new Block/Vector when filtering OrdinalBytesRefBlock and OrdinalBytesRefVector.

Closes #136423

@dnhatn dnhatn added :Analytics/ES|QL AKA ESQL >bug v9.2.1 v9.1.6 v8.19.6 auto-backport Automatically create backport pull requests when merged labels Oct 11, 2025
@dnhatn dnhatn requested a review from nik9000 October 11, 2025 05:34
@dnhatn dnhatn marked this pull request as ready for review October 11, 2025 05:34
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Oct 11, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine
Copy link
Collaborator

Hi @dnhatn, I've created a changelog YAML for you.

15 | Senior Team Lead
11 | Support Engineer
15 | Tech Lead
;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without the fix, Purchase Manager would re-appear in the result.

employees:long | job_positions:keyword
18             | Accountant
13             | Architect
11             | Business Analyst
13             | Data Scientist
10             | Head Human Resources
15             | Internship
14             | Junior Developer
20             | Principal Support Engineer
0              | Purchase Manager
13             | Python Developer
15             | Reporting Analyst
20             | Senior Python Developer
15             | Senior Team Lead
11             | Support Engineer
15             | Tech Lead
@nik9000
Copy link
Member

nik9000 commented Oct 15, 2025

Oh that one's exciting!

@dnhatn
Copy link
Member Author

dnhatn commented Oct 15, 2025

Thanks Nik!

@dnhatn dnhatn merged commit d83584c into elastic:main Oct 15, 2025
34 checks passed
@dnhatn dnhatn deleted the fix-filter-ordinals branch October 15, 2025 15:14
@elasticsearchmachine
Copy link
Collaborator

💔 Backport failed

Status Branch Result
8.19 Commit could not be cherrypicked due to conflicts
9.1 Commit could not be cherrypicked due to conflicts
9.2

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 136444

dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Oct 15, 2025
Currently, when filtering OrdinalBytesRefBlock and 
OrdinalBytesRefVector, we reuse the existing dictionary and filter only
the ordinals. Although this is semantically correct, it can break
HashAggregationOperator when ordinal optimizations are enabled in
BlockHash.

Example: given an incoming OrdinalBytesRefBlock after filtering:

dict: [apple, banana, orange]
ordinals: [0, 0, 0, 0, 1, 1]

Here, orange is not referenced. During grouping and hashing, all 
dictionary entries [apple, banana, orange] are hashed with group IDs [1,
2, 3] (0 reserved for null), and the block produces groupings [1, 1, 1,
1, 2, 2]. The output is correct, but when evaluating partial/final
results, we cannot exclude orange (groupId=3) unless we enter slow mode
and track every seen group. This can cause an IndexOutOfBoundsException
if orange is the largest group and the over-allocation is not enough to
cover it, or orange may be included in the result even though it was
excluded previously.

This change enforces the creation of a new Block/Vector when filtering 
OrdinalBytesRefBlock and OrdinalBytesRefVector.

Closes elastic#136423
elasticsearchmachine pushed a commit that referenced this pull request Oct 15, 2025
Currently, when filtering OrdinalBytesRefBlock and 
OrdinalBytesRefVector, we reuse the existing dictionary and filter only
the ordinals. Although this is semantically correct, it can break
HashAggregationOperator when ordinal optimizations are enabled in
BlockHash.

Example: given an incoming OrdinalBytesRefBlock after filtering:

dict: [apple, banana, orange]
ordinals: [0, 0, 0, 0, 1, 1]

Here, orange is not referenced. During grouping and hashing, all 
dictionary entries [apple, banana, orange] are hashed with group IDs [1,
2, 3] (0 reserved for null), and the block produces groupings [1, 1, 1,
1, 2, 2]. The output is correct, but when evaluating partial/final
results, we cannot exclude orange (groupId=3) unless we enter slow mode
and track every seen group. This can cause an IndexOutOfBoundsException
if orange is the largest group and the over-allocation is not enough to
cover it, or orange may be included in the result even though it was
excluded previously.

This change enforces the creation of a new Block/Vector when filtering 
OrdinalBytesRefBlock and OrdinalBytesRefVector.

Closes #136423
Kubik42 pushed a commit to Kubik42/elasticsearch that referenced this pull request Oct 16, 2025
Currently, when filtering OrdinalBytesRefBlock and 
OrdinalBytesRefVector, we reuse the existing dictionary and filter only
the ordinals. Although this is semantically correct, it can break
HashAggregationOperator when ordinal optimizations are enabled in
BlockHash.

Example: given an incoming OrdinalBytesRefBlock after filtering:

dict: [apple, banana, orange]
ordinals: [0, 0, 0, 0, 1, 1]

Here, orange is not referenced. During grouping and hashing, all 
dictionary entries [apple, banana, orange] are hashed with group IDs [1,
2, 3] (0 reserved for null), and the block produces groupings [1, 1, 1,
1, 2, 2]. The output is correct, but when evaluating partial/final
results, we cannot exclude orange (groupId=3) unless we enter slow mode
and track every seen group. This can cause an IndexOutOfBoundsException
if orange is the largest group and the over-allocation is not enough to
cover it, or orange may be included in the result even though it was
excluded previously.

This change enforces the creation of a new Block/Vector when filtering 
OrdinalBytesRefBlock and OrdinalBytesRefVector.

Closes elastic#136423
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/ES|QL AKA ESQL auto-backport Automatically create backport pull requests when merged backport pending >bug Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.19.6 v9.1.6 v9.2.1 v9.3.0

3 participants