ESQL: CATEGORIZE as a BlockHash #114317
Conversation
Pinging @elastic/es-analytical-engine (Team:Analytics)
Hi @nik9000, I've created a changelog YAML for you. |
somehow
This makes them easier to test.
x-pack/plugin/esql/compute/src/main/java/org/elasticsearch/compute/data/ElementType.java
...c/main/java/org/elasticsearch/compute/aggregation/blockhash/AbstractCategorizeBlockHash.java
x-pack/plugin/esql/qa/testFixtures/src/main/resources/categorize.csv-spec
alex-spies
left a comment
Heya, let's get this green and into main. The only work before merging is, IMO, getting the mutes/capabilities right and having correct expectations for the csv test with nulls - even if that means it needs muting, see below.
There is some immediate follow-up work that needs to be done, but that can be in subsequent PRs. I'm summarizing it here because the many comments are hard to navigate.
Remaining csv test cases:
- `STATS a = c, b = c BY c = CATEGORIZE(message)`
- `from test | STATS MV_COUNT(cat), COUNT(*) BY cat = CATEGORIZE(first_name)`
- `| stats mv_count(categorize(message)) by categorize(message)`
Correct hashing of multivalues (+ test against regressions), see #114317 (comment).
Block hash tests:
- more cases #114317 (comment)
- stronger assertion #114317 (comment)
FoldNull: check if this change is necessary and add a comment as to why if that's still the case #114317 (comment)
Ideally:
- simplify the changes to CombineProjections #114317 (comment)
```
FROM sample_data
| EVAL x = null
| STATS COUNT() BY category=CATEGORIZE(x)
| SORT category
;

COUNT():long | category:keyword
;
```
Let's mark this with -Ignore but let's put the correct expectation here - and in the test below it.
```java
 * Base BlockHash implementation for {@code Categorize} grouping function.
 */
public abstract class AbstractCategorizeBlockHash extends BlockHash {
    // TODO: this should probably also take an emitBatchSize
```
TLDR: It's probably not important for single-element BlockHash implementations like this one.

So emitBatchSize is a request to call AddInput#add every emitBatchSize entries. It's designed to prevent building a huge page of ordinals when processing STATS BY a, b where a and b are not single valued - especially if they are both multivalued. There, the contract is to emit a row for every combination of a and b values. Since that can explode into a huge number of rows, we batch it.

This is much, much less important for single-element BlockHash implementations. They don't change the number of output rows. That's true even for CATEGORIZE. And if the incoming page already "wasn't too big", then the page of ordinals passed to the aggs can't be that big either.

When I first built this I thought I might apply it to single-valued BlockHash implementations as well. It'd be consistent, and it's lame to ignore the request. But it isn't important, so I never got to it.
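To make the combinatorial blowup concrete, here is a toy sketch of what multivalued grouping keys imply (the class and method names are invented for illustration; the real BlockHash API is different): a row with m values in `a` and n values in `b` expands into m * n group ordinals, and emitBatchSize caps how many are buffered before flushing a batch to the consumer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Toy sketch, not the real BlockHash API. Grouping on two multivalued
// keys produces one group ordinal per combination of values; batches of
// at most emitBatchSize keys are flushed to the consumer (the analogue
// of AddInput#add).
class MultivalueGrouping {
    static int addRow(List<String> a, List<String> b, int emitBatchSize, Consumer<List<String>> addInput) {
        List<String> batch = new ArrayList<>();
        int emitted = 0;
        for (String x : a) {
            for (String y : b) {
                batch.add(x + "|" + y); // composite group key for this combination
                if (batch.size() == emitBatchSize) {
                    addInput.accept(List.copyOf(batch)); // flush a bounded batch
                    emitted += batch.size();
                    batch.clear();
                }
            }
        }
        if (!batch.isEmpty()) {
            addInput.accept(List.copyOf(batch)); // flush the remainder
            emitted += batch.size();
        }
        return emitted;
    }
}
```

A single-element BlockHash never multiplies rows this way, which is why ignoring emitBatchSize is tolerable there.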
```java
out.writeVInt(categorizer.getCategoryCount());
for (SerializableTokenListCategory category : categorizer.toCategoriesById()) {
    category.writeTo(out);
}
// We're returning a block with N positions just because the Page must have all blocks with the same position count!
return blockFactory.newConstantBytesRefBlockWith(out.bytes().toBytesRef(), categorizer.getCategoryCount());
```
Do we really need to write the vInt and the Page positions hack? Can't we just write a position per category, to be more like ESQL?
Not sure exactly what you mean. The number of categories is not equal to the number of input texts, meaning you still have a mismatch in the number of positions.
We're building the intermediate state here to pass to the CategorizeIntermediateHashBlock, with 1 row/position per category. So I imagine we can do this in 2 ways.

The current one: serialize into a single BytesRef containing
- an int (# of categories)
- every category

and send this BytesRef in a block with N "simulated" rows.

Instead, do: write a block with one category (serialized in a BytesRef) per position/row.

That way we don't simulate anything, and we could even consume just 2 categories later and discard the rest (maybe? Just a _funny_ possibility, not sure if it's possible).
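The two layouts being compared can be sketched with plain `java.io` (this is illustrative only: the real code uses Elasticsearch's StreamOutput/BytesRef and ML's category serialization, and the method names here are invented):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Toy sketch of the two intermediate-state layouts under discussion.
class IntermediateStateLayouts {
    // Current approach: one blob = [count][category...], carried in a
    // block whose position count merely mirrors the category count.
    static byte[] singleBlob(List<String> categories) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(categories.size());   // # of categories
            for (String c : categories) {
                out.writeUTF(c);               // each serialized category
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e); // cannot happen on a byte array
        }
    }

    // Alternative: one serialized category per block position, so no
    // count prefix and no "simulated" positions are needed.
    static byte[][] perPosition(List<String> categories) {
        byte[][] blocks = new byte[categories.size()][];
        for (int i = 0; i < categories.size(); i++) {
            blocks[i] = categories.get(i).getBytes(StandardCharsets.UTF_8);
        }
        return blocks;
    }
}
```

The per-position layout only works if each category's state is self-contained; as noted below, the categorizer state may be one shared blob, in which case the single-blob layout is the natural fit.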
My memory was that the state was one blob of bytes and not a blob per category. There's, like, shared state. But it's been a month since I thought a lot about this. And I'm wrong about lots of stuff.
```java
) {
    if (groups.stream().anyMatch(GroupSpec::isCategorize)) {
        if (groups.size() != 1) {
            throw new IllegalArgumentException("only a single CATEGORIZE group can used");
```
Typo. Maybe also something like:

```diff
- throw new IllegalArgumentException("only a single CATEGORIZE group can used");
+ throw new IllegalArgumentException("if a CATEGORIZE group is present, no other groups are allowed");
```
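The check discussed above boils down to: if any grouping is a CATEGORIZE, it must be the only grouping. A minimal standalone sketch of that rule (GroupSpec is replaced here by a generic element plus a predicate; names are invented for illustration):

```java
import java.util.List;
import java.util.function.Predicate;

// Simplified stand-in for the validation in BlockHash.build: a
// CATEGORIZE grouping may not be combined with any other grouping.
class CategorizeValidation {
    static <G> void validate(List<G> groups, Predicate<G> isCategorize) {
        if (groups.stream().anyMatch(isCategorize) && groups.size() != 1) {
            throw new IllegalArgumentException("if a CATEGORIZE group is present, no other groups are allowed");
        }
    }
}
```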
```java
int end = first + count;
for (int i = first; i < end; i++) {
    result.appendInt(process(vBlock.getBytesRef(i, vScratch)));
}
```
Indeed, it's broken. We didn't see it because all of our tests with multivalues use COUNT(), which just increments by 1 and doesn't read any other field 💀

Fixing now; added extra tests and functions.
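A toy illustration of why COUNT-only tests masked the bug (invented names, not the real aggregator code): COUNT only increments per row and never reads the extracted values, so a broken multivalue extractor that drops values still passes; an aggregate that consumes the values, like SUM, exposes it immediately.

```java
import java.util.List;

// Toy sketch: COUNT cannot detect a broken value extractor, SUM can.
class CountMasksBug {
    // Buggy "extractor": treats a multivalued position as single valued.
    static List<Integer> buggyExtract(List<Integer> multiValue) {
        return List.of(multiValue.get(0)); // silently drops all but the first value
    }

    static int count(List<Integer> values) {
        return 1; // COUNT per position: ignores the values entirely
    }

    static int sum(List<Integer> values) {
        int s = 0;
        for (int v : values) s += v; // actually consumes the values
        return s;
    }
}
```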
```diff
  return new HashAggregationOperator(
      aggregators,
-     () -> BlockHash.build(groups, driverContext.blockFactory(), maxPageSize, false),
+     () -> BlockHash.build(groups, aggregatorMode, driverContext.blockFactory(), maxPageSize, false),
```
We should probably make this change in the follow-up!
Copilot reviewed 20 out of 35 changed files in this pull request and generated no suggestions.
Files not reviewed (15)
- docs/reference/esql/functions/kibana/definition/categorize.json: Language not supported
- docs/reference/esql/functions/types/categorize.asciidoc: Language not supported
- x-pack/plugin/esql/qa/testFixtures/src/main/resources/mapping-mv_sample_data.json: Language not supported
- x-pack/plugin/esql/qa/testFixtures/src/main/resources/mv_sample_data.csv: Language not supported
- muted-tests.yml: Evaluated as low risk
- x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/optimizer/rules/logical/FoldNull.java: Evaluated as low risk
- x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/optimizer/rules/logical/CombineProjections.java: Evaluated as low risk
- x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/grouping/Categorize.java: Evaluated as low risk
- x-pack/plugin/esql/compute/src/main/java/org/elasticsearch/compute/operator/HashAggregationOperator.java: Evaluated as low risk
- x-pack/plugin/esql/compute/src/test/java/org/elasticsearch/compute/aggregation/GroupingAggregatorFunctionTestCase.java: Evaluated as low risk
- x-pack/plugin/esql/compute/src/test/java/org/elasticsearch/compute/aggregation/blockhash/BlockHashTestCase.java: Evaluated as low risk
- x-pack/plugin/esql/qa/testFixtures/src/main/java/org/elasticsearch/xpack/esql/CsvTestsDataLoader.java: Evaluated as low risk
- x-pack/plugin/esql/compute/src/test/java/org/elasticsearch/compute/operator/HashAggregationOperatorTests.java: Evaluated as low risk
- x-pack/plugin/esql/compute/src/test/java/org/elasticsearch/compute/aggregation/blockhash/BlockHashTests.java: Evaluated as low risk
- x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/action/EsqlCapabilities.java: Evaluated as low risk
💔 Backport failed
You can use sqren/backport to manually backport by running
…lastic#117367) Set/Collection#add() is supposed to return `true` if the collection changed (i.e. if it actually added something). In this case, it must return whether the old value was null. Extracted from elastic#114317 (where it's being used)
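The contract this fix restores can be shown with a minimal map-backed set (a sketch, not the actual class from the PR): `add` must return true exactly when the element was newly inserted, i.e. when the previous mapping was null.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the Set#add contract for a map-backed set: return
// true iff the collection changed, which for Map#put means the old
// value was null.
class MapBackedSet<E> {
    private final Map<E, Boolean> map = new HashMap<>();

    boolean add(E e) {
        return map.put(e, Boolean.TRUE) == null; // changed only if there was no old value
    }
}
```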
Re-implement `CATEGORIZE` in a way that works for multi-node clusters.

This requires that data is first categorized on each data node in a first pass, then the categorizers from each data node are merged on the coordinator node and previously categorized rows are re-categorized.

BlockHashes, used in HashAggregations, already work in a very similar way. E.g. for queries like `... | STATS ... BY field1, field2` they map values for `field1` and `field2` to unique integer ids that are then passed to the actual aggregate functions to identify which "bucket" a row belongs to. When passed from the data nodes to the coordinator, the BlockHashes are also merged to obtain unique ids for every value in `field1, field2` that is seen on the coordinator (not only on the local data nodes).

Therefore, we re-implement `CATEGORIZE` as a special BlockHash.

To choose the correct BlockHash when a query plan is mapped to physical operations, the `AggregateExec` query plan node needs to know that we will be categorizing the field `message` in a query containing `... | STATS ... BY c = CATEGORIZE(message)`. For this reason, _we do not extract the expression_ `c = CATEGORIZE(message)` into an `EVAL` node, in contrast to e.g. `STATS ... BY b = BUCKET(field, 10)`. The expression `c = CATEGORIZE(message)` simply remains inside the `AggregateExec`'s groupings.

**Important limitation:** For now, to use `CATEGORIZE` in a `STATS` command, there can be only 1 grouping (the `CATEGORIZE`) overall.
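The two-pass scheme described above can be sketched as follows (all names here are invented; the real implementation uses ML's TokenListCategorizer and BlockHash ordinals): each data node builds a local table mapping categories to local ids, and the coordinator merges the per-node tables into one global id space before re-categorizing.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy sketch of per-node categorization followed by a coordinator merge.
class TwoPassCategorize {
    // First pass, per data node: assign local ids in discovery order.
    static Map<String, Integer> localCategorize(List<String> messages) {
        Map<String, Integer> table = new LinkedHashMap<>();
        for (String m : messages) {
            table.putIfAbsent(category(m), table.size());
        }
        return table;
    }

    // Second pass, on the coordinator: merge node tables into global ids.
    static Map<String, Integer> merge(List<Map<String, Integer>> nodeTables) {
        Map<String, Integer> global = new LinkedHashMap<>();
        for (Map<String, Integer> table : nodeTables) {
            for (String cat : table.keySet()) {
                global.putIfAbsent(cat, global.size());
            }
        }
        return global;
    }

    // Stand-in for real categorization: strip digits, so messages that
    // differ only in numbers fall into the same category.
    static String category(String message) {
        return message.replaceAll("\\d+", "<n>");
    }
}
```

Two nodes that each saw "connected to 10.1.0.x" messages end up sharing one global category id after the merge, which is exactly what per-node hashing alone could not guarantee.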