ES|QL categorize with multiple groupings#118173
Conversation
|
Pinging @elastic/ml-core (Team:ML) |
|
Hi @jan-elastic, I've created a changelog YAML for you. |
099e143 to
7409695
Compare
7409695 to
35e9811
Compare
...in/java/org/elasticsearch/compute/aggregation/blockhash/CategorizePackedValuesBlockHash.java
Outdated
Show resolved
Hide resolved
|
Hi @jan-elastic, I've created a changelog YAML for you. |
x-pack/plugin/esql/compute/src/main/java/org/elasticsearch/compute/data/Page.java
Outdated
Show resolved
Hide resolved
...in/java/org/elasticsearch/compute/aggregation/blockhash/CategorizePackedValuesBlockHash.java
Outdated
Show resolved
Hide resolved
ivancea
left a comment
There was a problem hiding this comment.
Nice work 👀
Some nits and new tests in comments; the impl looks quite solid to me
...in/esql/compute/src/main/java/org/elasticsearch/compute/aggregation/blockhash/BlockHash.java
Show resolved
Hide resolved
x-pack/plugin/esql/qa/testFixtures/src/main/resources/categorize.csv-spec
Show resolved
Hide resolved
...in/java/org/elasticsearch/compute/aggregation/blockhash/CategorizePackedValuesBlockHash.java
Outdated
Show resolved
Hide resolved
| if (id == 0) { | ||
| builder.appendNull(); | ||
| } else { | ||
| builder.appendBytesRef(regexes.getBytesRef(id + idsOffset, scratch)); |
There was a problem hiding this comment.
Not for now: We're repeating, potentially, a lot of bytesref values here. I wonder if there is or it would make sense to have a BytesRefBlock that instead of all the bytesrefs, stores every value just once, and then a reference per position:
AAAAAA
BBBBBBB
AAAAAA
AAAAAA
->
// 1: AAAAAA
// 2: BBBBBBB
1
2
1
1
@nik9000 Something to consider for later? Maybe it's too specific for this. And anyway, the next EVAL or whatever will duplicate the value again.
There was a problem hiding this comment.
That sounds like a nice thing to have, but definitely out of scope for this PR.
However, the next EVAL should not duplicate the value again.
If you have:
// 1: AAAAAA
// 2: BBBBBBB
1
2
1
1
then an efficient EVAL x=SUBSTRING(x, 1, 2) should give
// 1: AA
// 2: BB
1
2
1
1
without ever duplicating.
There was a problem hiding this comment.
For that SUBSTRING to not duplicate, we would need to add that "hashtable" strategy in the BytesRefBlockBuilder. It looks goo (?), but I wonder if using that by default could perform negatively in some scenarios. Something to try eventually probably
There was a problem hiding this comment.
Sounds like worth trying in the future. Are you making a note (issue) of this, so that the idea doesn't get lost?
There was a problem hiding this comment.
Sure! I'll comment it with Nik, just in case it was considered and discarded already, and then I'll document it in an issue somewhere
There was a problem hiding this comment.
Heya, this generally looks very good, thank you @jan-elastic !
I only have one observation about potential untracked memory that I think we should ponder at least a bit to ensure we're safe; see below.
...in/esql/compute/src/main/java/org/elasticsearch/compute/aggregation/blockhash/BlockHash.java
Show resolved
Hide resolved
x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/analysis/VerifierTests.java
Outdated
Show resolved
Hide resolved
x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/analysis/VerifierTests.java
Outdated
Show resolved
Hide resolved
x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/analysis/VerifierTests.java
Show resolved
Hide resolved
x-pack/plugin/esql/qa/testFixtures/src/main/resources/categorize.csv-spec
Show resolved
Hide resolved
...in/java/org/elasticsearch/compute/aggregation/blockhash/CategorizePackedValuesBlockHash.java
Show resolved
Hide resolved
...in/java/org/elasticsearch/compute/aggregation/blockhash/CategorizePackedValuesBlockHash.java
Outdated
Show resolved
Hide resolved
...in/java/org/elasticsearch/compute/aggregation/blockhash/CategorizePackedValuesBlockHash.java
Outdated
Show resolved
Hide resolved
...in/java/org/elasticsearch/compute/aggregation/blockhash/CategorizePackedValuesBlockHash.java
Outdated
Show resolved
Hide resolved
...in/java/org/elasticsearch/compute/aggregation/blockhash/CategorizePackedValuesBlockHash.java
Show resolved
Hide resolved
alex-spies
left a comment
There was a problem hiding this comment.
Let's wait with merging until we had a chance to look at this subtle point.
...va/org/elasticsearch/compute/aggregation/blockhash/CategorizePackedValuesBlockHashTests.java
Show resolved
Hide resolved
|
@alex-spies Thanks for the review! Processed all your comments, except the OOM one, which needs more discussion. PTAL |
alex-spies
left a comment
There was a problem hiding this comment.
Thanks @jan-elastic !
Let's take care of the memory accounting in a follow-up.
Maybe we want to add a couple more block hash test cases, but otherwise this LGTM and should be unblocked :)
...in/java/org/elasticsearch/compute/aggregation/blockhash/CategorizePackedValuesBlockHash.java
Outdated
Show resolved
Hide resolved
...va/org/elasticsearch/compute/aggregation/blockhash/CategorizePackedValuesBlockHashTests.java
Show resolved
Hide resolved
…egorize-multiple-groupings
...in/java/org/elasticsearch/compute/aggregation/blockhash/CategorizePackedValuesBlockHash.java
Show resolved
Hide resolved
…egorize-multiple-groupings
…egorize-multiple-groupings
💔 Backport failed
You can use sqren/backport to manually backport by running |
* ES|QL categorize with multiple groupings. * Fix VerifierTests * Close stuff when constructing CategorizePackedValuesBlockHash fails * CategorizePackedValuesBlockHashTests * Improve categorize javadocs * Update docs/changelog/118173.yaml * Create CategorizePackedValuesBlockHash's deletegate page differently * Double check in BlockHash builder for single categorize * Reuse blocks array * More CSV tests * Remove assumeTrue categorize_v5 * Rename test * Two more verifier tests * more CSV tests * Add JavaDocs/comments * spotless * Refactor/unify recategorize * Better memory accounting * fix csv test * randomize CategorizePackedValuesBlockHashTests * Add TODO
* ES|QL categorize with multiple groupings. * Fix VerifierTests * Close stuff when constructing CategorizePackedValuesBlockHash fails * CategorizePackedValuesBlockHashTests * Improve categorize javadocs * Update docs/changelog/118173.yaml * Create CategorizePackedValuesBlockHash's deletegate page differently * Double check in BlockHash builder for single categorize * Reuse blocks array * More CSV tests * Remove assumeTrue categorize_v5 * Rename test * Two more verifier tests * more CSV tests * Add JavaDocs/comments * spotless * Refactor/unify recategorize * Better memory accounting * fix csv test * randomize CategorizePackedValuesBlockHashTests * Add TODO Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
…8590) * ES|QL categorize with multiple groupings. * Fix VerifierTests * Close stuff when constructing CategorizePackedValuesBlockHash fails * CategorizePackedValuesBlockHashTests * Improve categorize javadocs * Update docs/changelog/118173.yaml * Create CategorizePackedValuesBlockHash's deletegate page differently * Double check in BlockHash builder for single categorize * Reuse blocks array * More CSV tests * Remove assumeTrue categorize_v5 * Rename test * Two more verifier tests * more CSV tests * Add JavaDocs/comments * spotless * Refactor/unify recategorize * Better memory accounting * fix csv test * randomize CategorizePackedValuesBlockHashTests * Add TODO Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
…8590) * ES|QL categorize with multiple groupings. * Fix VerifierTests * Close stuff when constructing CategorizePackedValuesBlockHash fails * CategorizePackedValuesBlockHashTests * Improve categorize javadocs * Update docs/changelog/118173.yaml * Create CategorizePackedValuesBlockHash's deletegate page differently * Double check in BlockHash builder for single categorize * Reuse blocks array * More CSV tests * Remove assumeTrue categorize_v5 * Rename test * Two more verifier tests * more CSV tests * Add JavaDocs/comments * spotless * Refactor/unify recategorize * Better memory accounting * fix csv test * randomize CategorizePackedValuesBlockHashTests * Add TODO Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
No description provided.