Add recursive chunker#126866
Conversation
|
Hi @dan-rubinstein, I've created a changelog YAML for you. |
...nference/src/test/java/org/elasticsearch/xpack/inference/chunking/RecursiveChunkerTests.java
Outdated
Show resolved
Hide resolved
...gin/inference/src/main/java/org/elasticsearch/xpack/inference/chunking/RecursiveChunker.java
Outdated
Show resolved
Hide resolved
...gin/inference/src/main/java/org/elasticsearch/xpack/inference/chunking/RecursiveChunker.java
Show resolved
Hide resolved
...gin/inference/src/main/java/org/elasticsearch/xpack/inference/chunking/RecursiveChunker.java
Outdated
Show resolved
Hide resolved
|
Pinging @elastic/ml-core (Team:ML) |
.../plugin/inference/src/main/java/org/elasticsearch/xpack/inference/chunking/SeparatorSet.java
Show resolved
Hide resolved
...gin/inference/src/main/java/org/elasticsearch/xpack/inference/chunking/RecursiveChunker.java
Outdated
Show resolved
Hide resolved
...gin/inference/src/main/java/org/elasticsearch/xpack/inference/chunking/RecursiveChunker.java
Outdated
Show resolved
Hide resolved
...gin/inference/src/main/java/org/elasticsearch/xpack/inference/chunking/RecursiveChunker.java
Show resolved
Hide resolved
...gin/inference/src/main/java/org/elasticsearch/xpack/inference/chunking/RecursiveChunker.java
Outdated
Show resolved
Hide resolved
| for (int i = 0; i < numSentences - 1; i++) { | ||
| splittersAfterSentences.add(randomFrom(validSplittersAfterSentences)); | ||
| } | ||
| RecursiveChunkingSettings settings = generateChunkingSettings(15, separators); |
There was a problem hiding this comment.
Because of the small chunk size the generated chunks will never contain more than 1 sentence. Can you structure the test so that some chunks contain multiple heading sections.
For example if, if chunks size was 100 words and given the document
# heading1.1
## heading1.2.1
TEST_SENTENCE * 3
## heading1.2.2
TEST_SENTENCE * 2
# heading2.1
## heading2.2.1
TEST_SENTENCE * 9
## heading2.2.2
TEST_SENTENCE
### heading2.3.1
TEST_SENTENCE
### heading2.3.2
TEST_SENTENCE
In this case, given an ordered list of separators, I would expect # heading1.1 -> # heading2.1 to be a single chunks. Then 2 more chunks for heading2.2.1 and heading2.2.2
Please add tests on longer documents that capture the hierarchical nature of the chunker
There was a problem hiding this comment.
Discussed offline with Dave. Adding this into the existing long document tests that randomly generate a document would require essentially re-writing the chunking logic into the testing file to generate the expected chunk limits. We've instead decided it makes sense to add a new test with a smaller fixed length document to cover this case.
|
@elasticmachine merge upstream |
💚 Backport successful
|
* Add recursive chunker (#126866) * Add recursive chunker * Update docs/changelog/126866.yaml * Clean up separator sets and add asMap function for RecrusiveChunkingSettings * Add javadoc for chunker, add tests, reduce word counting operations * Remove split merging and add long document unit test * [CI] Auto commit changes from spotless * Add markdown chunking tests and reduce substring calls * Clean up matcher logic * Add testing for not splitting after valid chunk is found --------- Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co> Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com> * Update getFirst to get in recursive chunker tests --------- Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co> Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
* Add recursive chunker * Update docs/changelog/126866.yaml * Clean up separator sets and add asMap function for RecrusiveChunkingSettings * Add javadoc for chunker, add tests, reduce word counting operations * Remove split merging and add long document unit test * [CI] Auto commit changes from spotless * Add markdown chunking tests and reduce substring calls * Clean up matcher logic * Add testing for not splitting after valid chunk is found --------- Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co> Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
* Add recursive chunker * Update docs/changelog/126866.yaml * Clean up separator sets and add asMap function for RecrusiveChunkingSettings * Add javadoc for chunker, add tests, reduce word counting operations * Remove split merging and add long document unit test * [CI] Auto commit changes from spotless * Add markdown chunking tests and reduce substring calls * Clean up matcher logic * Add testing for not splitting after valid chunk is found --------- Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co> Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
Description
This change adds the recursive chunking strategy (see Javadocs) of the RecursiveChunker.java to learn how it works. The recursive chunker comes with a default separator set for plaintext along with one for markdown documents.
Example usage:
Testing
CONTRIBUTING.mddocument.