HybridChunker with tiktoken tokenizer #1031
-
Hi, what would be the recommended way to use HybridChunker with a tiktoken tokenizer? Thank you for the amazing library!
Replies: 6 comments 8 replies
-
Hi @ruizguille 👋
At the moment, HybridChunker itself indeed only supports HF tokenizers (transformers.PreTrainedTokenizerBase). That said, the actual text splitting library used in parts of the workflow, semchunk, already supports tiktoken. So one could expand HybridChunker such that it can operate with both, by allowing self._tokenizer to be of that Union.
👉 Based on the usage of self._tokenizer, one would need to resolve the following for tiktoken (equivalently to HF).
Would you be interested in submitting a PR yourself? 🙌
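In other words, HybridChunker would need a tokenizer-shaped object backed by tiktoken. Below is a minimal sketch of what such an adapter could look like, assuming the chunker mainly needs token counting via tokenize() and a model_max_length attribute. All names here (TiktokenAdapter, _WhitespaceEncoding) are hypothetical, and the whitespace encoding is a stand-in so the sketch runs without tiktoken installed:

```python
# Hypothetical sketch: adapt a tiktoken-style encoding to the small surface
# HybridChunker uses from HF tokenizers (assumption: token counting via
# tokenize() plus a model_max_length attribute).

class TiktokenAdapter:
    def __init__(self, encoding, model_max_length):
        # encoding: any object with .encode(text) -> list of token ids,
        # e.g. tiktoken.get_encoding("cl100k_base") in real use.
        self._encoding = encoding
        self.model_max_length = model_max_length

    def tokenize(self, text):
        # Return one entry per token id, so len(tokenize(text)) equals
        # the true token count, which is what the chunker relies on.
        return [str(t) for t in self._encoding.encode(text)]


class _WhitespaceEncoding:
    # Stand-in for a real tiktoken encoding so the sketch runs anywhere.
    def encode(self, text):
        return text.split()


adapter = TiktokenAdapter(_WhitespaceEncoding(), model_max_length=512)
tokens = adapter.tokenize("hello world from docling")
```

With tiktoken installed, you would pass a real encoding (e.g. tiktoken.get_encoding("cl100k_base")) in place of the stand-in; letting self._tokenizer be such an object is the Union change described above.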
-
Hi @vagenas, Thanks a lot for the details!
-
This is code for a wrapper that I used. The code started from the work of Dave Ebbelaar on YouTube and was modified from there. It worked well for me.
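The wrapper code itself did not survive in this thread, so the following is a hypothetical sketch in its spirit, not the commenter's actual code. It illustrates the invariant an exact token counter buys you: chunks can be packed greedily so each one stays within the embedding model's token budget. Here count_tokens is a whitespace stand-in for len(encoding.encode(text)), and split_by_budget is a simplified, illustrative view of what the chunker enforces:

```python
def count_tokens(text):
    # Stand-in for len(tiktoken_encoding.encode(text)); whitespace split
    # keeps the example runnable without tiktoken installed.
    return len(text.split())


def split_by_budget(sentences, max_tokens):
    """Greedy sentence packing: start a new chunk whenever adding the next
    sentence would exceed the token budget (a simplified sketch of the
    constraint HybridChunker enforces, not its actual implementation)."""
    chunks, current, used = [], [], 0
    for s in sentences:
        n = count_tokens(s)
        if current and used + n > max_tokens:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(s)
        used += n
    if current:
        chunks.append(" ".join(current))
    return chunks


chunks = split_by_budget(
    ["one two three", "four five", "six seven eight"], max_tokens=5
)
```

Because the counter matches the embedding model's tokenizer exactly, every chunk is guaranteed to fit the model's context limit, which is the whole point of swapping in tiktoken for OpenAI embeddings.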
-
Hi @vagenas, I have been diving into the code and wanted to clarify a couple of things before submitting the PR.
Thank you.
-
@vagenas Is there any way to run docling with HybridChunker in a serverless environment?
-
What is the status of this? Can we use OpenAI's embedding models with the hybrid chunker out of the box?
But this doesn't reference the embedding models as far as I understand. What should I do?


