HybridChunker with tiktoken tokenizer #1031

ruizguille · 2025-02-20T20:11:21Z

ruizguille
Feb 20, 2025

Hi,

What would be the recommended way to use HybridChunker with the tiktoken tokenizer for use with OpenAI's embedding models? Reading the code, it seems to accept only HuggingFace tokenizers.

Thank you for the amazing library!

Answered by vagenas

Mar 3, 2025

Hi @ruizguille 👋

At the moment, HybridChunker itself indeed only supports HF tokenizers (transformers.PreTrainedTokenizerBase).

That said, the actual text splitting library used in parts of the workflow —semchunk— already supports tiktoken.

So one could expand the HybridChunker such that it can operate with both, by allowing self._tokenizer be of that Union.

👉 Based on the usage of self._tokenizer, one would need to resolve the following for tiktoken (equivalently to HF):

the max tokens allowed for the model
the number of tokens a given piece of text would correspond to
additionally, as string input is still supported, one would have to decide which of HF / tiktoken that would get mapped…

View full answer

vagenas · 2025-03-03T09:03:18Z

vagenas
Mar 3, 2025
Maintainer

Hi @ruizguille 👋

At the moment, HybridChunker itself indeed only supports HF tokenizers (transformers.PreTrainedTokenizerBase).

That said, the actual text splitting library used in parts of the workflow —semchunk— already supports tiktoken.

So one could expand the HybridChunker such that it can operate with both, by allowing self._tokenizer be of that Union.

👉 Based on the usage of self._tokenizer, one would need to resolve the following for tiktoken (equivalently to HF):

the max tokens allowed for the model
the number of tokens a given piece of text would correspond to
additionally, as string input is still supported, one would have to decide which of HF / tiktoken that would get mapped to (I'd say HF for backwards compatibility — but one could also pursue both on a trial-and-error basis)

Would you be interested in submitting a PR yourself? 🙌

0 replies

ruizguille · 2025-03-04T15:38:40Z

ruizguille
Mar 4, 2025
Author

Hi @vagenas,

Thanks a lot for the details!
Sounds great, it would be very useful to have tiktoken support. I will take a look at the code in depth and submit the PR.

0 replies

mustafa-alshawaf · 2025-03-10T16:11:53Z

mustafa-alshawaf
Mar 10, 2025

This is code for a wrapper that I used. The code started from the work of Dave Ebbelaar on YouTube and was modified from there. It worked well for me.

from typing import Dict, List, Tuple

from tiktoken import get_encoding
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, BatchEncoding, AddedToken


class OpenAITokenizerWrapper(PreTrainedTokenizerBase):
    """Minimal wrapper for OpenAI's tokenizer."""

    def __init__(
        self, model_name: str = "cl100k_base", max_length: int = 8191, **kwargs
    ):
        """Initialize the tokenizer.

        Args:
            model_name: The name of the OpenAI encoding to use
            max_length: Maximum sequence length
        """
        super().__init__(model_max_length=max_length, **kwargs)
        self.tokenizer = get_encoding(model_name)
        self._vocab_size = self.tokenizer.max_token_value
        # Don't initialize added_tokens_encoder/decoder here.
        # Do it in _init_special_tokens instead.

    def _init_special_tokens(self, **kwargs):
        """Initialize special token related attributes."""
        super()._init_special_tokens(**kwargs)
        self.added_tokens_encoder: Dict[str, int] = {}
        self.added_tokens_decoder: Dict[int, str] = {}
        self._added_tokens_set = set()
        self.add_special_tokens({'bos_token': '', 'eos_token': '', 'unk_token': ''})

    def _add_tokens(self, new_tokens: List[str | AddedToken], special_tokens: bool = False) -> int:
        """
        Handles adding tokens.
        """
        if not new_tokens:
            return 0

        for token in new_tokens:
            if isinstance(token, str):
                token_str = token
            else:  # AddedToken
                token_str = token.content
            if token_str not in self._added_tokens_set:
                self._added_tokens_set.add(token_str)
        return 0

    def _tokenize(self, text: str, **kwargs) -> List[str]:
        """Tokenize text (implementation for PreTrainedTokenizerBase)."""
        return [str(t) for t in self.tokenizer.encode(text, disallowed_special=())]

    def tokenize(self, text: str, **kwargs) -> List[str]:
        """
        Public tokenize method.  This is what external code calls.
        It handles kwargs and then calls _tokenize.
        """
        return self._tokenize(text, **kwargs)

    def _convert_token_to_id(self, token: str) -> int:
        """Convert a token to its ID."""
        return int(token)

    def _convert_id_to_token(self, index: int) -> str:
        """Convert an ID to its token."""
        return str(index)

    def get_vocab(self) -> Dict[str, int]:
        """Get the vocabulary."""
        return {str(i): i for i in range(self.vocab_size)}

    @property
    def vocab_size(self) -> int:
        """Get the vocabulary size."""
        return self._vocab_size

    def save_vocabulary(self, save_directory: str, filename_prefix: str = None) -> Tuple[str]:
        """Save is not needed for tiktoken."""
        return ()

    def _encode_plus(
        self,
        text: str,
        text_pair: str | None = None,
        add_special_tokens: bool = True,
        padding_strategy="do_not_pad",
        truncation_strategy="do_not_truncate",
        max_length: int | None = None,
        stride: int = 0,
        is_split_into_words: bool = False,
        pad_to_multiple_of: int | None = None,
        return_tensors=None,
        return_token_type_ids: bool | None = None,
        return_attention_mask: bool | None = None,
        return_overflowing_tokens: bool = False,
        return_special_tokens_mask: bool = False,
        return_offsets_mapping: bool = False,
        return_length: bool = False,
        verbose: bool = True,
        **kwargs,
    ) -> BatchEncoding:
        """
        This method mimics the Hugging Face encode_plus method, but uses tiktoken.
        """

        if text_pair is not None:
            raise NotImplementedError("Text pair encoding is not supported.")
        if is_split_into_words:
            raise NotImplementedError("Splitting into words is not supported.")

        tokens = self.tokenizer.encode(text, disallowed_special=())

        if truncation_strategy != "do_not_truncate" and max_length is not None:
            tokens = tokens[:max_length]

        if padding_strategy != "do_not_pad":
            if max_length is None:
                raise ValueError("max_length must be specified for padding.")
            padding_length = max_length - len(tokens)
            if padding_length > 0:
                if pad_to_multiple_of is not None:
                    padding_length = (padding_length + pad_to_multiple_of - 1) // pad_to_multiple_of * pad_to_multiple_of - len(tokens)
                tokens.extend([0] * padding_length)

        encoding = BatchEncoding(
            {
                "input_ids": tokens,
                "attention_mask": [1] * len(tokens),
            },
            tensor_type=return_tensors,
        )

        return encoding


    def encode(
        self,
        text: str,
        text_pair: str | None = None,
        add_special_tokens: bool = True,
        padding="do_not_pad",
        truncation=False,
        max_length: int | None = None,
        **kwargs,
    ) -> List[int]:
        """Simplified encode method."""
        return self._encode_plus(text, text_pair, add_special_tokens, padding, truncation, max_length, **kwargs)["input_ids"]


    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path: str, *args, **kwargs):
        """Class method to match HuggingFace's interface."""
        return cls(model_name=pretrained_model_name_or_path, **kwargs)

3 replies

monokizsolt Apr 23, 2025

This worked for me perfectly last week, but now I am getting error:

chunker = HybridChunker( ^^^^^^^^^^^^^^ File "/.venv/lib/python3.11/site-packages/pydantic/main.py", line 253, in __init__ validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/.venv/lib/python3.11/site-packages/docling_core/transforms/chunker/hybrid_chunker.py", line 73, in _patch if isinstance(data, dict) and (tokenizer := data.get("tokenizer")): File "/.venv/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 1673, in __len__ raise NotImplementedError() NotImplementedError

After some search I found that it is looking for a len function. PreTrainedTokenizerBase does not implement it.
If i add len to this wrapper, it works again:

def __len__(self) -> int: return self._vocab_size

Does anyone know what this function should return? It does not seem to have an effect.

ajfabian-reea Sep 9, 2025

I'm getting the same error here. Does anyone have solved this?
UPD: I added this to the class and it seems to be working (at least not broken). IDK if it's correct tho.

def __len__(self) -> int:
        return self.vocab_size

monokizsolt Sep 10, 2025

@ajfabian-reea it is now part of docling:
https://docling-project.github.io/docling/examples/hybrid_chunking/#configuring-tokenization

ruizguille · 2025-03-13T18:07:41Z

ruizguille
Mar 13, 2025
Author

Hi @vagenas,

I have been diving into the code and wanted to clarify a couple of things before submitting the PR.

Since HybridChunker is a Pydantic model with runtime type validation, including tiktoken.Encoding as one of the possible types of tokenizer would require tiktoken as an additional dependency of the library (optional, for chunking only). Is that okay?
Semchunk, on the other hand, does not include tiktoken as a dependency, but they are not validating the types at runtime.
Tiktoken does not provide a model_max_length equivalent property or alternative method to infer it. I am leaning towards raising a validation error if max_tokens is not provided when using a tiktoken tokenizer. Is this reasonable? The other alternative I see is to maintain a dictionary mapping each OpenAI model to its context length which does not seem ideal and would have to be kept updated.

Thank you.
And thank you @mustafa-alshawaf for the wrapper.

2 replies

jayeshcreole Mar 19, 2025

@ruizguille my use case is I want to get the chunk of the doc inorder to store in a vector database. And I am hosting the application on vercel serverless platform. Now using the HybridChunker with HF tokenizer loads the model locally which is not supported in vercel runtime environment due to its limitation.

Can you provide me some alternative? Is there any way where I can use HF inference API without loading the model locally and deploy my app?

vagenas Mar 19, 2025
Maintainer

Hi @ruizguille, both proposals sound reasonable!

jayeshcreole · 2025-03-19T06:08:21Z

jayeshcreole
Mar 19, 2025

@vagenas Is there any way to run docling with hybridChunker in a serverless environment?

1 reply

vagenas Mar 19, 2025
Maintainer

Hi @jayeshcreole, we currently only support locally running models, not e.g. via the HF Inference API. In case the limitation is due to one-time model download from HF Hub, then if you can package the model inside your application, you can use it locally by just pointing to the local folder instead of the model name on HF Hub.

dominicdill · 2025-09-26T18:47:38Z

dominicdill
Sep 26, 2025

What is the status of this? Can we use openai's embedding models with the hybrid chunker out of the box?
I see the following reference on the hybrid chunker example page: https://docling-project.github.io/docling/examples/hybrid_chunking/#configuring-tokenization

But this doesn't reference the embedding models as far as I understand. What should I do?

2 replies

faileon Oct 13, 2025

What is the status of this? Can we use openai's embedding models with the hybrid chunker out of the box? I see the following reference on the hybrid chunker example page: https://docling-project.github.io/docling/examples/hybrid_chunking/#configuring-tokenization
But this doesn't reference the embedding models as far as I understand. What should I do?

AFAIK OpenAI uses same tokenizer (tiktoken) for their GPTs as well as for their embedding models, so it should work.

However how can I specify tiktoken in the docling-serve package? How can I specify the tokenizer via chunking_tokenizer param?

dominicdill Oct 14, 2025

This is what I did. My code here
I used the patchedopenaitokenizer because i was trying to embed text with special tokens and running into an issue.
I hope that answers your question

HybridChunker with tiktoken tokenizer #1031

Uh oh!

Replies: 6 comments · 8 replies

Uh oh!

vagenas Mar 3, 2025 Maintainer

Uh oh!

ruizguille Mar 4, 2025 Author

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ruizguille Mar 13, 2025 Author

Uh oh!

Uh oh!

vagenas Mar 19, 2025 Maintainer

Uh oh!

Uh oh!

vagenas Mar 19, 2025 Maintainer

Uh oh!

Uh oh!

Uh oh!

Replies: 6 comments 8 replies

vagenas
Mar 3, 2025
Maintainer

ruizguille
Mar 4, 2025
Author

ruizguille
Mar 13, 2025
Author

vagenas Mar 19, 2025
Maintainer

vagenas Mar 19, 2025
Maintainer