This is a performance fork of Hugging Face's tokenizers, focused on optimizing the Whitespace PreTokenizer.
It preserves all original functionality and directory layout of tokenizers/tokenizers for compatibility, including benchmark support and test coverage.
Pull Request: huggingface/tokenizers#1822
- Replaced regex-based logic with a cache-efficient manual traversal using `char_indices()`.
- No change to output behavior: identical span offsets and splits.
- Drop-in compatible with all existing pipelines.
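
As a rough illustration of the manual traversal described above, the sketch below walks `char_indices()` once and emits byte-offset spans for runs of word characters and runs of punctuation, skipping whitespace. It is not the fork's actual code: the names `CharClass`, `classify`, and `whitespace_spans` are hypothetical, and `is_alphanumeric() || c == '_'` only approximates the regex class `\w` in the upstream pattern `\w+|[^\w\s]+`.

```rust
/// Character classes mirroring the pattern `\w+|[^\w\s]+` (hypothetical helper type).
#[derive(Clone, Copy, PartialEq, Eq)]
enum CharClass {
    Word,
    Space,
    Other,
}

/// Classifies one character. Assumption: alphanumeric-or-underscore is used
/// as an approximation of the regex class `\w`.
fn classify(c: char) -> CharClass {
    if c.is_whitespace() {
        CharClass::Space
    } else if c.is_alphanumeric() || c == '_' {
        CharClass::Word
    } else {
        CharClass::Other
    }
}

/// Returns `(start, end)` byte offsets of every non-whitespace run,
/// splitting whenever the character class changes.
fn whitespace_spans(text: &str) -> Vec<(usize, usize)> {
    let mut spans = Vec::new();
    // Start offset and class of the run currently being built, if any.
    let mut current: Option<(usize, CharClass)> = None;

    for (idx, c) in text.char_indices() {
        let class = classify(c);
        match current {
            // Same class as the open run: the run simply continues.
            Some((_, run_class)) if run_class == class => {}
            _ => {
                // Class changed: close the open run, if there is one.
                if let Some((start, _)) = current.take() {
                    spans.push((start, idx));
                }
                // Whitespace never opens a run; anything else does.
                if class != CharClass::Space {
                    current = Some((idx, class));
                }
            }
        }
    }
    // Close a run that reaches the end of the input.
    if let Some((start, _)) = current {
        spans.push((start, text.len()));
    }
    spans
}
```

For example, `whitespace_spans("Hello, world!")` yields `[(0, 5), (5, 6), (7, 12), (12, 13)]`, covering `Hello`, `,`, `world`, and `!`.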
- Added `benches/whitespace_bench.rs`
  - Measures short, medium, and long inputs
- Registered in `Cargo.toml`:

      [[bench]]
      name = "whitespace_bench"
      harness = false

- Lightweight alternative that only splits on whitespace (no span tracking).
- Useful for standalone benchmarking or ultra-fast preprocessing.
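
A whitespace-only splitter of the kind described above could look like the following minimal sketch. The function name `split_on_whitespace` is hypothetical and the fork's real implementation may differ; the sketch relies only on the standard library's `str::split_whitespace`.

```rust
/// Hypothetical helper: splits on Unicode whitespace and returns borrowed
/// pieces. No offsets are tracked and punctuation is not separated.
fn split_on_whitespace(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}
```

Unlike the `Whitespace` pre-tokenizer, this keeps punctuation attached to neighboring words, so `"Hello, world!"` becomes `["Hello,", "world!"]`.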
Benchmarked using Criterion across 5 test cycles:
| Input Type | Avg. Time (Original) | Avg. Time (Optimized) | Speedup |
|---|---|---|---|
| Short | ~620 ns | ~555 ns | 10–15% |
| Medium | 4.3 µs | 3.7–4.0 µs | 5–30% |
| Long | ~60–74 µs | ~50–63 µs | 5–15% |
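
The Speedup column can be read as the relative reduction in average time. The helper below is a small sketch of that calculation; `speedup_percent` is a hypothetical name, and it is an assumption that the table's percentages were derived this way.

```rust
/// Relative time reduction in percent: (original - optimized) / original * 100.
/// Assumption: both timings are in the same unit.
fn speedup_percent(original: f64, optimized: f64) -> f64 {
    (original - optimized) / original * 100.0
}
```

With the Short row's averages, `speedup_percent(620.0, 555.0)` comes to roughly 10.5%.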
- Output remains identical to the original `Whitespace` implementation.
- Verified with robust unit tests.
- Consistent results across runs.
- No regex (avoids unnecessary overhead)
- Manual `char_indices()` loop for precision and cache-friendliness
- Inline span classification
- Zero additional dependencies
- Fully backwards-compatible with `impl_serde_type!`
Improves local benchmarking infrastructure and test coverage related to: #1820
This PR does not fix dataset download issues directly, but adds independent, reproducible local benchmarking support.
Clone the fork and use it as a drop-in tokenizers/tokenizers replacement:

    git clone --branch faster-whitespace-pretok https://github.com/8ria/tokenizers.git
    cd tokenizers/tokenizers
    cargo bench --bench whitespace_bench

Use your own sample inputs by editing `whitespace_bench.rs`.
To use the Python bindings with the optimized version:

    pip install git+https://github.com/8ria/faster-whitespace-pretok.git#subdirectory=bindings/python

All Python-facing behavior remains identical to upstream `tokenizers`.
Whitespace pre-tokenization is executed millions of times in ML workflows:
- LLM inference
- Prompt batching
- Offline training pipelines
Even small improvements in this phase compound at scale, especially when parallelized.
This fork improves efficiency without changing outputs or APIs.
AndriaK - hey@andriaK.com - GitHub

