⚡ faster-whitespace-pretok

This is a performance fork of Hugging Face’s tokenizers that optimizes the Whitespace PreTokenizer.
It keeps the original functionality and directory layout of tokenizers/tokenizers, including the existing benchmarks and tests, so it stays compatible with upstream.

🔧 Pull Request: huggingface/tokenizers#1822

🚀 What’s New in This Fork?

✅ Optimized `Whitespace` PreTokenizer

Replaced the regex-based logic with a cache-efficient manual traversal using char_indices().
Output is unchanged: span offsets and splits are identical to the original.
Drop-in compatible with all existing pipelines.
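
As an illustration of the approach (not the fork's actual code), the sketch below shows one way to split text in a single char_indices() pass. The names `Class`, `classify`, and `whitespace_spans` are hypothetical. The word test (`is_alphanumeric()` or `_`) is only an approximation of the `\w` class in the `\w+|[^\w\s]+` pattern documented for the upstream Whitespace pre-tokenizer, so edge cases such as combining marks may differ.

```rust
/// Character class used to decide where a run ends (hypothetical helper type).
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Class {
    Word,
    Punct,
    Space,
}

/// Approximates the regex classes: `\s`, `\w`, and everything else.
fn classify(c: char) -> Class {
    if c.is_whitespace() {
        Class::Space
    } else if c.is_alphanumeric() || c == '_' {
        Class::Word
    } else {
        Class::Punct
    }
}

/// Splits `text` into word runs and punctuation runs, dropping whitespace.
/// Each entry is (byte_start, byte_end, slice).
fn whitespace_spans(text: &str) -> Vec<(usize, usize, &str)> {
    let mut spans = Vec::new();
    // Start offset and class of the run currently being read, if any.
    let mut run: Option<(usize, Class)> = None;

    for (i, c) in text.char_indices() {
        let class = classify(c);
        if let Some((start, open)) = run {
            if open == class {
                continue; // still inside the same run
            }
            // The class changed, so the open run ends just before `i`.
            spans.push((start, i, &text[start..i]));
        }
        // Whitespace never opens a run; anything else starts a new one.
        run = if class == Class::Space { None } else { Some((i, class)) };
    }

    // Flush a run that reaches the end of the input.
    if let Some((start, _)) = run {
        spans.push((start, text.len(), &text[start..]));
    }
    spans
}
```

For the input "Hey, friend!" this sketch returns the slices Hey, a comma, friend, and an exclamation mark, with byte offsets (0, 3), (3, 4), (5, 11), and (11, 12).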

✅ Criterion Benchmark Added

Added benches/whitespace_bench.rs
Measures short, medium, and long inputs
Registered in Cargo.toml:

[[bench]]
name = "whitespace_bench"
harness = false

✅ Additional Variant: `WhitespaceSplit`

Lightweight alternative that only splits on whitespace (no span tracking).
Useful for standalone benchmarking or ultra-fast preprocessing.
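
To make the difference from `Whitespace` concrete, here is a minimal sketch of whitespace-only splitting without offsets. The function name `whitespace_split` is hypothetical, and the body relies on the standard library's `str::split_whitespace` rather than on anything taken from this fork.

```rust
/// Splits on Unicode whitespace only; punctuation stays attached to words.
/// Returns plain slices with no offset information.
fn whitespace_split(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}
```

Under this sketch, "Hey, friend!" becomes two pieces, "Hey," and "friend!", whereas `Whitespace` would also separate the comma and the exclamation mark.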

📊 Benchmarks

Benchmarked using Criterion across 5 test cycles:

Input Type	Avg. Time (Original)	Avg. Time (Optimized)	Speedup
Short	~620 ns	~555 ns	✅ 10–15%
Medium	4.3 µs	3.7–4.0 µs	✅ 5–30%
Long	~60–74 µs	~50–63 µs	✅ 5–15%
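
The table does not say how the Speedup column was derived. Assuming it is the relative reduction in average time, the Short row can be reproduced with the sketch below; `percent_reduction` is a hypothetical helper, not part of the benchmark code.

```rust
/// Percentage by which the time drops going from `original_ns` to `optimized_ns`.
/// Assumes "Speedup" in the table means relative reduction in average time.
fn percent_reduction(original_ns: f64, optimized_ns: f64) -> f64 {
    (original_ns - optimized_ns) / original_ns * 100.0
}
```

With the Short row's figures, `percent_reduction(620.0, 555.0)` is about 10.5, which falls inside the reported 10–15% range.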

⚡ Visual Benchmark

🔬 Output remains identical to the original Whitespace implementation.
🧪 Verified with robust unit tests.
🔁 Consistent results across runs.

🧠 Technical Highlights

❌ No regex (avoids regex-engine overhead on every call)
✅ Manual char_indices() loop for precision and cache-friendliness
🧠 Inline span classification
💡 Zero additional dependencies
🔄 Fully backwards-compatible with impl_serde_type!

📎 Related Issue

Improves local benchmarking infrastructure and test coverage related to: #1820

This PR does not fix dataset download issues directly, but adds independent, reproducible local benchmarking support.

🔧 Installation & Usage

Clone the fork and use it as a drop-in tokenizers/tokenizers replacement:

git clone --branch faster-whitespace-pretok https://github.com/8ria/tokenizers.git
cd tokenizers/tokenizers
cargo bench --bench whitespace_bench

Use your own sample inputs by editing whitespace_bench.rs.

📦 Python Installation (from this fork)

To use the Python bindings with the optimized version:

pip install git+https://github.com/8ria/faster-whitespace-pretok.git#subdirectory=bindings/python

All Python-facing behavior remains identical to upstream tokenizers.

🙌 Why This Matters

Whitespace pre-tokenization is executed millions of times in ML workflows:

LLM inference
Prompt batching
Offline training pipelines

Even small improvements in this phase compound at scale — especially when parallelized.

This fork improves efficiency without changing outputs or APIs.

📫 Contact

AndriaK - hey@andriaK.com - GitHub
