⚡ faster-whitespace-pretok

This is a performance fork of Hugging Face’s tokenizers, focused on optimizing the Whitespace PreTokenizer.
It preserves all original functionality and the directory layout of tokenizers/tokenizers for compatibility, including benchmark support and test coverage.

🔧 Pull Request: huggingface/tokenizers#1822


🚀 What’s New in This Fork?

✅ Optimized Whitespace PreTokenizer

  • Replaced regex-based logic with a cache-efficient manual traversal using char_indices().
  • No change to output behavior: span offsets and splits are identical.
  • Drop-in compatible with all existing pipelines.
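
The traversal described above can be sketched with the standard library alone. This is an illustrative sketch, not the fork's actual code: the names CharClass, classify, and whitespace_spans are hypothetical, and char::is_alphanumeric plus '_' only approximates the regex class \w in the upstream pattern \w+|[^\w\s]+.

```rust
/// Character classes approximating the upstream pattern `\w+|[^\w\s]+`.
/// Assumption: `is_alphanumeric()` plus '_' stands in for the regex class `\w`.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum CharClass {
    Word,
    Space,
    Other,
}

fn classify(c: char) -> CharClass {
    if c.is_whitespace() {
        CharClass::Space
    } else if c.is_alphanumeric() || c == '_' {
        CharClass::Word
    } else {
        CharClass::Other
    }
}

/// Returns `(start_byte, end_byte, text)` for every run of word characters
/// or of non-word, non-space characters. Whitespace is dropped.
fn whitespace_spans(input: &str) -> Vec<(usize, usize, &str)> {
    let mut spans = Vec::new();
    // Start offset and class of the run currently being scanned, if any.
    let mut run: Option<(usize, CharClass)> = None;

    for (idx, c) in input.char_indices() {
        let class = classify(c);
        if let Some((start, current)) = run {
            if current == class {
                continue; // still inside the same run
            }
            // The class changed, so the current run ends just before `idx`.
            spans.push((start, idx, &input[start..idx]));
            run = None;
        }
        if class != CharClass::Space {
            run = Some((idx, class));
        }
    }
    // Flush a run that extends to the end of the input.
    if let Some((start, _)) = run {
        spans.push((start, input.len(), &input[start..]));
    }
    spans
}
```

Because char_indices() yields byte offsets, slicing with them always lands on UTF-8 character boundaries, so multi-byte input such as "héllo" is handled without extra bookkeeping.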

✅ Criterion Benchmark Added

  • Added benches/whitespace_bench.rs
  • Measures short, medium, and long inputs
  • Registered in Cargo.toml:
[[bench]]
name = "whitespace_bench"
harness = false
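
The idea of timing short, medium, and long inputs can be approximated without Criterion using a std-only loop. The sketch below is not the contents of benches/whitespace_bench.rs: count_tokens, sample_inputs, and time_per_call are hypothetical names, the sample sentence is invented, and a plain Instant loop has none of Criterion's warm-up or statistical analysis.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the pre-tokenizer under test.
fn count_tokens(input: &str) -> usize {
    input.split_whitespace().count()
}

/// Builds short, medium, and long inputs by repeating one sample sentence.
fn sample_inputs() -> Vec<(&'static str, String)> {
    let sentence = "The quick brown fox jumps over the lazy dog. ";
    vec![
        ("short", sentence.repeat(1)),
        ("medium", sentence.repeat(10)),
        ("long", sentence.repeat(100)),
    ]
}

/// Average wall-clock time per call over `iters` iterations (`iters` > 0).
fn time_per_call(input: &str, iters: u32) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        // black_box keeps the optimizer from removing the measured call.
        black_box(count_tokens(black_box(input)));
    }
    start.elapsed() / iters
}
```

Calling time_per_call on each entry of sample_inputs() gives a rough per-size figure. For numbers comparable to the table in this README, the Criterion benchmark remains the reference.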

✅ Additional Variant: WhitespaceSplit

  • Lightweight alternative that only splits on whitespace (no span tracking).
  • Useful for standalone benchmarking or ultra-fast preprocessing.
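
A split that only looks at whitespace can be sketched in the same style. This is a minimal illustration under assumptions, not the fork's implementation: whitespace_split is a hypothetical name, and the function returns borrowed string slices only, with no offsets.

```rust
/// Splits on Unicode whitespace and returns the non-empty pieces.
/// Punctuation stays attached to neighbouring words.
fn whitespace_split(input: &str) -> Vec<&str> {
    let mut pieces = Vec::new();
    // Byte offset where the current piece started, if we are inside one.
    let mut start: Option<usize> = None;

    for (idx, c) in input.char_indices() {
        match (c.is_whitespace(), start) {
            // Whitespace ends the piece that was in progress.
            (true, Some(s)) => {
                pieces.push(&input[s..idx]);
                start = None;
            }
            // A non-whitespace character opens a new piece.
            (false, None) => start = Some(idx),
            // Either still inside a piece or still skipping whitespace.
            _ => {}
        }
    }
    if let Some(s) = start {
        pieces.push(&input[s..]);
    }
    pieces
}
```

For the inputs tried here, the result matches the standard library's str::split_whitespace, which is a convenient reference when testing a hand-written loop like this one.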

📊 Benchmarks

Benchmarked using Criterion across 5 test cycles:

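| Input Type | Avg. Time (Original) | Avg. Time (Optimized) | Speedup   |
|------------|----------------------|-----------------------|-----------|
| Short      | ~620 ns              | ~555 ns               | ✅ 10–15% |
| Medium     | 4.3 µs               | 3.7–4.0 µs            | ✅ 5–30%  |
| Long       | ~60–74 µs            | ~50–63 µs             | ✅ 5–15%  |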
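
One way to relate the speedup column to the two timing columns is relative time saved, (original - optimized) / original. The helper below is a hypothetical reading aid, not part of the fork. The README does not say how its speedup ranges were derived, so they may reflect run-to-run variation across the 5 cycles rather than the averages alone.

```rust
/// Relative time saved, in percent: (original - optimized) / original * 100.
/// Both arguments must use the same unit (for example ns or µs).
fn speedup_percent(original: f64, optimized: f64) -> f64 {
    (original - optimized) / original * 100.0
}
```

For the short input, speedup_percent(620.0, 555.0) comes out at roughly 10.5, which falls inside the reported 10–15% range.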

⚡ Visual Benchmark

[Figure: Whitespace PreTokenizer Benchmark Results]

  • 🔬 Output remains identical to the original Whitespace implementation.
  • 🧪 Verified with robust unit tests.
  • 🔁 Consistent results across runs.

🧠 Technical Highlights

  • ❌ No regex (avoids unnecessary overhead)
  • ✅ Manual char_indices() loop for precision and cache-friendliness
  • 🧠 Inline span classification
  • 💡 Zero additional dependencies
  • 🔄 Fully backwards-compatible with impl_serde_type!

📎 Related Issue

Improves local benchmarking infrastructure and test coverage related to: #1820

This PR does not fix dataset download issues directly, but adds independent, reproducible local benchmarking support.


🔧 Installation & Usage

Clone the fork and use it as a drop-in tokenizers/tokenizers replacement:

git clone --branch faster-whitespace-pretok https://github.com/8ria/tokenizers.git
cd tokenizers/tokenizers
cargo bench --bench whitespace_bench

To benchmark your own sample inputs, edit benches/whitespace_bench.rs.


📦 Python Installation (from this fork)

To use the Python bindings with the optimized version:

pip install git+https://github.com/8ria/faster-whitespace-pretok.git#subdirectory=bindings/python

All Python-facing behavior remains identical to upstream tokenizers.


🙌 Why This Matters

Whitespace pre-tokenization is executed millions of times in ML workflows:

  • LLM inference
  • Prompt batching
  • Offline training pipelines

Even small improvements in this phase compound at scale, especially when parallelized.

This fork improves efficiency without changing outputs or APIs.


📫 Contact

AndriaK - hey@andriaK.com - GitHub
