Multilang splitter #265

Merged
Ingvarstep merged 10 commits into urchade:main from oleksandrlukashov:multilang_splitter
Jun 12, 2025

@oleksandrlukashov

Contributor

No description provided.

@urchade urchade requested a review from Ingvarstep June 11, 2025 20:44
@urchade

urchade commented Jun 11, 2025

Owner

LGTM

@Ingvarstep

Collaborator

Good job overall. I have a few suggestions on how to make it more flexible and efficient:

  • I think it would be better to create a dictionary of splitters, so that if a new language appears, we can just add a new splitter.
  • By default, the splitter for other languages is WhitespaceTokenSplitter. I propose making this configurable; for example, it could be spaCy with the multi-language key xx.
  • Ideally, all tokenizer dependencies should be optional. Please see how this was done for onnx-gpu.
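
The three suggestions above could be combined roughly as follows. This is only a minimal sketch, not the code from this PR: `SPLITTER_FACTORIES`, `register_splitter`, `get_splitter`, and `SpacyBlankSplitter` are hypothetical names, and the `WhitespaceTokenSplitter` shown here is a simplified stand-in for the library's class. The only spaCy call it assumes is `spacy.blank("xx")`, which builds a tokenizer-only multi-language pipeline.

```python
import importlib.util
import re
from typing import Callable, Dict, Iterator, Tuple

Token = Tuple[str, int, int]  # (text, start offset, end offset)
Splitter = Callable[[str], Iterator[Token]]


class WhitespaceTokenSplitter:
    """Stand-in regex splitter: words (optionally joined by - or _) or single symbols."""

    def __init__(self) -> None:
        self.pattern = re.compile(r"\w+(?:[-_]\w+)*|\S")

    def __call__(self, text: str) -> Iterator[Token]:
        for match in self.pattern.finditer(text):
            yield match.group(), match.start(), match.end()


class SpacyBlankSplitter:
    """Tokenizer-only spaCy pipeline; spaCy is imported lazily so it stays optional."""

    def __init__(self, lang: str = "xx") -> None:
        import spacy  # optional dependency, only needed if this splitter is used

        self.nlp = spacy.blank(lang)

    def __call__(self, text: str) -> Iterator[Token]:
        for token in self.nlp(text):
            yield token.text, token.idx, token.idx + len(token.text)


# Hypothetical registry: key -> zero-argument factory that builds a splitter.
SPLITTER_FACTORIES: Dict[str, Callable[[], Splitter]] = {
    "whitespace": WhitespaceTokenSplitter,
}

# Register the spaCy multi-language splitter only when spaCy is installed.
if importlib.util.find_spec("spacy") is not None:
    SPLITTER_FACTORIES["xx"] = SpacyBlankSplitter


def register_splitter(key: str, factory: Callable[[], Splitter]) -> None:
    """Add (or override) a splitter for a language key."""
    SPLITTER_FACTORIES[key] = factory


def get_splitter(lang: str, default: str = "whitespace") -> Splitter:
    """Return the splitter for `lang`, or the `default` one if none is registered."""
    factory = SPLITTER_FACTORIES.get(lang) or SPLITTER_FACTORIES[default]
    return factory()


if __name__ == "__main__":
    splitter = get_splitter("de")  # "de" is not registered, so the default is used
    print(list(splitter("Hello, world!")))
```

Supporting a new language then amounts to one `register_splitter` call, and passing `default="xx"` would switch the fallback to spaCy on installations that have it. Note that `get_splitter` raises `KeyError` if the requested default itself is not registered.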
@Ingvarstep Ingvarstep merged commit 638310c into urchade:main Jun 12, 2025
3 participants