@jue-jue-zi
Contributor

@jue-jue-zi jue-jue-zi commented Sep 15, 2025

What does this PR do?

When the training set is loaded with load_dataset(..., streaming=True), it is an IterableDataset, and its dataset.column_names may be None.

If use_liger_kernel is enabled at the same time, SFTTrainer will attempt to remove unused columns during data preparation:

collator_expected_keys = {"input_ids", "seq_lengths", "completion_mask", "assistant_masks"}
dataset.select_columns(collator_expected_keys.intersection(dataset.column_names))

Since dataset.column_names is None in this case, the call to intersection() receives None instead of an iterable and raises a TypeError, so training fails during data preparation.
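For reference, here is a minimal sketch of the failure and of one possible guard. It is not necessarily the exact change made in this PR. FakeStreamingDataset and select_expected_columns are hypothetical names used only for illustration; the stand-in class avoids needing the datasets library or a network download, and only mimics the two attributes involved.

```python
# Hypothetical stand-in for a streaming dataset; it is NOT the real
# datasets.IterableDataset, only a mimic of column_names / select_columns.
class FakeStreamingDataset:
    def __init__(self, column_names=None):
        self.column_names = column_names  # None when the schema is unknown
        self.selected = None

    def select_columns(self, columns):
        # Record which columns were requested, in a stable order.
        self.selected = sorted(columns)
        return self


collator_expected_keys = {"input_ids", "seq_lengths", "completion_mask", "assistant_masks"}


def select_expected_columns(dataset):
    """Keep only the columns the collator expects.

    Assumption: when column_names is None there is nothing to intersect
    with, so the dataset is returned unchanged.
    """
    column_names = dataset.column_names
    if column_names is None:
        return dataset
    return dataset.select_columns(collator_expected_keys.intersection(column_names))


# Reproduce the failure described above: intersecting with None raises TypeError.
streaming = FakeStreamingDataset(column_names=None)
try:
    collator_expected_keys.intersection(streaming.column_names)
    failure = None
except TypeError as exc:
    failure = exc

# The guarded helper leaves the streaming dataset untouched.
guarded = select_expected_columns(streaming)

# With known columns, only the expected ones are kept.
regular = select_expected_columns(FakeStreamingDataset(["input_ids", "text"]))
```

With the guard in place, a dataset whose columns are unknown passes through as-is, and a dataset with known columns is still trimmed to the collator's keys.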

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@jue-jue-zi jue-jue-zi force-pushed the main branch 2 times, most recently from f040ffb to f731eb8 Compare September 17, 2025 06:40
Member

@qgallouedec qgallouedec left a comment

cool, thanks!

@qgallouedec qgallouedec changed the title fix: use_liger_kernel with IterableDataset Sep 23, 2025
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@qgallouedec qgallouedec merged commit abe07c9 into huggingface:main Sep 23, 2025
6 of 10 checks passed
qgallouedec added a commit that referenced this pull request Sep 23, 2025
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
qgallouedec added a commit that referenced this pull request Oct 2, 2025
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>

3 participants