
@SolarWindRider
Contributor

@SolarWindRider SolarWindRider commented Oct 24, 2025

What does this PR do?

This PR introduces a new trainer named PAPOTrainer, which extends GRPOTrainer to support the PAPO (Preference Alignment via Policy Optimization) algorithm.

Motivation

PAPO is a variant of GRPO that incorporates a contrastive preference optimization mechanism to improve stability when positive samples are sparse. However, the official code uses verl. To make PAPO easier for everyone to use, I implemented a TRL version based on the PAPO formula, and it runs successfully.
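
For context, here is a minimal sketch of the group-relative advantage that GRPO-style trainers compute, showing what a group with sparse positive rewards looks like. It illustrates only the GRPO baseline, not the PAPO-specific term. The function name, the `eps` value, and the use of the population standard deviation are assumptions made for this illustration and are not taken from the PR.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against its own sampling group (GRPO-style).

    Hypothetical helper for illustration; not part of TRL or of this PR.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std, chosen here for simplicity
    return [(r - mu) / (sigma + eps) for r in rewards]


# One positive sample in a group of four: it gets a positive advantage
# and the three other samples get negative ones.
sparse = group_relative_advantages([0.0, 0.0, 0.0, 1.0])

# No positive sample at all: every advantage is zero, so this group
# contributes no learning signal.
all_negative = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

When most groups look like `all_negative`, few samples carry any gradient, which is the sparse-positive regime this section refers to.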

Implementation Details

  • Added trl/trainer/papo_trainer.py
  • Added trl/trainer/papo_config.py
  • Updated __init__.py to include PAPOTrainer
  • All tests pass locally with pytest tests/trainer/test_papo_trainer.py -v

🧪 Example Usage

https://github.com/SolarWindRider/avr/blob/main/train_papo.py

I have tested the trainer (with PEFT and FSDP) on Ascend 910C and H20 (single node with 8 cards).

Checklist

  • I have tested this code locally
  • I have run all tests with pytest
  • I have followed the code style guidelines
  • I have added docstrings and comments
@qgallouedec
Member

Thanks for your contribution. Can you move this new trainer to trl.experimental instead? Ideally, we would also have a small mention in the paper index section of the documentation.

SolarWindRider and others added 4 commits October 26, 2025 00:51
	renamed:    trl/trainer/papo_config.py -> trl/experimental/papo/papo_config.py
	renamed:    trl/trainer/papo_trainer.py -> trl/experimental/papo/papo_trainer.py
	modified:   trl/trainer/__init__.py
@SolarWindRider
Contributor Author

Thanks for your contribution. Can you move this new trainer to trl.experimental instead? Ideally, we would also have a small mention in the paper index section of the documentation.

Thank you for your advice. I have moved the new trainer to trl.experimental and added PAPO to the paper index.
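
With the trainer under trl.experimental, usage might look like the sketch below. The module paths follow the renamed files listed in the commits above. The class name `PAPOConfig`, the constructor arguments (assumed to mirror `GRPOTrainer`, which `PAPOTrainer` extends), and the toy reward function are assumptions for illustration. The linked train_papo.py remains the authoritative example.

```python
def reward_short(completions, **kwargs):
    """Toy reward that prefers shorter completions.

    Placeholder for illustration only; assumes plain-string completions.
    """
    return [-float(len(c)) for c in completions]


def build_trainer(model_id, train_dataset, output_dir="papo-out"):
    """Build a PAPOTrainer; hypothetical helper, not part of the PR."""
    # Imported lazily so this file can be loaded without TRL installed.
    # Module paths come from the renamed files; PAPOConfig is an assumed name.
    from trl.experimental.papo.papo_config import PAPOConfig
    from trl.experimental.papo.papo_trainer import PAPOTrainer

    args = PAPOConfig(output_dir=output_dir)
    # Arguments assumed to match GRPOTrainer's constructor.
    return PAPOTrainer(
        model=model_id,
        reward_funcs=reward_short,
        args=args,
        train_dataset=train_dataset,
    )
```

Calling `build_trainer(...)` and then `.train()` on the result would start a run, provided the model and dataset are compatible with the trainer.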

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@qgallouedec qgallouedec left a comment

LGTM, thanks!

@qgallouedec qgallouedec changed the title feat(trainer): add PAPOTrainer for preference-based optimization Oct 31, 2025
@qgallouedec qgallouedec merged commit 06c059b into huggingface:main Oct 31, 2025
@SolarWindRider
Contributor Author

LGTM, thanks!

Thank you for your help!
{\__/}
( • - •)
/つThanks

@qgallouedec qgallouedec mentioned this pull request Dec 10, 2025
6 tasks
