Add PAPOTrainer for preference-based optimization #4334

SolarWindRider · 2025-10-24T08:38:20Z

What does this PR do?

This PR introduces a new trainer named PAPOTrainer, which extends GRPOTrainer to support the PAPO (Preference Alignment via Policy Optimization) algorithm.

Motivation

PAPO is a variant of GRPO that incorporates a contrastive preference optimization mechanism to improve stability when positive samples are sparse. However, the official code uses verl. To make PAPO easier for everyone to use, I implemented a TRL version based on the PAPO formula, and it runs successfully.
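
For context on the sparse-positive case, here is a minimal sketch of the group-relative advantage that GRPO computes. It is not PAPO itself and not the code in this PR. The function name `group_relative_advantages`, the `eps` value, and the use of the population standard deviation are illustrative assumptions, not TRL's API.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for the same prompt.

    Hypothetical helper for illustration only: each reward is centered on the
    group mean and scaled by the group standard deviation (plus eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# One rewarded completion among eight: it gets a positive advantage,
# the other seven get small negative ones.
sparse = group_relative_advantages([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])

# No rewarded completion at all: every advantage is zero.
empty = group_relative_advantages([0.0] * 8)
```

When no completion in a group is rewarded, every advantage is zero and that group contributes no learning signal, which is the regime the motivation above refers to.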

Implementation Details

Added trl/trainer/papo_trainer.py
Added trl/trainer/papo_config.py
Updated __init__.py to include PAPOTrainer
All tests pass locally with pytest tests/trainer/test_papo_trainer.py -v

🧪 Example Usage

https://github.com/SolarWindRider/avr/blob/main/train_papo.py

I have tested my trainer (with PEFT and FSDP) on Ascend910C and H20 (single node with 8 cards).

Checklist

I have tested this code locally
I have run all tests with pytest
I have followed the code style guidelines
I have added docstrings and comments

qgallouedec · 2025-10-24T23:57:52Z

Thanks for your contribution. Can you move this new trainer to trl.experimental instead? Ideally, we would also have a small mention in the paper index section of the documentation.

renamed: trl/trainer/papo_config.py -> trl/experimental/papo/papo_config.py
renamed: trl/trainer/papo_trainer.py -> trl/experimental/papo/papo_trainer.py
modified: trl/trainer/__init__.py

SolarWindRider · 2025-10-25T17:20:26Z

Thanks for your contribution. Can you move this new trainer to trl.experimental instead? Ideally, we would also have a small mention in the paper index section of the documentation.

Thank you for your advice. I have moved the new trainer to trl.experimental and added PAPO to the paper index.


HuggingFaceDocBuilderDev · 2025-10-31T02:57:40Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

qgallouedec

LGTM, thanks!

SolarWindRider · 2025-10-31T03:42:03Z

LGTM, thanks!

Thank you for your help!
{\__/}
( • - •)
/つThanks

solarwindrider and others added 3 commits October 24, 2025 16:19

feat(trainer): add PAPOTrainer for preference-based optimization

727a783

Merge branch 'main' into feat/trainer-papo

c2c4c56

Merge branch 'main' into feat/trainer-papo

c32a23c

SolarWindRider and others added 4 commits October 26, 2025 00:51

Merge branch 'main' into feat/trainer-papo

e4af368

new file: trl/experimental/papo/__init__.py

1b1fb4c

renamed: trl/trainer/papo_config.py -> trl/experimental/papo/papo_config.py
renamed: trl/trainer/papo_trainer.py -> trl/experimental/papo/papo_trainer.py
modified: trl/trainer/__init__.py

move papo to exp and add paper index

a3751e2

clean trainer __init__.py

4f2b250

qgallouedec reviewed Oct 28, 2025

docs/source/paper_index.md (outdated, resolved)

precommit

75b231d

qgallouedec reviewed Oct 28, 2025

trl/experimental/papo/papo_trainer.py (outdated, resolved)

SolarWindRider and others added 10 commits October 28, 2025 14:20

Merge branch 'huggingface:main' into feat/trainer-papo

59e7222

paper index info And a class method for the citation

04b986b

Merge branch 'feat/trainer-papo' of https://github.com/SolarWindRider…

25c3595

…/trl-papo into feat/trainer-papo

Merge branch 'main' into feat/trainer-papo

6d25b0a

fix conversational inputs bugs

6422795

Merge branch 'main' into feat/trainer-papo

4ccfd28

Merge branch 'main' into feat/trainer-papo

bfaee35

Merge branch 'main' into feat/trainer-papo

f85a53d

perception clip

57fbcd8

Merge branch 'main' into feat/trainer-papo

63670e6

SolarWindRider requested a review from qgallouedec October 31, 2025 02:44

SolarWindRider closed this Oct 31, 2025

SolarWindRider reopened this Oct 31, 2025

qgallouedec added 2 commits October 31, 2025 02:49

Style and add section to doc

c8511d0

fix link

b0ed5c0

add .

68d3ce6

qgallouedec approved these changes Oct 31, 2025

qgallouedec changed the title from "feat(trainer): add PAPOTrainer for preference-based optimization" to "Add PAPOTrainer for preference-based optimization" Oct 31, 2025

qgallouedec merged commit 06c059b into huggingface:main Oct 31, 2025

qgallouedec mentioned this pull request Dec 10, 2025

feature: Add RTPO Trainer #4652

Open

6 tasks
