[GRPOTrainer]: Add SAPO Loss #4600

pramodith · 2025-11-28T16:01:16Z

What does this PR do?

The Qwen team released a new paper, Soft Adaptive Policy Optimization, which introduces a loss called SAPO that replaces hard clipping of token losses with a temperature-controlled sigmoid gate.

We propose Soft Adaptive Policy Optimization (SAPO), which
replaces hard clipping with a smooth, temperature-controlled gate that adaptively attenuates off-policy updates while preserving useful learning signals.

The loss function is formulated as

where r and A are the importance ratios and the advantage scores, respectively.
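Since the equation image did not survive here, the following is a sketch of the gated objective as I understand it from the SAPO paper, not a copy of the original figure. The symbols $\tau_{\text{pos}}$ and $\tau_{\text{neg}}$ are assumed to correspond to `sapo_temperature_pos` and `sapo_temperature_neg`, and $G$ and $|y_i|$ (group size and completion length) are notation introduced only for this sketch.

```latex
\mathcal{J}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} f_{i,t}\big(r_{i,t}(\theta)\big)\, A_{i,t} \right],
\qquad
f_{i,t}(x) = \sigma\big(\tau_{i,t}\,(x - 1)\big) \cdot \frac{4}{\tau_{i,t}},
\qquad
\tau_{i,t} =
\begin{cases}
\tau_{\text{pos}} & \text{if } A_{i,t} > 0 \\
\tau_{\text{neg}} & \text{otherwise}
\end{cases}
```

With this form, the derivative of $f_{i,t}$ at $x = 1$ is $4\,\sigma(0)\,(1 - \sigma(0)) = 1$, so on-policy tokens get the same gradient as the unclipped objective, while tokens far from $r = 1$ are smoothly attenuated.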


HuggingFaceDocBuilderDev · 2025-11-28T16:08:26Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.


DaehanKim

Thank you for your contribution! Here's a minor fix I want to suggest.

trl/trainer/grpo_config.py

qgallouedec · 2025-12-01T19:56:58Z

tests/test_grpo_trainer.py

+        trainer = GRPOTrainer(
+            model="trl-internal-testing/tiny-Qwen2ForCausalLM-2.5",
+            reward_funcs=[accuracy_reward],
+        )


since it's a static method, I don't think you need to instantiate the trainer
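To illustrate the point, here is a minimal sketch of testing a static method directly on the class. `TrainerSketch` and `sapo_gate` are hypothetical stand-ins, not the names used in this PR, and the gate formula is the assumed form `sigmoid(temperature * (ratio - 1)) * 4 / temperature`.

```python
import math


class TrainerSketch:
    """Hypothetical stand-in for a trainer whose constructor is expensive."""

    def __init__(self):
        # Simulates the cost of loading a model; the test below never gets here.
        raise RuntimeError("constructing the trainer is not needed for this test")

    @staticmethod
    def sapo_gate(ratio: float, temperature: float) -> float:
        # Assumed gate form: sigmoid(temperature * (ratio - 1)) * 4 / temperature
        sigmoid = 1.0 / (1.0 + math.exp(-temperature * (ratio - 1.0)))
        return sigmoid * 4.0 / temperature


def test_sapo_gate_on_policy():
    # Called on the class itself, so no instance and no model download are required.
    assert math.isclose(TrainerSketch.sapo_gate(1.0, 1.0), 2.0)


test_sapo_gate_on_policy()
```

Dropping the instantiation also removes the model download from the test, which should make it faster and less flaky.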

qgallouedec

very nice one, and great addition! I only have a few remarks, and I think we're good to merge!

qgallouedec · 2025-12-01T19:58:54Z

trl/trainer/grpo_config.py

+        sapo_temperature_neg (`float`, *optional*):
+            Temperature for tokens with non-positive advantage scores used in the `sapo` loss function. If not
+            specified, it defaults to `1.05`. This parameter is introduced in the [Soft Adaptive Policy Optimization
+            paper](https://huggingface.co/papers/2511.20347).
+        sapo_temperature_pos (`float`, *optional*):
+            Temperature for tokens with positive advantage scores used in the `sapo` loss function. If not specified,
+            it defaults to `1.0`. This parameter is introduced in the [Soft Adaptive Policy Optimization
+            paper](https://huggingface.co/papers/2511.20347).


Suggested change

-        sapo_temperature_neg (`float`, *optional*):
-            Temperature for tokens with non-positive advantage scores used in the `sapo` loss function. If not
-            specified, it defaults to `1.05`. This parameter is introduced in the [Soft Adaptive Policy Optimization
-            paper](https://huggingface.co/papers/2511.20347).
-        sapo_temperature_pos (`float`, *optional*):
-            Temperature for tokens with positive advantage scores used in the `sapo` loss function. If not specified,
-            it defaults to `1.0`. This parameter is introduced in the [Soft Adaptive Policy Optimization
-            paper](https://huggingface.co/papers/2511.20347).
+        sapo_temperature_neg (`float`, *optional*, defaults to `1.05`):
+            Temperature for tokens with non-positive advantage scores used in the `sapo` loss function. This parameter
+            is introduced in the [Soft Adaptive Policy Optimization paper](https://huggingface.co/papers/2511.20347).
+        sapo_temperature_pos (`float`, *optional*, defaults to `1.0`):
+            Temperature for tokens with positive advantage scores used in the `sapo` loss function. This parameter is
+            introduced in the [Soft Adaptive Policy Optimization paper](https://huggingface.co/papers/2511.20347).

qgallouedec · 2025-12-01T19:59:49Z

trl/trainer/grpo_config.py

            "[ScaleRL paper]https://huggingface.co/papers/2510.13786) and the recommended value is `5.0`."
        },
    )
+    sapo_temperature_neg: float | None = field(


Suggested change

-    sapo_temperature_neg: float | None = field(
+    sapo_temperature_neg: float = field(

qgallouedec · 2025-12-01T19:59:57Z

trl/trainer/grpo_config.py

+            "[Soft Adaptive Policy Optimization paper](https://huggingface.co/papers/2511.20347)."
+        },
+    )
+    sapo_temperature_pos: float | None = field(


Suggested change

-    sapo_temperature_pos: float | None = field(
+    sapo_temperature_pos: float = field(
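For context, a minimal sketch of why `| None` becomes unnecessary once the field carries a concrete default. `SapoArgsSketch` is a hypothetical dataclass used only for illustration; the real fields live in `GRPOConfig`, and the help strings here are shortened.

```python
from dataclasses import dataclass, field


@dataclass
class SapoArgsSketch:
    # Hypothetical stand-in for the two new config fields.
    # A concrete float default means the value is never None at runtime,
    # so the annotation can be plain `float`.
    sapo_temperature_neg: float = field(
        default=1.05,
        metadata={"help": "Temperature for tokens with non-positive advantage scores."},
    )
    sapo_temperature_pos: float = field(
        default=1.0,
        metadata={"help": "Temperature for tokens with positive advantage scores."},
    )


args = SapoArgsSketch()
print(args.sapo_temperature_neg, args.sapo_temperature_pos)  # prints: 1.05 1.0
```

The same reasoning applies to the docstring: "defaults to `1.05`" in the signature line replaces the "If not specified, it defaults to" sentence.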

qgallouedec · 2025-12-01T20:00:50Z

trl/trainer/grpo_config.py

+            "help": "Temperature for tokens with non-positive advantage scores used in the `sapo` loss function. "
+            "If not specified, it defaults to `1.05`. This parameter is introduced in the "
+            "[Soft Adaptive Policy Optimization paper](https://huggingface.co/papers/2511.20347)."


Suggested change

-            "help": "Temperature for tokens with non-positive advantage scores used in the `sapo` loss function. "
-            "If not specified, it defaults to `1.05`. This parameter is introduced in the "
-            "[Soft Adaptive Policy Optimization paper](https://huggingface.co/papers/2511.20347)."
+            "help": "Temperature for tokens with non-positive advantage scores used in the `sapo` loss function. "
+            "This parameter is introduced in the [Soft Adaptive Policy Optimization "
+            "paper](https://huggingface.co/papers/2511.20347)."

qgallouedec · 2025-12-01T20:01:27Z

trl/trainer/grpo_config.py

+            "help": "Temperature for tokens with positive advantage scores used in the `sapo` loss function. "
+            "If not specified, it defaults to `1.0`. This parameter is introduced in the "
+            "[Soft Adaptive Policy Optimization paper](https://huggingface.co/papers/2511.20347)."


Suggested change

-            "help": "Temperature for tokens with positive advantage scores used in the `sapo` loss function. "
-            "If not specified, it defaults to `1.0`. This parameter is introduced in the "
-            "[Soft Adaptive Policy Optimization paper](https://huggingface.co/papers/2511.20347)."
+            "help": "Temperature for tokens with positive advantage scores used in the `sapo` loss function. "
+            "This parameter is introduced in the [Soft Adaptive Policy Optimization "
+            "paper](https://huggingface.co/papers/2511.20347)."

qgallouedec · 2025-12-01T20:02:37Z

docs/source/paper_index.md

+```python
+from trl import GRPOConfig
+
+config = GRPOConfig(


for consistency

Suggested change

-config = GRPOConfig(
+training_args = GRPOConfig(

qgallouedec · 2025-12-01T20:05:44Z

trl/trainer/grpo_trainer.py

            per_token_loss = -torch.min(per_token_loss1, per_token_loss2)
+        elif self.loss_type == "sapo":
+            per_token_loss = torch.empty_like(coef_1)
+            pos_advantages_mask = advantages.repeat([1, coef_1.shape[1]]) > 0


can you use positive instead of pos
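For readers following the truncated hunk above, here is a rough pure-Python sketch of what the `sapo` branch appears to do: pick a temperature per sequence based on the sign of its advantage, gate each token's importance ratio, and negate the gated objective to get a loss. The function names, the list-based inputs, and the gate formula are assumptions for illustration; the PR itself operates on tensors (`coef_1`, `advantages`).

```python
import math


def soft_gate(ratio, temperature):
    # Assumed gate form: sigmoid(temperature * (ratio - 1)) * 4 / temperature
    return 4.0 / temperature / (1.0 + math.exp(-temperature * (ratio - 1.0)))


def sapo_per_token_loss(ratios, advantages, temperature_pos=1.0, temperature_neg=1.05):
    """ratios: one list of per-token importance ratios per sequence.
    advantages: one scalar advantage per sequence, shared by all its tokens."""
    per_token_loss = []
    for seq_ratios, advantage in zip(ratios, advantages):
        # Plays the role of the positive-advantage mask in the tensor version.
        positive_advantage = advantage > 0
        temperature = temperature_pos if positive_advantage else temperature_neg
        # Loss is the negated gated objective.
        per_token_loss.append([-soft_gate(r, temperature) * advantage for r in seq_ratios])
    return per_token_loss


losses = sapo_per_token_loss(ratios=[[1.0, 1.2], [1.0, 0.8]], advantages=[0.5, -0.5])
```

Spelling the flag out as `positive_advantage` rather than `pos` matches the naming request in this comment.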

qgallouedec · 2025-12-01T20:13:30Z

trl/trainer/grpo_trainer.py

-        if self.loss_type == "grpo":
+        if self.loss_type in ["grpo", "sapo"]:
            loss = ((per_token_loss * completion_mask).sum(-1) / completion_mask.sum(-1).clamp(min=1.0)).mean()
            loss = loss / self.current_gradient_accumulation_steps
        elif self.loss_type == "bnpo":
            loss = (per_token_loss * completion_mask).sum() / completion_mask.sum().clamp(min=1.0)
            loss = loss / self.current_gradient_accumulation_steps
        elif self.loss_type == "dr_grpo":
            loss = (per_token_loss * completion_mask).sum() / (per_token_loss.size(0) * self.max_completion_length)
            loss = loss / self.current_gradient_accumulation_steps
        elif self.loss_type in ["cispo", "dapo"]:
            normalizer = inputs["num_items_in_batch"] / self.accelerator.num_processes
            loss = (per_token_loss * completion_mask).sum() / normalizer
        else:
            raise ValueError(f"Unknown loss type: {self.loss_type}")


I'm wondering (not for this PR) if we should decouple the aggregation type and the loss type

Yes! I find the current setup a bit annoying tbh.
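To make the decoupling idea concrete, here is a rough sketch, not a proposal for the actual TRL API. It mirrors two of the aggregation branches above with plain Python lists and looks them up by name independently of the loss type. `AGGREGATORS`, `sequence_mean`, and `token_mean` are hypothetical names, and the division by gradient accumulation steps is left out for brevity.

```python
def aggregate_sequence_mean(per_token_loss, completion_mask):
    """Mean over unmasked tokens within each sequence, then mean over sequences
    (the aggregation used by the "grpo" and "sapo" branch)."""
    seq_means = []
    for losses, mask in zip(per_token_loss, completion_mask):
        n_tokens = max(sum(mask), 1.0)  # mirrors clamp(min=1.0)
        seq_means.append(sum(l * m for l, m in zip(losses, mask)) / n_tokens)
    return sum(seq_means) / len(seq_means)


def aggregate_token_mean(per_token_loss, completion_mask):
    """Mean over all unmasked tokens in the batch (the "bnpo" branch)."""
    total = sum(
        l * m
        for losses, mask in zip(per_token_loss, completion_mask)
        for l, m in zip(losses, mask)
    )
    n_tokens = max(sum(sum(mask) for mask in completion_mask), 1.0)
    return total / n_tokens


# Hypothetical registry: the aggregation is chosen by its own key,
# so any per-token loss can be paired with any aggregation.
AGGREGATORS = {
    "sequence_mean": aggregate_sequence_mean,
    "token_mean": aggregate_token_mean,
}

per_token_loss = [[1.0, 3.0], [5.0, 0.0]]
completion_mask = [[1, 1], [1, 0]]
print(AGGREGATORS["sequence_mean"](per_token_loss, completion_mask))  # prints: 3.5
print(AGGREGATORS["token_mean"](per_token_loss, completion_mask))  # prints: 3.0
```

The two results differ because the sequence-level mean weights the short second sequence as heavily as the first, while the token-level mean weights every unmasked token equally.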

qgallouedec · 2025-12-01T20:32:22Z

tests/test_grpo_trainer.py

        dataset = load_dataset("trl-internal-testing/zen", "standard_prompt_only", split="train")

        training_args = GRPOConfig(
+            fp16=True,


Suggested change

-            fp16=True,

qgallouedec

lgtm!


pramodith and others added 4 commits November 28, 2025 15:57

Add SAPO loss

7455847

add "sapo" in aggregation logic.

1063543

fix condition.

6fdc8dd

Merge branch 'main' into pramodith/sapo_loss

af23ea9

pramodith and others added 7 commits November 28, 2025 16:08

Add sapo to paper index.

16e3d50

Merge branch 'pramodith/sapo_loss' of https://github.com/pramodith/trl into pramodith/sapo_loss

bc633cc

minor fixes.

290bf3e

precommit.

61cfc67

use group std.

0c9b3d7

fix test case.

ff0fa75

Merge branch 'main' into pramodith/sapo_loss

cce8abe

DaehanKim suggested changes Nov 29, 2025

qgallouedec reviewed Dec 1, 2025

qgallouedec requested changes Dec 1, 2025

qgallouedec reviewed Dec 1, 2025

pramodith added 2 commits December 2, 2025 01:06

Address comments.

804b620

Merge branch 'main' into pramodith/sapo_loss

1b2f821

qgallouedec approved these changes Dec 3, 2025

pramodith merged commit 560fd3d into huggingface:main Dec 3, 2025
8 of 9 checks passed

qgallouedec pushed a commit to neha222222/trl that referenced this pull request Dec 5, 2025

[GRPOTrainer]: Add SAPO Loss (huggingface#4600)

947c02a
