[GRPOTrainer]: Add SAPO Loss #4600
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
DaehanKim
left a comment
Thank you for your contribution! Here's a minor fix I want to suggest.
tests/test_grpo_trainer.py
Outdated
| trainer = GRPOTrainer( | ||
| model="trl-internal-testing/tiny-Qwen2ForCausalLM-2.5", | ||
| reward_funcs=[accuracy_reward], | ||
| ) |
since it's a static method, I don't think you need to instantiate the trainer
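To illustrate the point, here is a minimal sketch of calling a static method on the class itself. `TinyTrainer` and `scale_loss` are hypothetical stand-ins, not the TRL API; the constructor raises on purpose to show that it never runs.

```python
class TinyTrainer:
    """Hypothetical stand-in for a trainer class; not the TRL API."""

    def __init__(self):
        # Stands in for an expensive constructor (model loading, etc.).
        raise RuntimeError("construction is intentionally disabled in this sketch")

    @staticmethod
    def scale_loss(token_loss: float, temperature: float) -> float:
        # A static method uses neither `self` nor `cls`.
        return token_loss / temperature


# Called on the class itself, so __init__ is never executed.
result = TinyTrainer.scale_loss(2.0, temperature=4.0)
print(result)
```

In a test, this avoids loading a model just to exercise a pure function.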
qgallouedec
left a comment
very nice one, and great addition! I only have a few remarks, and I think we're good to merge!
trl/trainer/grpo_config.py
Outdated
| sapo_temperature_neg (`float`, *optional*): | ||
| Temperature for tokens with non-positive advantage scores used in the `sapo` loss function. If not | ||
| specified, it defaults to `1.05`. This parameter is introduced in the [Soft Adaptive Policy Optimization | ||
| paper](https://huggingface.co/papers/2511.20347). | ||
| sapo_temperature_pos (`float`, *optional*): | ||
| Temperature for tokens with positive advantage scores used in the `sapo` loss function. If not specified, | ||
| it defaults to `1.0`. This parameter is introduced in the [Soft Adaptive Policy Optimization | ||
| paper](https://huggingface.co/papers/2511.20347). |
| sapo_temperature_neg (`float`, *optional*): | |
| Temperature for tokens with non-positive advantage scores used in the `sapo` loss function. If not | |
| specified, it defaults to `1.05`. This parameter is introduced in the [Soft Adaptive Policy Optimization | |
| paper](https://huggingface.co/papers/2511.20347). | |
| sapo_temperature_pos (`float`, *optional*): | |
| Temperature for tokens with positive advantage scores used in the `sapo` loss function. If not specified, | |
| it defaults to `1.0`. This parameter is introduced in the [Soft Adaptive Policy Optimization | |
| paper](https://huggingface.co/papers/2511.20347). | |
| sapo_temperature_neg (`float`, *optional*, defaults to `1.05`): | |
| Temperature for tokens with non-positive advantage scores used in the `sapo` loss function. This parameter | |
| is introduced in the [Soft Adaptive Policy Optimization paper](https://huggingface.co/papers/2511.20347). | |
| sapo_temperature_pos (`float`, *optional*, defaults to `1.0`): | |
| Temperature for tokens with positive advantage scores used in the `sapo` loss function. This parameter is | |
| introduced in the [Soft Adaptive Policy Optimization paper](https://huggingface.co/papers/2511.20347). |
trl/trainer/grpo_config.py
Outdated
| "[ScaleRL paper](https://huggingface.co/papers/2510.13786) and the recommended value is `5.0`." | ||
| }, | ||
| ) | ||
| sapo_temperature_neg: float | None = field( |
| sapo_temperature_neg: float | None = field( | |
| sapo_temperature_neg: float = field( |
trl/trainer/grpo_config.py
Outdated
| "[Soft Adaptive Policy Optimization paper](https://huggingface.co/papers/2511.20347)." | ||
| }, | ||
| ) | ||
| sapo_temperature_pos: float | None = field( |
| sapo_temperature_pos: float | None = field( | |
| sapo_temperature_pos: float = field( |
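A rough sketch of what the two suggestions above amount to: since both fields always carry a concrete default, `float` is enough and `None` never has to be handled. `SapoArgs` is a hypothetical container used only for illustration; the field names and default values are the ones quoted in this PR.

```python
from dataclasses import dataclass, field, fields


@dataclass
class SapoArgs:
    """Hypothetical container; only the two field names come from the PR."""

    sapo_temperature_neg: float = field(
        default=1.05,
        metadata={"help": "Temperature for tokens with non-positive advantage scores."},
    )
    sapo_temperature_pos: float = field(
        default=1.0,
        metadata={"help": "Temperature for tokens with positive advantage scores."},
    )


args = SapoArgs()
# Defaults are real floats, so no `None` check is needed downstream.
print(args.sapo_temperature_neg, args.sapo_temperature_pos)
# The help text stays available through the field metadata.
print({f.name: f.metadata["help"] for f in fields(SapoArgs)})
```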
trl/trainer/grpo_config.py
Outdated
| "help": "Temperature for tokens with non-positive advantage scores used in the `sapo` loss function. " | ||
| "If not specified, it defaults to `1.05`. This parameter is introduced in the " | ||
| "[Soft Adaptive Policy Optimization paper](https://huggingface.co/papers/2511.20347)." |
| "help": "Temperature for tokens with non-positive advantage scores used in the `sapo` loss function. " | |
| "If not specified, it defaults to `1.05`. This parameter is introduced in the " | |
| "[Soft Adaptive Policy Optimization paper](https://huggingface.co/papers/2511.20347)." | |
| "help": "Temperature for tokens with non-positive advantage scores used in the `sapo` loss function. " | |
| "This parameter is introduced in the [Soft Adaptive Policy Optimization " | |
| "paper](https://huggingface.co/papers/2511.20347)." |
trl/trainer/grpo_config.py
Outdated
| "help": "Temperature for tokens with positive advantage scores used in the `sapo` loss function. " | ||
| "If not specified, it defaults to `1.0`. This parameter is introduced in the " | ||
| "[Soft Adaptive Policy Optimization paper](https://huggingface.co/papers/2511.20347)." |
| "help": "Temperature for tokens with positive advantage scores used in the `sapo` loss function. " | |
| "If not specified, it defaults to `1.0`. This parameter is introduced in the " | |
| "[Soft Adaptive Policy Optimization paper](https://huggingface.co/papers/2511.20347)." | |
| "help": "Temperature for tokens with positive advantage scores used in the `sapo` loss function. " | |
| "This parameter is introduced in the [Soft Adaptive Policy Optimization " | |
| "paper](https://huggingface.co/papers/2511.20347)." |
docs/source/paper_index.md
Outdated
| ```python | ||
| from trl import GRPOConfig | ||
|
|
||
| config = GRPOConfig( |
for consistency
| config = GRPOConfig( | |
| training_args = GRPOConfig( |
trl/trainer/grpo_trainer.py
Outdated
| per_token_loss = -torch.min(per_token_loss1, per_token_loss2) | ||
| elif self.loss_type == "sapo": | ||
| per_token_loss = torch.empty_like(coef_1) | ||
| pos_advantages_mask = advantages.repeat([1, coef_1.shape[1]]) > 0 |
Can you use `positive` instead of `pos`?
| if self.loss_type == "grpo": | ||
| if self.loss_type in ["grpo", "sapo"]: | ||
| loss = ((per_token_loss * completion_mask).sum(-1) / completion_mask.sum(-1).clamp(min=1.0)).mean() | ||
| loss = loss / self.current_gradient_accumulation_steps | ||
| elif self.loss_type == "bnpo": | ||
| loss = (per_token_loss * completion_mask).sum() / completion_mask.sum().clamp(min=1.0) | ||
| loss = loss / self.current_gradient_accumulation_steps | ||
| elif self.loss_type == "dr_grpo": | ||
| loss = (per_token_loss * completion_mask).sum() / (per_token_loss.size(0) * self.max_completion_length) | ||
| loss = loss / self.current_gradient_accumulation_steps | ||
| elif self.loss_type in ["cispo", "dapo"]: | ||
| normalizer = inputs["num_items_in_batch"] / self.accelerator.num_processes | ||
| loss = (per_token_loss * completion_mask).sum() / normalizer | ||
| else: | ||
| raise ValueError(f"Unknown loss type: {self.loss_type}") |
I'm wondering (not for this PR) if we should decouple the aggregation type and the loss type
Yes! I find the current setup a bit annoying tbh.
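For reference, one way the decoupling could look, as a sketch only: the per-token loss is computed separately, and a standalone function picks the reduction. The `aggregate` function and its mode names are hypothetical, and the division by gradient-accumulation steps from the snippet above is left out. Plain Python lists stand in for tensors.

```python
def aggregate(per_token_loss, completion_mask, mode):
    """Reduce a [batch][tokens] loss to a scalar; `mode` names are hypothetical."""
    masked = [
        [loss * m for loss, m in zip(loss_row, mask_row)]
        for loss_row, mask_row in zip(per_token_loss, completion_mask)
    ]
    if mode == "sequence_mean":
        # Mean over the tokens of each sequence, then mean over sequences
        # (the normalization used by the "grpo"/"sapo" branch).
        per_sequence = [
            sum(row) / max(sum(mask_row), 1.0)
            for row, mask_row in zip(masked, completion_mask)
        ]
        return sum(per_sequence) / len(per_sequence)
    if mode == "token_mean":
        # One mean over all unmasked tokens in the batch (the "bnpo" branch).
        total_tokens = sum(sum(mask_row) for mask_row in completion_mask)
        return sum(sum(row) for row in masked) / max(total_tokens, 1.0)
    raise ValueError(f"Unknown aggregation mode: {mode}")


losses = [[1.0, 1.0, 1.0, 1.0], [4.0, 0.0, 0.0, 0.0]]
mask = [[1.0, 1.0, 1.0, 1.0], [1.0, 0.0, 0.0, 0.0]]
sequence_level = aggregate(losses, mask, "sequence_mean")
token_level = aggregate(losses, mask, "token_mean")
print(sequence_level, token_level)
```

The two reductions disagree whenever completions differ in length, which is why the choice matters independently of the per-token loss.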
tests/test_grpo_trainer.py
Outdated
| dataset = load_dataset("trl-internal-testing/zen", "standard_prompt_only", split="train") | ||
|
|
||
| training_args = GRPOConfig( | ||
| fp16=True, |
| fp16=True, |
qgallouedec
left a comment
lgtm!
commit f278d03 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Fri Dec 5 19:34:42 2025 +0100 Remove no longer applicable warning once BCO was moved to experimental (huggingface#4628)
commit e7071bf Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Fri Dec 5 10:07:16 2025 -0700 Add logos as assets (huggingface#4627) Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
commit 794d87f Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Fri Dec 5 08:45:20 2025 +0100 Add missing experimental autodoc classes to docs (huggingface#4618)
commit bc7888d Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Fri Dec 5 07:48:33 2025 +0100 Raise FutureWarning for trainer moved to experimental (huggingface#4620)
commit fce5dfd Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Fri Dec 5 07:34:04 2025 +0100 Raise warnings at 2nd stack level (huggingface#4621)
commit c5da8ec Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Fri Dec 5 07:33:04 2025 +0100 Silence experimental warning during docs build (huggingface#4623)
commit 2af35fb Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Thu Dec 4 21:23:41 2025 -0700 Clean up model preparation (huggingface#4577)
commit cbd90d4 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Thu Dec 4 20:05:43 2025 +0100 Remove deprecated batched formatting in GOLDTrainer (huggingface#4622)
commit 903b57d Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> Date: Thu Dec 4 19:16:00 2025 +0100 Update ministral notebooks with official bf16 ckpt (huggingface#4626)
commit 9266135 Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> Date: Thu Dec 4 19:01:46 2025 +0100 Fix link to OpenEnv blog in docs (huggingface#4625)
commit 495381d Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> Date: Thu Dec 4 11:32:34 2025 +0100 Fix README style (huggingface#4619)
commit ddb65e8 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Wed Dec 3 21:20:40 2025 +0100 Add experimental imports to docs (huggingface#4616) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
commit 5fab472 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Wed Dec 3 17:38:16 2025 +0100 Replace arXiv paper links with HF links (huggingface#4613) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
commit a3c1dfb Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> Date: Wed Dec 3 17:28:45 2025 +0100 Add ministral 3 free notebooks (huggingface#4614)
commit 560fd3d Author: Pramodith Ballapuram <16939722+pramodith@users.noreply.github.com> Date: Wed Dec 3 10:12:20 2025 +0000 [GRPOTrainer]: Add SAPO Loss (huggingface#4600)
commit 814d4af Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Tue Dec 2 15:52:51 2025 -0700 Move MergeModelCallback to experimental (huggingface#4608) Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
commit 2a81076 Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> Date: Tue Dec 2 15:07:11 2025 +0100 Fixed OpenEnv example scripts (huggingface#4610)
commit de343cd Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Tue Dec 2 07:32:22 2025 +0100 Remove deprecations for 0.26 release (huggingface#4607)
commit 07b4a84 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Mon Dec 1 12:55:24 2025 -0700 Silence experimental warnings when imported in the stable (huggingface#4606)
commit c55ef4b Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Mon Dec 1 12:40:42 2025 -0700 Update How-to guides (huggingface#4604)
commit c686d7d Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Mon Dec 1 20:34:31 2025 +0100 Raise FutureWarning for classes moved to experimental (huggingface#4605)
commit c7d172b Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Mon Dec 1 01:47:22 2025 -0800 docs: Expand speeding up training guide with acceleration methods (huggingface#4428) Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
commit f1dfef0 Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Mon Dec 1 01:39:08 2025 -0800 docs: Expand training customization examples (huggingface#4427) Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
commit eb76389 Author: LeonEricsson <70749762+LeonEricsson@users.noreply.github.com> Date: Sun Nov 30 16:45:21 2025 +0100 [GRPO] Sequence-level TIS & MIS (huggingface#4530)
commit 0726977 Author: xuanduy04 <65279552+xuanduy04@users.noreply.github.com> Date: Fri Nov 28 23:56:22 2025 +0700 docs: Add Beyond the 80/20 Rule (2506.01939) to Paper Index (huggingface#4580) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
commit 9731d08 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Fri Nov 28 17:43:38 2025 +0100 Revert "Hotfix CI with dev dependencies: xfail test_prepare_inputs_for_generation" (huggingface#4587) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
commit 84a0bbc Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Fri Nov 28 16:13:56 2025 +0100 Fix 'generation_config' AttributeError (huggingface#4596)
commit f67c3f2 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Fri Nov 28 15:46:02 2025 +0100 Remove module-level imports of extra deps in experimental.judges (huggingface#4598)
commit cb5fdf9 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Thu Nov 27 11:08:26 2025 +0100 Add missing require_bitsandbytes marker to CI tests (huggingface#4586)
commit 4a3b584 Author: juejuezi <juejuezi.git@foxmail.com> Date: Thu Nov 27 00:11:56 2025 +0800 fix: use shift_labels for metrics when using CP or SP (huggingface#4579) Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
commit d2e4315 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Wed Nov 26 15:40:15 2025 +0100 Revert hotfix Fall back to config.text_config._name_or_path (huggingface#4581)
commit 357e331 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Nov 26 04:55:46 2025 -0700 Move tests for GSPOTokenTrainer to experimental (huggingface#4572)
commit a59f2cf Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Nov 26 04:50:44 2025 -0700 Move `WinRateCallback` to experimental (huggingface#4558) Co-authored-by: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
commit cf431db Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Nov 26 04:11:04 2025 -0700 Fix PPO example (huggingface#4556)
commit cac9f1d Author: Pramodith Ballapuram <16939722+pramodith@users.noreply.github.com> Date: Tue Nov 25 21:27:58 2025 +0000 Fix Replay Buffer docs. (huggingface#4574)
What does this PR do?
The Qwen team released a new paper, Soft Adaptive Policy Optimization, which introduces a loss called SAPO. It replaces the hard clipping of token losses with a temperature-controlled sigmoid gate.
The loss function is formulated as
with r and A being the importance ratios and advantage scores, respectively.
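A minimal sketch of a temperature-controlled sigmoid gate of this kind, for a single token. The exact gate form `sigmoid(temperature * (r - 1)) * 4 / temperature` is an assumption based on my reading of the paper, not something stated in this description, and the function names are hypothetical. The default temperatures `1.0` and `1.05` are the ones documented in this PR.

```python
import math


def sapo_gate(ratio: float, temperature: float) -> float:
    # Assumed gate form: sigmoid(temperature * (ratio - 1)) * 4 / temperature.
    return 4.0 / temperature / (1.0 + math.exp(-temperature * (ratio - 1.0)))


def sapo_token_loss(
    ratio: float,
    advantage: float,
    temperature_pos: float = 1.0,
    temperature_neg: float = 1.05,
) -> float:
    # Positive advantages use one temperature, non-positive ones the other.
    temperature = temperature_pos if advantage > 0 else temperature_neg
    # Negated because the trainer minimizes the loss.
    return -sapo_gate(ratio, temperature) * advantage


# Numerical slope of the gate at ratio == 1 (central difference).
eps = 1e-6
slope_at_one = (sapo_gate(1.0 + eps, 1.05) - sapo_gate(1.0 - eps, 1.05)) / (2 * eps)
print(sapo_gate(1.0, 1.0), slope_at_one)
```

Under this assumed form, the `4 / temperature` factor makes the gate's slope at `r = 1` equal to 1 for any temperature, so near on-policy tokens get the same gradient as the unclipped objective, while the sigmoid smoothly damps tokens whose ratio drifts far from 1.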
Fixes # (issue)
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.