[Bug Fix] OnlineDPOTrainer with vLLM Server Mode #4500

YangKai0616 · 2025-11-07T09:58:10Z

What does this PR do?

This PR:

1. [XPU] Enable vllm_mode="server" for OnlineDPOTrainer on XPU;
2. [General] Correctly parse the return value of VLLMClient.generate;
3. [General] Correctly clean up the vLLM communicator between multiple UT cases. (Code format reference link)

Root Cause:

Issue 2 is caused by incorrect handling of the return value from vllm_client.generate in _generate_vllm_server.
Without this PR, the test would fail with the following error:

    def _generate_vllm(self, prompts, images=None):
        eos_token_id = self.eos_token_id
        pad_token_id = self.pad_token_id
    
        # Generate completion_ids and prompt_ids based on mode
        if self.vllm_mode == "server":
            completion_ids, prompt_ids = self._generate_vllm_server(prompts, images)
        elif self.vllm_mode == "colocate":
            completion_ids, prompt_ids = self._generate_vllm_colocate(prompts, images)
    
        # Shared padding, masking, and tensor conversion logic
        max_prompt_length = max(len(ids) for ids in prompt_ids)
        prompt_mask = [[0] * (max_prompt_length - len(ids)) + [1] * len(ids) for ids in prompt_ids]
        prompt_ids = [[pad_token_id] * (max_prompt_length - len(ids)) + ids for ids in prompt_ids]
        max_tokens = self.generation_config.max_tokens
        completion_mask = [[1] * len(ids) + [0] * (max_tokens - len(ids)) for ids in completion_ids]
        completion_ids = [
            ids + [eos_token_id] if ids[-1] != eos_token_id and len(ids) < max_tokens else ids
            for ids in completion_ids
        ]
        completion_ids = [ids + [pad_token_id] * (max_tokens - len(ids)) for ids in completion_ids]
    
        # Convert to tensors
        prompt_ids = torch.tensor(prompt_ids, device=self.accelerator.device)
        prompt_mask = torch.tensor(prompt_mask, device=self.accelerator.device)
>       completion_ids = torch.tensor(completion_ids, device=self.accelerator.device)
E       ValueError: too many dimensions 'str'

trl/trainer/online_dpo_trainer.py:714: ValueError
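To illustrate the class of bug behind Issue 2, here is a minimal sketch, not the code from this PR. It assumes the generate call returns a mapping with a "completion_ids" entry rather than a bare list of token-ID lists; the helper name `extract_completion_ids` and the `fake_output` payload are hypothetical, and the actual pre-fix code path in the trainer may differ. The point is only that strings leak into the batch when a mapping is treated as a sequence of token-ID lists, which is consistent with the `'str'` complaint raised by `torch.tensor` above.

```python
from typing import Any


def extract_completion_ids(output: Any) -> list[list[int]]:
    """Return completion token IDs from a generate() result.

    Assumption: the result is either a mapping with a "completion_ids"
    entry or already a list of token-ID lists.
    """
    if isinstance(output, dict):
        completion_ids = output["completion_ids"]
    else:
        completion_ids = output
    # Fail early with a readable message instead of an opaque tensor error.
    for ids in completion_ids:
        if not isinstance(ids, list) or not all(isinstance(t, int) for t in ids):
            raise TypeError(f"expected a list of token IDs, got {ids!r}")
    return completion_ids


# Hypothetical payload standing in for a server response.
fake_output = {"prompt_ids": [[1, 2, 3]], "completion_ids": [[4, 5, 6]]}

# Iterating over the mapping itself yields its keys, i.e. strings.
assert list(fake_output) == ["prompt_ids", "completion_ids"]
# Reading the intended key yields integer token IDs.
assert extract_completion_ids(fake_output) == [[4, 5, 6]]
```

Validating the element types before padding and tensor conversion is a design choice of this sketch; it moves the failure closer to the place where the return value is parsed.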

Issue 3 occurs because the vLLM communicator was not cleaned up between test cases when several cases run in vLLM server mode. VLLMClient.init_communicator() registers its cleanup via atexit, which runs only at interpreter exit, not between individual pytest test cases.
Without this PR, the first config_name case passes, but the second fails during initialization with the following error:

(EngineCore_DP0 pid=237629) ERROR 11-07 17:40:03 [core.py:780] Invocation of collective_rpc method failed
(EngineCore_DP0 pid=237629) ERROR 11-07 17:40:03 [core.py:780] Traceback (most recent call last):
(EngineCore_DP0 pid=237629) ERROR 11-07 17:40:03 [core.py:780]   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 777, in _handle_client_request
(EngineCore_DP0 pid=237629) ERROR 11-07 17:40:03 [core.py:780]     result = method(*self._convert_msgspec_args(method, args))
(EngineCore_DP0 pid=237629) ERROR 11-07 17:40:03 [core.py:780]   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 416, in collective_rpc
(EngineCore_DP0 pid=237629) ERROR 11-07 17:40:03 [core.py:780]     return self.model_executor.collective_rpc(method, timeout, args,
(EngineCore_DP0 pid=237629) ERROR 11-07 17:40:03 [core.py:780]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/uniproc_executor.py", line 83, in collective_rpc
(EngineCore_DP0 pid=237629) ERROR 11-07 17:40:03 [core.py:780]     return [run_method(self.driver_worker, method, args, kwargs)]
(EngineCore_DP0 pid=237629) ERROR 11-07 17:40:03 [core.py:780]   File "/usr/local/lib/python3.10/dist-packages/vllm/utils/__init__.py", line 3122, in run_method
(EngineCore_DP0 pid=237629) ERROR 11-07 17:40:03 [core.py:780]     return func(*args, **kwargs)
(EngineCore_DP0 pid=237629) ERROR 11-07 17:40:03 [core.py:780]   File "/workspace/tests/test_yk/trl/trl/scripts/vllm_serve.py", line 110, in init_communicator
(EngineCore_DP0 pid=237629) ERROR 11-07 17:40:03 [core.py:780]     raise RuntimeError("Weight update group already initialized. Call close_communicator first.")
(EngineCore_DP0 pid=237629) ERROR 11-07 17:40:03 [core.py:780] RuntimeError: Weight update group already initialized. Call close_communicator first.
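The lifecycle problem in Issue 3 can be reproduced without vLLM. The sketch below uses two hypothetical stand-ins, `FakeServer` and `FakeClient`, that mimic only the behavior described above: the server rejects a second `init_communicator` while a group is still open, and the client registers its cleanup with `atexit`. It is an illustration under those assumptions, not the test code from this PR.

```python
import atexit


class FakeServer:
    """Stand-in for the server side: allows one weight update group at a time."""

    def __init__(self):
        self.group_initialized = False

    def init_communicator(self):
        if self.group_initialized:
            raise RuntimeError(
                "Weight update group already initialized. Call close_communicator first."
            )
        self.group_initialized = True

    def close_communicator(self):
        self.group_initialized = False


class FakeClient:
    """Stand-in for the client side: cleanup is registered via atexit only."""

    def __init__(self, server):
        self.server = server

    def init_communicator(self):
        self.server.init_communicator()
        # Runs at interpreter exit, so not between test cases in one process.
        atexit.register(self.close_communicator)

    def close_communicator(self):
        self.server.close_communicator()


def run_case(server, explicit_cleanup):
    """Simulate one test case that talks to a long-lived server."""
    client = FakeClient(server)
    client.init_communicator()
    try:
        return "ok"
    finally:
        if explicit_cleanup:
            client.close_communicator()


# With explicit cleanup, consecutive cases can share the same server.
server = FakeServer()
assert run_case(server, explicit_cleanup=True) == "ok"
assert run_case(server, explicit_cleanup=True) == "ok"
```

In a pytest suite, the same explicit call could live in a `finally` block or in a fixture's teardown, so that each case releases the communicator before the next one initializes it.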

YangKai0616 · 2025-11-07T10:04:06Z

Hi @qgallouedec , please help review. Thanks!

YangKai0616 · 2025-11-11T05:49:06Z

@qgallouedec, if there is anything I need to do, please feel free to contact me.

YangKai0616 · 2025-11-12T01:54:31Z

@kashif, could you help review? Thanks!

qgallouedec · 2025-11-12T23:35:37Z

Thanks for the PR @YangKai0616. We will probably need some time to review this, as bandwidth is limited on our side and Online DPO isn't a top-priority trainer. Please keep this PR open in the meantime. cc @kashif in case you have some insights.

trl/extras/vllm_client.py

kashif · 2025-11-13T11:10:26Z

thanks @YangKai0616 reviewing it now

trl/trainer/online_dpo_trainer.py

HuggingFaceDocBuilderDev · 2025-11-13T15:21:42Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

kashif · 2025-11-13T16:42:10Z

CI is failing for another reason

commit 52ed4df Author: Quentin Gallouédec <gallouedec.quentin@gmail.com> Date: Thu Nov 20 21:41:23 2025 +0000 Fix style OpenEnv example commit a263946 Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> Date: Thu Nov 20 14:44:15 2025 +0100 Update OpenEnv guide with latest details (#4552) Co-authored-by: burtenshaw <ben.burtenshaw@gmail.com> commit 1a9ff52 Author: Kashif Rasul <kashif.rasul@gmail.com> Date: Wed Nov 19 15:34:25 2025 +0100 [OpenEnv] browsergym example script (#4539) Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> commit 6cbcd94 Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> Date: Wed Nov 19 14:39:44 2025 +0100 Update OpenEnv example scripts (#4547) commit 8510589 Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> Date: Wed Nov 19 14:39:20 2025 +0100 Add OpenEnv Script examples to docs (#4533) commit e622196 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Mon Nov 17 03:12:30 2025 -0700 [Doc] Drop dummy reward and dataset for DeepMath-103K and accuracy reward (#4524) commit 1b1242c Author: Kashif Rasul <kashif.rasul@gmail.com> Date: Fri Nov 14 20:51:41 2025 +0100 [OpenEnv] add vllm colocate mode to openenv scripts (#4510) Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit f39d18a Author: Fabio Milentiansen Sim <sim.fabio.fms@gmail.com> Date: Fri Nov 14 23:39:02 2025 +0700 fix(GOLDTrainer): Resolve incorrect attribute access and VLLMClient.generate() output type (#4526) commit d45eaab Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> Date: Fri Nov 14 12:12:09 2025 +0100 Add vLLM quantization option for colocate (#4496) Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com> commit a91d4b3 Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> Date: Fri Nov 14 02:19:08 2025 +0100 Prevent upcasting norm layers in `prepare_model_for_kbit_training` (#4457) 
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> commit 121318e Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Thu Nov 13 17:13:16 2025 -0800 docs: Extend CLI basic usage examples to all supported CLIs (#4425) Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit 7918320 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Thu Nov 13 13:20:52 2025 -0700 Remove test trainer args (#4517) commit 102dc41 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Thu Nov 13 12:36:43 2025 -0700 Rename `flash-attn` to `flash-attn2` (#4514) Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> commit 5de62b0 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Thu Nov 13 12:05:48 2025 -0700 Add step time metric to GRPO Trainer for performance tracking (#4516) Co-authored-by: lewtun <lewis.c.tunstall@gmail.com> commit f1e6377 Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Thu Nov 13 11:01:19 2025 -0800 Move PPOTrainer to trl.experimental.ppo (#4482) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit 01f497e Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Thu Nov 13 10:14:58 2025 -0800 Move NashMDTrainer to experimental module (#4477) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit b6c838a Author: Quentin Gallouédec <gallouedec.quentin@gmail.com> Date: Thu Nov 13 16:53:26 2025 +0000 `aws-general-8-plus` runner for Docker build commit ed5c7bb Author: 
YangKai0616 <kai.yang@intel.com> Date: Fri Nov 14 00:42:48 2025 +0800 [Bug Fix] OnlineDPOTrainer with vLLM Server Mode (#4500) commit ded9bc6 Author: lewtun <lewis.c.tunstall@gmail.com> Date: Thu Nov 13 17:33:59 2025 +0100 Fix Docker images for Liger (#4522) commit fd04760 Author: Pramodith Ballapuram <16939722+pramodith@users.noreply.github.com> Date: Thu Nov 13 11:31:10 2025 +0000 Paper Index: Change `num_completions` to `num_generations` (#4515) commit b7918c0 Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Wed Nov 12 20:35:44 2025 -0800 Move GKDTrainer to experimental module (#4474) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit 07b5011 Author: Tamoghno Kandar <55907205+tamoghnokandar@users.noreply.github.com> Date: Wed Nov 12 20:07:33 2025 -0800 Replace flash attention2 with kernels-community/flash-attn2 (#4426) Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> commit 7a57fd4 Author: Yuxian Gu <guyx21@mails.tsinghua.edu.cn> Date: Thu Nov 13 11:16:20 2025 +0800 MiniLLM: Fix arguments in config & add to documentation index (#4518) commit a145eaf Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Wed Nov 12 16:35:46 2025 -0800 refactor: Move CPOTrainer to experimental module (#4470) commit d2dc717 Author: Taha Yassine <40228615+taha-yassine@users.noreply.github.com> Date: Thu Nov 13 00:56:47 2025 +0100 Replace `wandb_log_unique_prompts` with `log_unique_prompts` (#4508) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit 799b39b Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Nov 12 16:21:05 2025 -0700 `device_map` and `dtype` to `"auto"` by 
default (#4509) Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> commit a6a2beb Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Nov 12 09:42:31 2025 -0700 Add temporary workaround for `lr_scheduler_kwargs` dtype issue in Transformers 4.57.0 (#4513) commit 346701a Author: lewtun <lewis.c.tunstall@gmail.com> Date: Wed Nov 12 17:42:18 2025 +0100 Replace accelerate logging with stdlib in CLI (#4512) commit 4db63af Author: Quentin Gallouédec <gallouedec.quentin@gmail.com> Date: Wed Nov 12 02:19:51 2025 +0000 Fix GRPO unsqueeze advantages commit ecb2811 Author: Yuxian Gu <guyx21@mails.tsinghua.edu.cn> Date: Wed Nov 12 10:17:22 2025 +0800 Add MiniLLM Trainer (#4504) Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit 89e4688 Author: Taha Yassine <40228615+taha-yassine@users.noreply.github.com> Date: Tue Nov 11 20:36:23 2025 +0100 Add support for images inside tables with Trackio completions logging (#4505) commit 2d3279c Author: lewtun <lewis.c.tunstall@gmail.com> Date: Tue Nov 11 19:22:25 2025 +0100 Tweak description for vLLM sleep mode (#4506) Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit 02a3477 Author: Luke Hinds <lukehinds@gmail.com> Date: Mon Nov 10 16:41:51 2025 +0000 Fix link to OpenEnv docs (#4502) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> commit aaed6c1 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Sat Nov 8 08:20:48 2025 -0700 Consistency regarding relative imports (#4498) commit 20760ba Author: burtenshaw <ben.burtenshaw@gmail.com> Date: Fri Nov 7 10:50:50 2025 +0100 [DOCS] update and fix openenv (#4490) Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com> Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> commit 64cfca4 Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Thu Nov 6 22:47:04 2025 -0800 Move judges to 
experimental submodule (#4439) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit 97ca1a2 Author: Pramodith Ballapuram <16939722+pramodith@users.noreply.github.com> Date: Fri Nov 7 00:20:15 2025 +0000 Fix bugs in CISPO conditions (#4499) commit ffb3dd5 Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Thu Nov 6 16:03:00 2025 -0800 docs: Add PEFT subsection to reducing memory usage guide (#4430) Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> commit 43b6541 Author: SolarWindRider <31797478+SolarWindRider@users.noreply.github.com> Date: Fri Nov 7 06:55:34 2025 +0800 Support completion bootstrap for VLM in GRPO/RLOO (#4452) Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit 642b721 Author: Pramodith Ballapuram <16939722+pramodith@users.noreply.github.com> Date: Thu Nov 6 22:33:00 2025 +0000 ScaleRL: Add CISPO Loss (#4495) commit 32e9c9f Author: Ishita Bhattacharyya <139248026+ishitab02@users.noreply.github.com> Date: Fri Nov 7 03:37:43 2025 +0530 ⛴️ Add kernels to Docker images (#4445) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit 1bcfc50 Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Thu Nov 6 13:40:12 2025 -0800 Move XPOTrainer to trl.experimental.xpo (#4485) Co-authored-by: Invidia19 <54266187+Invidia19@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit 37942bc Author: Pramodith Ballapuram <16939722+pramodith@users.noreply.github.com> Date: Thu Nov 6 21:32:03 2025 +0000 Buffer samples based on group level stds. 
(#4492) commit 66cd02a Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Thu Nov 6 20:58:25 2025 +0100 Add tiny model Qwen3VLForConditionalGeneration to CI (#4494) commit 32febb4 Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> Date: Thu Nov 6 18:21:56 2025 +0100 Add LFM2 to SFT notebook examples (#4455)

commit 4cb1a25 Author: Kashif Rasul <kashif.rasul@gmail.com> Date: Sat Nov 22 23:31:29 2025 +0100 [SFT] Log mean token accuracy from Liger kernel (#4302) Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit 468b9d4 Author: Susant <acharysusant@gmail.com> Date: Sun Nov 23 03:40:32 2025 +0530 docs: add KTO (2402.01306) to Paper Index + link ref to KTOTrainer (#4440) Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> commit 9bc6206 Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Fri Nov 21 17:34:50 2025 -0800 Move PRMTrainer to trl.experimental.prm (#4483) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit f7ac974 Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> Date: Fri Nov 21 16:01:04 2025 +0100 Update OpenEnv guide with new notebook (#4555) commit c0de042 Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> Date: Fri Nov 21 15:40:25 2025 +0100 Add GRPO Wordle OpenEnv Colab (#4542) commit 9f8ef40 Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Thu Nov 20 22:36:31 2025 -0800 [ORPO] Move ORPOTrainer to experimental (#4480) commit 3bb5d76 Author: Jen Wei <45276133+JenWei0312@users.noreply.github.com> Date: Thu Nov 20 18:53:10 2025 -0700 fix+docs: `device_map=None` for DeepSpeed and add ZeRO paper (1910.02054) to Paper Index (#4551) commit 375b3eb Author: Jonny Li <jonny_li@live.ca> Date: Thu Nov 20 19:42:45 2025 -0500 Add target_parameters to LoraConfig (#4536) commit 237900d Author: Kristian Schwethelm <47533587+kschwethelm@users.noreply.github.com> Date: Thu Nov 20 23:03:20 2025 +0100 Fix bug with VLM processors in prompt-completion completion text-only training (#4553) Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>

commit 07b4a84 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Mon Dec 1 12:55:24 2025 -0700 Silence experimental warnings when imported in the stable (#4606) commit c55ef4b Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Mon Dec 1 12:40:42 2025 -0700 Update How-to guides (#4604) commit c686d7d Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Mon Dec 1 20:34:31 2025 +0100 Raise FutureWarning for classes moved to experimental (#4605) commit c7d172b Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Mon Dec 1 01:47:22 2025 -0800 docs: Expand speeding up training guide with acceleration methods (#4428) Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> commit f1dfef0 Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Mon Dec 1 01:39:08 2025 -0800 docs: Expand training customization examples (#4427) Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> commit eb76389 Author: LeonEricsson <70749762+LeonEricsson@users.noreply.github.com> Date: Sun Nov 30 16:45:21 2025 +0100 [GRPO] Sequence-level TIS & MIS (#4530) commit 0726977 Author: xuanduy04 <65279552+xuanduy04@users.noreply.github.com> Date: Fri Nov 28 23:56:22 2025 +0700 docs: Add Beyond the 80/20 Rule (2506.01939) to Paper Index (#4580) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> commit 9731d08 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Fri Nov 28 17:43:38 2025 +0100 Revert "Hotfix CI with dev dependencies: xfail test_prepare_inputs_for_generation" (#4587) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> commit 84a0bbc Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Fri Nov 28 16:13:56 2025 +0100 Fix 'generation_config' AttributeError 
(#4596) commit f67c3f2 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Fri Nov 28 15:46:02 2025 +0100 Remove module-level imports of extra deps in experimental.judges (#4598) commit cb5fdf9 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Thu Nov 27 11:08:26 2025 +0100 Add missing require_bitsandbytes marker to CI tests (#4586) commit 4a3b584 Author: juejuezi <juejuezi.git@foxmail.com> Date: Thu Nov 27 00:11:56 2025 +0800 fix: use shift_labels for metrics when using CP or SP (#4579) Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com> commit d2e4315 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Wed Nov 26 15:40:15 2025 +0100 Revert hotfix Fall back to config.text_config._name_or_path (#4581) commit 357e331 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Nov 26 04:55:46 2025 -0700 Move tests for GSPOTokenTrainer to experimental (#4572) commit a59f2cf Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Nov 26 04:50:44 2025 -0700 Move `WinRateCallback` to experimental (#4558) Co-authored-by: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com> commit cf431db Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Nov 26 04:11:04 2025 -0700 Fix PPO example (#4556) commit cac9f1d Author: Pramodith Ballapuram <16939722+pramodith@users.noreply.github.com> Date: Tue Nov 25 21:27:58 2025 +0000 Fix Replay Buffer docs. 
(#4574) commit 547d924 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Tue Nov 25 09:34:22 2025 -0700 Add `shuffle_dataset` option to `SFTTrainer` (#4564) commit b01f8ca Author: iliasmerigh <91261122+iliasmerigh@users.noreply.github.com> Date: Tue Nov 25 17:33:14 2025 +0100 Fix typo in GRPO description in README (#4573) commit 7856d3b Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Tue Nov 25 09:32:39 2025 -0700 Fix vLLM sleep mode: add collective RPC call to reload weights in vLLM wake-up process (#4571) Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> Co-authored-by: lewtun <lewis.c.tunstall@gmail.com> commit 64d089e Author: lewtun <lewis.c.tunstall@gmail.com> Date: Tue Nov 25 14:39:40 2025 +0100 Reasoning reward (#4563) Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit 3b7d0e4 Author: Quentin Gallouédec <gallouedec.quentin@gmail.com> Date: Tue Nov 25 04:48:06 2025 +0000 Remove Online DPO from stable trainers section in documentation commit 6f3a452 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Mon Nov 24 08:11:49 2025 -0700 Reorder documentation TOC to surface key trainer sections (#4565) commit 46af266 Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Mon Nov 24 02:39:25 2025 -0800 docs: Rewrite PEFT integration guide with comprehensive examples (#4421) Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> commit db4f6e5 Author: mingxuetian <108911581+mingxuetian@users.noreply.github.com> Date: Mon Nov 24 09:51:42 2025 +0800 Add `num_generations_eval` parameter for efficient evaluation (#4458) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit 07f3c95 Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Sun Nov 23 17:33:36 2025 
-0800 Move OnlineDPOTrainer to experimental module (#4473) Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Gallouédec <gallouedec.quentin@gmail.com> Date: Thu Nov 13 16:53:26 2025 +0000 `aws-general-8-plus` runner for Docker build commit ed5c7bb Author: YangKai0616 <kai.yang@intel.com> Date: Fri Nov 14 00:42:48 2025 +0800 [Bug Fix] OnlineDPOTrainer with vLLM Server Mode (#4500) commit ded9bc6 Author: lewtun <lewis.c.tunstall@gmail.com> Date: Thu Nov 13 17:33:59 2025 +0100 Fix Docker images for Liger (#4522) commit fd04760 Author: Pramodith Ballapuram <16939722+pramodith@users.noreply.github.com> Date: Thu Nov 13 11:31:10 2025 +0000 Paper Index: Change `num_completions` to `num_generations` (#4515) commit b7918c0 Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Wed Nov 12 20:35:44 2025 -0800 Move GKDTrainer to experimental module (#4474) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit 07b5011 Author: Tamoghno Kandar <55907205+tamoghnokandar@users.noreply.github.com> Date: Wed Nov 12 20:07:33 2025 -0800 Replace flash attention2 with kernels-community/flash-attn2 (#4426) Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> commit 7a57fd4 Author: Yuxian Gu <guyx21@mails.tsinghua.edu.cn> Date: Thu Nov 13 11:16:20 2025 +0800 MiniLLM: Fix arguments in config & add to documentation index (#4518) commit a145eaf Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Wed Nov 12 16:35:46 2025 -0800 refactor: Move CPOTrainer to experimental module (#4470) commit d2dc717 Author: Taha Yassine <40228615+taha-yassine@users.noreply.github.com> Date: Thu Nov 13 00:56:47 2025 +0100 Replace `wandb_log_unique_prompts` with `log_unique_prompts` (#4508) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit 799b39b 
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Nov 12 16:21:05 2025 -0700 `device_map` and `dtype` to `"auto"` by default (#4509) Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> commit a6a2beb Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Nov 12 09:42:31 2025 -0700 Add temporary workaround for `lr_scheduler_kwargs` dtype issue in Transformers 4.57.0 (#4513) commit 346701a Author: lewtun <lewis.c.tunstall@gmail.com> Date: Wed Nov 12 17:42:18 2025 +0100 Replace accelerate logging with stdlib in CLI (#4512) commit 4db63af Author: Quentin Gallouédec <gallouedec.quentin@gmail.com> Date: Wed Nov 12 02:19:51 2025 +0000 Fix GRPO unsqueeze advantages commit ecb2811 Author: Yuxian Gu <guyx21@mails.tsinghua.edu.cn> Date: Wed Nov 12 10:17:22 2025 +0800 Add MiniLLM Trainer (#4504) Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit 89e4688 Author: Taha Yassine <40228615+taha-yassine@users.noreply.github.com> Date: Tue Nov 11 20:36:23 2025 +0100 Add support for images inside tables with Trackio completions logging (#4505) commit 2d3279c Author: lewtun <lewis.c.tunstall@gmail.com> Date: Tue Nov 11 19:22:25 2025 +0100 Tweak description for vLLM sleep mode (#4506) Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit 02a3477 Author: Luke Hinds <lukehinds@gmail.com> Date: Mon Nov 10 16:41:51 2025 +0000 Fix link to OpenEnv docs (#4502) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> commit aaed6c1 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Sat Nov 8 08:20:48 2025 -0700 Consistency regarding relative imports (#4498) commit 20760ba Author: burtenshaw <ben.burtenshaw@gmail.com> Date: Fri Nov 7 10:50:50 2025 +0100 [DOCS] update and fix openenv (#4490) Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com> Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> commit 
64cfca4 Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Thu Nov 6 22:47:04 2025 -0800 Move judges to experimental submodule (#4439) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit 97ca1a2 Author: Pramodith Ballapuram <16939722+pramodith@users.noreply.github.com> Date: Fri Nov 7 00:20:15 2025 +0000 Fix bugs in CISPO conditions (#4499) commit ffb3dd5 Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Thu Nov 6 16:03:00 2025 -0800 docs: Add PEFT subsection to reducing memory usage guide (#4430) Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> commit 43b6541 Author: SolarWindRider <31797478+SolarWindRider@users.noreply.github.com> Date: Fri Nov 7 06:55:34 2025 +0800 Support completion bootstrap for VLM in GRPO/RLOO (#4452) Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit 642b721 Author: Pramodith Ballapuram <16939722+pramodith@users.noreply.github.com> Date: Thu Nov 6 22:33:00 2025 +0000 ScaleRL: Add CISPO Loss (#4495) commit 32e9c9f Author: Ishita Bhattacharyya <139248026+ishitab02@users.noreply.github.com> Date: Fri Nov 7 03:37:43 2025 +0530 ⛴️ Add kernels to Docker images (#4445) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit 1bcfc50 Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Thu Nov 6 13:40:12 2025 -0800 Move XPOTrainer to trl.experimental.xpo (#4485) Co-authored-by: Invidia19 <54266187+Invidia19@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit 37942bc Author: 
Pramodith Ballapuram <16939722+pramodith@users.noreply.github.com> Date: Thu Nov 6 21:32:03 2025 +0000 Buffer samples based on group level stds. (#4492) commit 66cd02a Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Thu Nov 6 20:58:25 2025 +0100 Add tiny model Qwen3VLForConditionalGeneration to CI (#4494) commit 32febb4 Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> Date: Thu Nov 6 18:21:56 2025 +0100 Add LFM2 to SFT notebook examples (#4455)

YangKai0616 added 3 commits November 7, 2025 07:23

Fix bugs

ff1acf2

Update

d51b572

Merge branch 'main' into main

9df08aa

Merge branch 'main' into main

a930cdb

qgallouedec reviewed Nov 12, 2025

trl/extras/vllm_client.py (outdated, resolved)

kashif self-assigned this Nov 13, 2025

kashif reviewed Nov 13, 2025

trl/trainer/online_dpo_trainer.py (outdated, resolved)

kashif reviewed Nov 13, 2025

trl/extras/vllm_client.py (outdated, resolved)

kashif reviewed Nov 13, 2025

trl/trainer/online_dpo_trainer.py (resolved)

YangKai0616 added 2 commits November 13, 2025 14:45

Update

b12e1cf

Merge branch 'main' into main

5fa7118

kashif approved these changes Nov 13, 2025

make precommit

8cce298

kashif merged commit ed5c7bb into huggingface:main Nov 13, 2025
2 of 9 checks passed


4 participants
