Cameron R. Wolfe, Ph.D.’s Post

Is there truly a gap in performance between online and offline RL training for LLMs? Here’s what the research says…

TL;DR: There is a clear performance gap between online and offline RL algorithms, especially in large-scale LLM training. However, this gap can be minimized by incorporating on-policy samples into offline RL algorithms, creating an iterative or semi-online approach.

RL for LLMs: A few variants of RL (i.e., RLHF and RLVR) are used for training LLMs. These techniques differ in how they derive a reward, but both operate by generating completions for a set of prompts, computing rewards for these completions, and using those rewards to update the LLM’s parameters via an RL optimizer.

RL optimizers: We usually use an (online) policy-gradient RL optimizer (e.g., REINFORCE, PPO, or GRPO) to derive the policy update for an LLM (a minimal policy-gradient sketch follows this post). PPO-based RL has been the de facto choice in the past, but PPO:
- Is memory intensive, since it keeps four copies of the model (policy, reference, reward, and critic / value models).
- Tends to be unstable and requires a lot of domain knowledge to set up correctly.

On top of this, online RL is difficult because on-policy completions must be generated during training. Instead, we could use a simpler offline algorithm like DPO, or a semi-online algorithm like iterative DPO or rejection sampling.

(1) OOD data: In [1], the authors compare DPO with online RL algorithms and find a clear bias in DPO: when DPO encounters out-of-distribution (OOD) or unseen responses, the resulting LLM becomes biased toward these responses and behaves unpredictably. In contrast, PPO handles OOD data more robustly and learns a correct policy. We can mitigate these issues with DPO by performing iterative / semi-online DPO (see the loop sketched below), where we:
1. Collect new, on-policy completions from the current LLM.
2. Score them with a reward model to create preference pairs.
3. Run DPO on the new data.
4. Repeat.

(2) On-policy + negative gradient: The authors of [2] identify two key properties of alignment algorithms that lead to competitive performance:
1. On-policy sampling.
2. Negative gradients.

Online RL algorithms use on-policy sampling by construction, and we can also bring on-policy samples into offline or RL-free algorithms by adding them to the training data. Negative gradients refer to pushing down the probability of “bad” completions. This happens naturally in PPO-based RL and DPO (see the DPO loss sketched below), but not in maximum-likelihood algorithms like SFT.

(3) DPO vs. PPO at scale: The authors of [3] perform an extensive comparison between DPO and PPO-based RLHF with large-scale LLMs (i.e., 70B parameters), whereas prior work mostly focuses on smaller LLMs in experimental settings. At this scale, there is a clear gap between the performance of DPO and PPO. However, the authors find that the quality of the training data still has a bigger impact on LLM performance than the choice of algorithm (i.e., online vs. offline).
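To make the RL-optimizer step concrete, here is a minimal, hedged sketch of a REINFORCE-style policy-gradient update in PyTorch. It is a toy illustration of “use rewards to update the LLM’s parameters,” not PPO or GRPO, and the assumed inputs (per-completion summed token log-probs and scalar rewards) are my own simplification.

```python
import torch

def policy_gradient_step(policy_logps: torch.Tensor,
                         rewards: torch.Tensor,
                         optimizer: torch.optim.Optimizer) -> float:
    """One REINFORCE-style update: push up completions with above-average
    reward, push down the rest. policy_logps are per-completion log-probs
    (token log-probs summed) computed with gradients enabled."""
    # Mean-centering the rewards is a simple baseline; GRPO does something
    # similar per group of completions sampled for the same prompt.
    advantages = (rewards - rewards.mean()).detach()
    # Maximizing E[advantage * log-prob] == minimizing its negative.
    loss = -(advantages * policy_logps).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```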
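The iterative / semi-online DPO recipe from the post, written as a hedged Python sketch. The `generate`, `score`, and `dpo_update` callables are hypothetical placeholders for your sampling, reward-model, and DPO-training code, not any specific library’s API.

```python
from typing import Callable, List, Tuple

def iterative_dpo(
    generate: Callable[[str], str],        # sample one completion from the *current* policy
    score: Callable[[str, str], float],    # reward model: (prompt, completion) -> scalar
    dpo_update: Callable[[List[Tuple[str, str, str]]], None],  # one DPO pass over (prompt, chosen, rejected)
    prompts: List[str],
    num_rounds: int = 3,
    samples_per_prompt: int = 4,
) -> None:
    """Alternate on-policy data collection with offline DPO updates."""
    for _ in range(num_rounds):
        pairs: List[Tuple[str, str, str]] = []
        for prompt in prompts:
            # 1. Collect new, on-policy completions from the current LLM.
            completions = [generate(prompt) for _ in range(samples_per_prompt)]
            # 2. Score them with a reward model to create a preference pair
            #    (highest-scoring completion = chosen, lowest = rejected).
            ranked = sorted(completions, key=lambda c: score(prompt, c))
            pairs.append((prompt, ranked[-1], ranked[0]))
        # 3. Run DPO on the newly collected, on-policy pairs. The update
        #    changes the policy that `generate` samples from next round.
        dpo_update(pairs)
        # 4. Repeat.
```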
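And a minimal sketch of the DPO loss itself, showing where the “negative gradient” comes from: the log-sigmoid of the reward margin pushes up the chosen completion’s likelihood while pushing down the rejected one. Inputs are assumed to be per-sequence log-probabilities (token log-probs summed) under the policy and a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO objective: -log sigmoid(beta * (chosen margin - rejected margin))."""
    # Implicit rewards are log-ratios against the frozen reference model.
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # The gradient raises p(chosen) and lowers p(rejected): the negative gradient.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```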

For full details, check out my writeup on the online-offline RL performance gap: https://cameronrwolfe.substack.com/p/online-rl

Here are citations for the above papers:
[1] Xu, Shusheng, et al. "Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study." arXiv preprint arXiv:2404.10719 (2024).
[2] Tajwar, Fahim, et al. "Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data." arXiv preprint arXiv:2404.14367 (2024).
[3] Ivison, Hamish, et al. "Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback." Advances in Neural Information Processing Systems 37 (2024): 36602-36633.

Troy Schultz

Building Uncensored.AI, Enthusiast, GenAI researcher, AI tools dev, chatbots. Innovator of Mermaid RAG LLMs utilizing Knowledge Graphs, Input to flow maps, system diagrams, storyboards, consequence outcome prediction.

3w

Great post, Cameron R. Wolfe, Ph.D. You provided a really clear breakdown of the RLHF vs. DPO/PPO tradeoffs. Since you’re digging into this space, would you be open to reviewing Kahneman–Tversky Optimization (KTO) in relation to other human-aware loss optimization methods? I’m especially interested in how its advantages, not needing a reward model and using much easier-to-collect binary feedback, compare to the online/offline RL gap you outlined. I’d love your perspective, since this is the training regime I’m currently using for my new BlackSheep models: alignment preference tuning after Abliteration, aiming for the top of the UGI benchmarks for willingness and uncensored general knowledge, since it doesn’t require creating optimal chosen and rejected samples like DPO does.

Shivam Bharadwaj

AI Engineering | Neuroscience | Ophthalmic AI | Technical Humour & Insights

3w

Great share. The breakdown makes it clear how PPO and DPO behave differently at scale. What stands out is the point that even with stronger algorithms, the quality of training data still outweighs the algorithm choice. Curious to see how semi-online DPO evolves, especially if it can balance the robustness of PPO with the efficiency of offline methods.

Michael Spencer

Founder, Newsletter operator, A.I. Writer, Emerging tech analyst.

3w

This is a fascinating topic.
