Addressing Over-Refusal in LLMs with Competing Rewards

Taeyoun Kim^1,2 Aviral Kumar¹
¹Carnegie Mellon University, ²SynthLabs
{taeyoun3, aviralku}@cs.cmu.edu

Abstract

Safety training on language models often induces over-refusal: improved safety on harmful prompts at the cost of increased refusal on harmless ones. Though this trade-off can be mitigated by training models with reinforcement learning (RL) to reason before answering, it does not remove the underlying problem that reasoning can often be a “rubber stamp” for a predetermined response. In this paper, we address the safety-refusal trade-off by rethinking how models are trained to reason about safety. Our key insight is that unsafe reasoning can itself serve as a useful exploratory signal. Rather than preemptively blocking harmful thoughts, we encourage the model to sufficiently explore unsafe reasoning but produce a safe response. The harmful exploration improves the model’s ability to distinguish harmful from harmless prompts by resolving ambiguity, allowing it to remain safe while complying only when appropriate. We cast this as an adversarial optimization problem in which a reasoning player explores strategies for producing an unsafe response and an answer player ensures that the final output is safe. We train a single model with dense rewards to play both roles within one chain-of-thought, across different segments. To achieve this, we find that process rewards are crucial for stable optimization of competing objectives. Our resulting model SEAR deliberately engages in harmful reasoning as exploration while reliably flipping back to a safe answer. We demonstrate that this behavior helps mitigate over-refusal and defend against attacks that directly manipulate the reasoning to be harmful.

1 Introduction

Post-training large language models (LLMs) for safety is essential for reliable deployment since it reduces the risk of harmful outputs on unsafe or adversarial prompts (Kim et al., 2025a; Guan et al., 2025; Wang et al., 2025; Zhang et al., 2024; Yang et al., 2025b; Zhu et al., 2025; Zhang et al., 2025e; Feng et al., 2025; Mou et al., 2025; Jiang et al., 2025; Zhang et al., 2025d; Si et al., 2025; Zhao et al., 2025). However, existing safety-training methods often induce over-refusal, making models less helpful on harmless prompts as they become safer against jailbreaks and malicious requests (Cui et al., 2024). How can we improve the safety of language models without unnecessarily refusing benign requests?

A growing body of both open-source (Kim et al., 2025a; b; Zhang et al., 2025b) and proprietary (Guan et al., 2025) work has proposed training models with reinforcement learning (RL) to reason explicitly about the malicious intent of an input prompt. The hope is that, rather than automatically refusing, a model can deliberate over the prompt, distinguish harmful from harmless intent, and respond more selectively. In practice, however, these methods only partially reduce over-refusal. As such, our analysis in §3 suggests that, in current open-source safety-trained models (Yang et al., 2025a; Olmo et al., 2025; Agarwal et al., 2025), reasoning often acts as a “rubber stamp” for the final decision rather than a mechanism for resolving ambiguity or addressing safety in a genuine manner. These models tend to follow one of two brittle behaviors: they either avoid unsafe interpretations altogether and default to refusal, or they explore unsafe possibilities but fail to recover to a safe final answer (Zhang et al., 2025c; d; Zhou et al., 2025; Jiang et al., 2025).

To address this limitation, safety reasoning must support ambiguity resolution, not merely predict whether to refuse. Many ambiguous prompts admit both harmful and benign interpretations, and a model that suppresses all safety-relevant reasoning may refuse even when a safe and useful response exists. Conversely, a model that explores such interpretations without reliable control may become vulnerable to jailbreaks or unsafe compliance. This limitation is especially striking in contrast to math and coding domains, where reasoning helps because it enables exploration and recovery: models can pursue an incorrect line of thought, recognize the mistake, and then self-correct their generation (Setlur et al., 2025). We argue that safety reasoning requires an analogous capability. A model should be able to examine safety-relevant interpretations of a prompt, including potentially unsafe ones, while still resolving the final answer safely. For clearly harmful prompts, this means identifying the unsafe interpretation and refusing or redirecting appropriately. For ambiguous prompts, it means finding benign interpretations that allow the model to remain helpful rather than defaulting to refusal.

Refer to caption — Figure 1: Reward Comparison. Comparison of outcome-based, dense outcome-based, and process-based rewards. Outcome-based rewards use the final answer reward ( $r_{s}$ ) across all generated tokens. Dense outcome-based rewards combine the reward from the answer ( $r_{s}$ ) and reasoning ( $r_{h}$ ) into all tokens. Process-based rewards, which SEAR uses, separate the answer ( $r_{s}$ ) and reasoning ( $r_{h}$ ) rewards into its corresponding tokens.

To instill this behavior into language models, we propose an RL approach for safety training based on competing rewards in §4. Standard outcome-reward RL (Kim et al., 2025a) typically assigns a single reward based on the correctness—safety in our setting—of the entire trajectory, using the same signal for both the reasoning process and the final response. In contrast, we assign separate rewards to the reasoning and answer segments of the trace, which we refer to as dense process rewards (Figure 1). Crucially, these rewards are competing: the model is rewarded for exploring increasingly harmful reasoning strategies while separately being rewarded for producing a safe final answer conditioned on those harmful segments. This encourages better exploration through isolated reasoning while still allowing the reasoning to inform the final response in one consecutive chain-of-thought (Wei et al., 2022), rather than constraining the reasoning based on the eventual answer.

Through careful experiments, we identify two core design principles for training with competing rewards. First, in §4.2, we show that a mismatch in optimization between the competing objectives results in unstable behavior. Second, in §4.3, we show that dense outcome-level rewards alone are insufficient for learning these opposing objectives within a single model. Instead, in §4.4, we show that segment-level rewards on sampled rollouts, which we refer to as process-based rewards, are needed for proper credit assignment. These rewards teach the model both to explore safety-relevant interpretations and to recover to a safe final response within a single reasoning trajectory. Successfully training with competing rewards yields a model that reasons differently from standard safety-trained baselines: rather than preemptively shutting down risky chains of thought, it learns to reason through them while still producing safe answers, leading to less over-refusal.

Using these principles, we train SEAR-1.5B, a model trained for selective safety through competing-reward reasoning. We make several observations in §6. First, SEAR achieves the best safety-refusal trade-off among a range of open models within its model class and size, outperforming the closest approach, TARS (Kim et al., 2025a), on this measurement. Second, despite having only 1.5B parameters, SEAR performs comparably to the much larger TARS-7B and is more robust to adversarial attacks. Third, a model trained to examine unsafe interpretations is more robust against attacks that directly target reasoning, such as pre-filling. This capability is important for defending against stronger attacks that attempt to steer the model’s reasoning toward harmful conclusions. Overall, our results suggest that the safety-refusal trade-off is not only about deciding whether to refuse, but about learning how to resolve ambiguity: models must be able to reason through unsafe possibilities without allowing those possibilities to dictate the final response.

2 Related Work

Over-refusal, reasoning, and RL. Safety-trained models exhibit over-refusal, where refusal behavior generalizes beyond harmful prompts and suppresses compliance on harmless ones. This is typically evaluated using benchmarks of benign but ambiguous prompts that may appear unsafe to the model (Röttger et al., 2023; Cui et al., 2024; Zhang et al., 2025a; Knight et al., 2025). Recent work has sought to mitigate over-refusal by leveraging reasoning, with the goal of helping models distinguish harmful from harmless intent more selectively. Some approaches adapt online RL to safety reasoning through reward design and context distillation (Kim et al., 2025a; Guan et al., 2025), while others use supervised fine-tuning or DPO (Rafailov et al., 2023) to distill curated reasoning traces (Wang et al., 2025; Zhang et al., 2024; Yang et al., 2025b; Zhu et al., 2025; Zhang et al., 2025e; Feng et al., 2025; Mou et al., 2025; Jiang et al., 2025; Zhang et al., 2025d; Si et al., 2025; Zhao et al., 2025; Zhou et al., 2025). These methods show that reasoning can reduce some unnecessary refusals, but they do not explicitly train the model to explore unsafe possibilities and then recover as in our work. In fact, we benchmark the model of Kim et al. (2025a) and find that it suffers from more over-refusal than our approach. Our approach treats harmful reasoning itself as a useful pathway for exploration, and uses process-based rewards to train both exploration and flip-back behavior.

Training with competing objectives. Our work aligns with the literature on competing or adversarial objectives, often through two-player formulations, to improve exploration and robustness in reasoning tasks (Kirchner et al., 2024; Anil et al., 2021). Related ideas have also appeared in safety where models are trained to play attacker-defender roles (Liu et al., 2025b; Paulus et al., 2025; Wen et al., 2026; Tan et al., 2026), or to recover from harmful reasoning prefixes through supervised training (Kim et al., 2025b; Peng et al., 2025). Our approach is closer to the latter approach in that we incentivize reasoning to be unsafe, but differs in a crucial respect: prior work only trains to reactively recover from unintended harmful reasoning, whereas we explicitly train a single model to both actively enter harmful reasoning as a useful exploratory strategy and revert to a safe answer. Furthermore, we utilize dense process-based rewards to learn these opposing behaviors within one model, which we find to be crucial for optimization.

3 Does Reasoning Help (with Exploration) in Safety?

Reasoning helps solve long-form math problems because it supports exploration through test-time compute (Setlur* et al., 2025; Wu et al., 2026; Qu et al., 2025). A model may generate an incorrect intermediate step, recognize the mistake, and recover to a reasoning that leads to a correct answer. This ability to make mistakes without being dictated by them is a useful form of exploration. It allows the model to consider uncertain directions while still returning to a better final answer that is correct.

What is the analogous behavior in safety? Ideally, when faced with a harmful or ambiguous prompt, a model should not default to refusal merely because one possible interpretation is unsafe. Instead, it should be able to reason through multiple possible reasoning traces on a given prompt, identify harmful intent, and produce a safe final response that is also as helpful as possible. Such behavior is especially important on ambiguous prompts (Röttger et al., 2023; Cui et al., 2024; Zhang et al., 2025a; Knight et al., 2025), where avoiding all safety-relevant reasoning can lead to unnecessary refusal, while uncontrolled exploration can lead to unsafe compliance on harmful prompts. We care primarily about the safety and helpfulness of the final answer; potentially harmful reasoning is useful only insofar as it serves as exploration that helps resolve ambiguity and arrive at the correct response.

Do current LLMs exhibit exploration in their reasoning? To test this, we study Qwen3-1.7B, Qwen3-4B (Yang et al., 2025a), OLMO3-7B (Olmo et al., 2025), and GPT-OSS-20B (Agarwal et al., 2025) in Figure 2. We take 1000 harmful prompts from WildJailbreak (Jiang et al., 2024) and measure two quantities in each generation: (1) how often the reasoning becomes harmful, and (2) how often the final answer recovers to being safe. While these prompts are not ambiguous, this evaluation still provides a concrete measurement of how likely a model is to explore unsafe reasoning paths. We score the harmfulness of both the reasoning trace and the final answer using the Moderation API (Markov et al., 2023). Figure 2 shows that most prompts fall into the first bucket for all four models, where reasoning remains largely safe. In other words, current reasoning models rarely explore unsafe reasoning. Moreover, in the cases where reasoning does become harmful, the final answer also tends to become more harmful rather than recovering to safety. This suggests that current reasoning models lack the ability to use unsafe reasoning as controlled exploration: they either avoid unsafe possibilities altogether, or they explore them without recovering.

Notably, many of these models already perform substantial reasoning on the queried prompts, and several were trained or distilled from full-form reasoning models optimized with RL. This makes the result especially striking, since despite having strong reasoning capabilities, these models rarely use reasoning to explore unsafe possibilities in a controlled manner. Instead, they either remain preemptively safe throughout reasoning like a “rubber-stamp”, or become harmful without recovering, both contributing contribute to over-refusal. We provide qualitative examples of both failure modes in §D as well as a concise example of desired flipped exploration behavior below (full example also in §D).

4 Addressing Over-Refusal via RL with Competing Objectives

The results above suggest that standard approaches for safety training do not produce diverse, exploratory reasoning traces when required to think harmfully. We posit that this issue stems from the insufficiency of outcome-based RL because a single scalar reward on the full trajectory lacks the expressive power to distinguish harmful from harmless reasoning steps when the final response is similar. In this section, we instead introduce competing rewards, which separately reward harmful exploration in the reasoning trace and safety in the final answer. This gives the model an explicit training signal for trajectories that first consider unsafe possibilities but then produce a safe final response when needed. §4.1 builds the objective for competing rewards, §4.2 and §4.3 show the instability of training with dense outcome-based rewards, and §4.4 shows the importance of process rewards for stabilizing training.

4.1 Framework of Competing Objectives

We first present an idealized formulation of our approach and then discuss the implementation in the next section. Let $\pi_{\theta}(\tau\mid x)$ denote the LLM policy induced by parameters $\theta$ over a full trace $\tau$ given prompt $x\in\mathcal{D}_{h}$ coming from a dataset of harmful prompts $\mathcal{D}_{h}$ , where the trajectory consists of both a reasoning segment and a final answer segment: $\tau=(z,y)$ , with $z$ denoting the reasoning trace and $y$ the final answer. We consider two reward functions: a reasoning reward $r_{h}(z)$ , which measures how harmful or unsafe the reasoning segment is, and a final-answer reward $r_{s}(y)$ , which measures how safe the final answer is. As we discuss in §4.2, $r_{h}$ and $r_{s}$ are derived from the same underlying safety reward model, with $r_{h}(\cdot)=1-r_{s}(\cdot)$ . We also consider another set of harmless prompts $\mathcal{D}_{i}$ , where the goal is to optimize for some notion of a helpfulness reward $r_{i}$ (e.g., $\mathcal{D}_{i}$ could be a set of math problems with $r_{i}$ measuring correctness of the final answer).

Training goal. Our goal is to train the LLM to optimize these two opposing objectives in different parts of the same rollout on harmful prompts $\mathcal{D}_{h}$ . At a high level, we would like the policy to assign high probability to trajectories in which the reasoning segment attains a high harmfulness reward, while the final answer attains a high safety reward. Optimizing for such opposing rewards in consecutive segments encourages the model to learn a flip-back behavior after sufficient exploratory reasoning. This can help the model reduce over-refusal by outputting a safe answer only when necessary. In addition, we want the LLM to also optimize some notion of helpfulness reward $r_{i}$ on $\mathcal{D}_{i}$ . We choose to optimize the LLM $\pi$ on a mixture of both harmful and harmless prompts following Kim et al. (2025a) to prevent collapse into strict refusals as a safe answer. The combined RL objective is

\displaystyle\max_{\pi\in\mathcal{B}(\pi_{0})}\;\;\mathbb{E}_{x\sim\mathcal{D}_{h},\,(z,y)\sim\pi(\cdot\mid x)}\big[r_{h}(z)+r_{s}(y)\big]+\mathbb{E}_{x^{\prime}\sim\mathcal{D}_{i},\,\tau^{\prime}\sim\pi(\cdot\mid x^{\prime})}\big[r_{i}(\tau^{\prime};x^{\prime})\big],

(1)

where $\pi_{0}$ is the reference base model policy and $\mathcal{B}(\pi_{0})$ denotes a KL-constrained neighborhood around it, implemented in practice via a KL-regularized RL objective. The KL regularizer plays the same role as in standard RL post-training by restricting the search over policies to remain close to the base model but optimizing for both objectives.

Our method has two characteristics. First, it uses competing rewards within a single rollout on harmful prompts. Maximizing $r_{h}$ pushes the reasoning trace toward harmful continuations, while maximizing $r_{s}$ pushes the final answer toward safe behaviors. The policy must therefore discover trajectories within a KL ball around the base model in which harmful reasoning and safe answering co-exist sequentially within a single chain-of-thought. Second, unlike prior attacker-defender approaches (Paulus et al., 2025; Wen et al., 2026; Tan et al., 2026) that optimize different objectives using separate models, we train a single KL-constrained policy jointly on both harmful and harmless prompts. This shared-policy training discourages degenerate solutions in which the model treats harmful reasoning as an arbitrary intermediate that is unrelated to the prompt and simply refuses at the end. At the same time, it may cause optimization on harmful prompts to affect harmless prompts.

In the following sections (§4.2, §4.3, and §4.4), we consider how to instantiate this objective, presenting our line of experiments that show the failure mode of dense outcome-level rewards which motivate the use of process rewards.

4.2 Dense Outcome-Based RL

The most natural starting point is a standard GRPO-based (Shao et al., ) RL pipeline, where we optimize $r_{h}(z)+r_{s}(y)$ for harmful prompts ( $\mathcal{D}_{h}$ ), and $r_{i}$ for harmless prompts ( $\mathcal{D}_{i}$ ), over the entire trajectories as a dense outcome reward (Figure 1).

Experiment. Our goal is to test whether rewarding harmful reasoning together with a safe final answer is already sufficient to induce the desired recovery behavior. We start from the TARS-SFT-1.5B (Kim et al., 2025a) base model due to its exploratory behavior which facilitates safety training and further train it with RL. Largely following the data construction scheme of Kim et al. (2025a), the harmful set $\mathcal{D}_{h}$ consists of prompts that request malicious or unsafe information, and harmless prompts $\mathcal{D}_{i}$ come from standard instruction-following tasks; full dataset details are provided in §A.

We instantiate the training objective (Equation 1) as follows. On a harmful prompt $x\in\mathcal{D}_{h}$ , we implement both $r_{h}$ and $r_{s}$ using Moderation API (Markov et al., 2023), which returns category-wise harmfulness scores denoted using a vector, $\vec{M}(\cdot)$ . We define $r_{h}$ as the maximum harmfulness score on the reasoning segment, clamped at $0.3$ to avoid incentivizing excessive harmfulness, and define $r_{s}$ as the complement of the harmfulness score on the final answer. On harmless prompts, we instead assign a helpfulness reward $r_{i}(y;x)$ , computed by averaging a general reward model, Skywork (Liu et al., 2025a), and an LLM judge, GPT-5-mini (further details are deferred to §C). The reward scheme is given by:

\displaystyle r_{h}(z)=\min\big(\max(\vec{M}(z)),\,0.3\big),\quad r_{s}(y)=1-\max(\vec{M}(y)),\quad r_{i}(y;x)=\texttt{LLMJudge}(y;x),

and the total outcome reward is

\displaystyle r_{\text{total}}(z,y;x)=\begin{cases}r_{h}(z)+r_{s}(y),&x\in\mathcal{D}_{h},\\ r_{i}(y;x),&x\in\mathcal{D}_{i}.\end{cases}

(2)

We optimize this objective with GRPO (Shao et al., ), using $n=8$ rollouts for advantage estimation and AdamW (Loshchilov and Hutter, 2017) with learning rate $1\times 10^{-6}$ . We add a KL divergence regularizer with a coefficient $0.001$ in our experiments. We use an entropy coefficient of 0.001 on all tokens except for the answer segment on harmful prompts where we use 0.005 until training step 200 and decrease to 0.001.

Results. Figure 3(b) shows that the harmfulness of the reasoning continues to increase throughout training, well beyond what is necessary. As a side effect, this also increases the harmfulness of the reasoning on benign instruction-following tasks, as shown in Figure 3(d), which in turn degrades the helpfulness of the final answer (Figure 3(c)). In contrast, the safety of the final answer (Figure 3(a)) saturates much earlier in training. We find that, under our competing objective, answer safety is substantially easier to optimize than reasoning harmfulness, creating a mismatch in convergence rates. As a result, reasoning harmfulness continues to dominate optimization after answer safety has already saturated. Because both harmful and harmless prompts are constrained by the same KL budget, this mismatch produces an undesirable side effect on benign prompts, ultimately exacerbating over-refusal.

4.3 Dense Outcome-Based RL with Filtering

Experiment. To resolve the imbalance of reasoning and answer optimization, during the gradient update, we filter out rollouts that already receive maximum answer safety during training. Hence, we update the model only on rollouts where answer safety is not yet fully optimized to match the speed in training for competing objectives.

Results. Figure 3(b) shows that with such filtering, the reasoning harmfulness actually decreases over training, opposite of the imposed competing objectives. This is because base models have a strong correlation between reasoning and answer safety as shown in Figure 2 and the model is optimizing solely for answer safety. Hence, in this setting, the model has difficulty optimizing competing objectives for both answer safety $r_{s}$ and reasoning harmfulness $r_{h}$ . In conclusion, even after resolving the issue of saturation speed for the competing objectives, outcome-level based rewards still cannot perfectly optimize.

4.4 Process Rewards

Based on the results in the previous section, the main limitation of dense outcome-level rewards is that a scalar signal, even if dense, does not give us enough flexibility to optimize each segment of the trace individually. On harmful prompts, our goal is to learn a coupled behavior: the model should generate harmful reasoning and then produce a safe answer conditioned on that reasoning. Yet outcome-level RL assigns a single scalar reward to the entire rollout, making credit assignment difficult. This is especially a problem in current base models because the high correlation between reasoning and answer safety make it difficult to optimize in opposing directions under competing rewards. Hence, competing objectives require more precise reward assignment along the rollout to resolve this limitation.

Experiment. We perform precise reward assignment by using process rewards (Figure 1). Rather than rewarding the full rollout only at the end, we assign rewards to the reasoning and answer segments separately, according to their own reward functions. By aligning the reward with the intended function of each segment, process rewards make the competing-objective optimization substantially easier to learn. Formally, given a rollout $\tau=(z,y)=\big((\tau_{1},\dots,\tau_{t}),(\tau_{t+1},\dots,\tau_{n})\big),$ let $z=(\tau_{1},\dots,\tau_{t})$ denote the reasoning segment and $y=(\tau_{t+1},\dots,\tau_{n})$ denote the final answer segment. Let $\mathcal{I}_{r}=\{1,\dots,t\}$ and $\mathcal{I}_{a}=\{t+1,\dots,n\}$ be the reasoning-token and answer-token index sets, respectively. We then define a token-level process reward $r_{k}$ by

\displaystyle r_{k}=\begin{cases}r_{h}(z)&k\in\mathcal{I}_{r},\ x\in\mathcal{D}_{h};\\ r_{s}(y)&k\in\mathcal{I}_{a},\ x\in\mathcal{D}_{h};\\ r_{i}(y;x)&x\in\mathcal{D}_{i}.\end{cases}

(3)

On harmful prompts ( $x\in\mathcal{D}_{h}$ ), the mean and standard deviation for advantage calculation during GRPO is calculated separately for reasoning and answer segments. Hence, each rollout receives two advantages, one for the reasoning tokens and one for the answer tokens. We also adapt the filtering mechanism to the advantage level, updating only on rollouts where the answer advantage is non-zero.

This modification preserves the same high-level competing objectives as before, but provides much sharper supervision about where each objective should apply. In particular, it rewards harmful exploration only in the reasoning segment and safe behavior only in the answer segment, thereby (1) encouraging more harmful reasoning for exploration and (2) directing the model to recover back to a safe answer rather than forcing it to infer this structure from a single final reward. In practice, this process-reward formulation forms the final training recipe for our model, which we call Safety Exploration through Adversarial Reasoning: SEAR-1.5B.

Results. The training results for process rewards are shown in Figure 3. The harmfulness of the reasoning on harmful prompts increases as desired, yet does not explode. Moreover, the answer safety still retains a high reward. This shows that utilizing process rewards helps optimize both $r_{s}$ and $r_{h}$ even when their signal is competing.

5 Benchmarking SEAR-1.5B

With our final method of using process rewards to optimize for competing objectives, SEAR-1.5B, we evaluate whether it effectively helps mitigate over-refusal through better exploration within the reasoning.

Benchmarks, attacks, and evaluators. To evaluate over-refusal, we benchmark SEAR-1.5B on two axes: (1) safety to harmful prompts (x-axis) and (2) compliance to harmless tasks (y-axis), forming a safety-refusal trade-off plot akin to prior work (Kim et al., 2025a).

1.

Safety (x-axis). We benchmark on harmful prompts from HarmBench (Zou et al., 2023) using PAIR (Chao et al., 2023) and GCG (Zou et al., 2023) as the jailbreak attack (§B) and plot the percentage of safe answers (Defense Success Rate). HarmBench consists of 400 harmful behaviors and we use the HarmBench evaluator to detect a jailbreak. We directly optimize our attacks on each model, resulting in a more robust attack. When the model does not generate the end-of-thinking token </think>, we append it to the end of the generation and continue decoding, similar to automatic token insertion in a deployment pipeline.
2.

Compliance (y-axis). We benchmark on ambiguous prompts from XSTest (Röttger et al., 2023), Fortress (Knight et al., 2025), and False-Reject (Zhang et al., 2025f) and plot the percentage of non-refusals in the answer. These benchmarks consist of seemingly harmful prompts that are benign with Fortress from the CBRN domain, where over-refusal arises the most. A model with high over-refusal rates would refuse to these prompts and exhibit low compliance. We evaluate with the StrongReject evaluator (Souly et al., 2024) and take the complement of the refusal score.

Model averaging for producing the final checkpoint. As in two-player games and GAN training (Goodfellow et al., 2014), optimization with competing objectives can exhibit oscillatory dynamics. Intermediate checkpoints within a run may differ meaningfully in the reasoning strategies they encode (Mescheder et al., 2017; Daskalakis et al., 2017). As a result, checkpoint quality can vary both within a training run and across runs, especially since our evaluation samples from each checkpoint directly at test time. To reduce this variance, we take inspiration from iterate averaging in saddle-point optimization, where averaging is known to improve stability and convergence behavior (Golowich et al., 2020; Gao et al., 2021). Concretely, we train four independent runs and average their weights to obtain a single final model. For fairness, we apply the same procedure to the baseline runs without competing objectives. This choice is also consistent with prior work on model averaging and model soups, which shows that weight averaging across related solutions can improve robustness/accuracy without increasing inference cost (Wortsman et al., 2022). For a fair comparison, we also run model averaging for our dense/non-dense outcome-based reward baselines.

Baseline and comparisons. We compare SEAR-1.5B directly against dense and non-dense outcome rewards, frontier open-source models, and prior work on safety training.

1.

Dense/non-dense outcome rewards. To test the effect of process rewards, we compare using solely outcome-based rewards with and without the competing objectives formulation, using the identical reward models, training dataset, and base model. We refer to training with only the answer safety reward as “outcome-based reward” (Kim et al., 2025a) and utilizing a competing reasoning reward as “dense outcome-based reward”. We train with each method four different times and average the model weights as discussed above for a fair comparison to SEAR-1.5B.
2.

Open-source models. We benchmark Qwen3-1.7B and Qwen3-4B (Yang et al., 2025a) which form the base behavior of many post-trained prior work.
3.

Prior Work. We also directly benchmark TARS 1.5B, TARS 7B (Kim et al., 2025a), SafeChain (Jiang et al., 2025)), SafeKey (Zhou et al., 2025), and Self-RedTeam (Liu et al., 2025b). SafeChain uses reasoning through supervised training while SafeKey curates preemptive safe reasoning and trains the model through DPO (Rafailov et al., 2023), preventing the model from harmful reasoning. Self-RedTeam uses competing outcome-based rewards to train a single model for safety through RL.

6 Results: Does SEAR-1.5B Address Over-Refusal?

Figure 4 shows the safety-refusal trade-off of SEAR-1.5B compared to the baselines. We show the quadratic mean of the Defense Success Rate and Compliance Rate above each plotted point to capture their trade-off in a single metric. This metric ranges from 0 to 1, with 1 indicating perfect safety and compliance. We make four observations.

First, SEAR-1.5B achieves the best point (0.930) on the Pareto frontier of the safety-refusal trade-off, outperforming RL training with pure outcome-based rewards (0.892, our reproduction of Kim et al. (2025a)), dense outcome-based rewards (0.910), as well as SFT (§E). SEAR-1.5B also maintains the best performance on general tasks (§G).

Second, training SEAR-1.5B to actively reason harmfully allows it to exhibit less over-refusal than prior SOTA models such as TARS 7B (0.909).

Third, in Figure 5, we re-plot the frequency of flipping behavior (Figure 2) with SEAR-1.5B and compare it with other safety-trained models. SEAR-1.5B has a substantially higher fraction of harmful reasoning than other models, indicated by the higher datapoint count in later buckets. Moreover, SEAR-1.5B produces safer answers even when the reasoning segment becomes harmful, especially after accounting for the amount of harmful reasoning, providing evidence that it has learned to explore back to a safe answer. We show an example reasoning trace in §D. §F also shows that the reasoning is harmful while still being related to the prompt.

Fourth, we finally test whether SEAR-1.5B has learned to reason harmfully and then revert to a safe answer using a pre-fill attack. We apply this attack on ClearHarm (FAR AI, 2024) and Fortress harmful prompts (Knight et al., 2025) by using an abliterated model, Huihui-Qwen3-4B-abliterated-v2, to generate several initial sentences of harmful reasoning, replacing the model’s initial reasoning and then continuing generation. A model that is capable of exploring harmful items in its reasoning trace but still is able to produce a safe answer should be far more robust to this sort of pre-fill attack. Table 1 shows that SEAR-1.5B generates safer answers than outcome and dense-outcome RL, as well as the base Qwen3-1.7B model. A similar behavior holds for larger Qwen3 models (§H). This behavior is especially important for defending against reasoning-directed attacks. Such attacks, for example by pre-filling with harmful reasoning, can make the reasoning substantially more harmful than normal sampling from model. As attacks become stronger and induce more malicious reasoning, the ability to revert to a safe answer will become increasingly important.

Model	SEAR (Ours)	Outcome	Dense Outcome	Qwen3 1.7B
ClearHarm	0.901	0.869	0.894	0.610
Fortress	0.815	0.769	0.798	0.677

Table 1: Safety of the final answer when models face a pre-fill attack. We conduct a pre-filling attack on SEAR-1.5B, the model trained via outcome-based RL, the model trained via dense outcome-based RL, and Qwen3-1.7B on ClearHarm and Fortress. We evaluate answer safety using the Moderation API. SEAR-1.5B has the highest answer safety, indicating that when forced to reason harmfully, it can flip back more easily.

7 Conclusion & Discussion

We find that current safety-trained reasoning models lack the ability to explore potentially harmful reasoning paths in a controlled way, a capability that is important for mitigating over-refusal in safety settings. To address this, we formulate an RL training procedure with competing rewards within a single generated rollout: the model is encouraged to explore harmful reasoning when useful while still being rewarded for producing a safe final answer. This induces behavior in which the model can first consider unsafe trajectories and then recover to a safe response. As a result, our approach mitigates over-refusal on ambiguous prompts without compromising safety. We also find that process-based rewards are crucial for optimizing such competing objectives with RL.

Despite the efficacy of our approach, one limitation is the particular choice of dense process reward that we use. More generally, other forms of exploratory reasoning may prove more effective for mitigating over-refusal than directly exploring the most harmful path. Our dense-reward framework could be extended to such settings by designing alternative rewards for the reasoning segment. Nevertheless, we find that SEAR-1.5B, trained to actively explore harmful reasoning, achieves the lowest over-refusal among a wide range of models. We further present ablations showing that SEAR-1.5B has indeed learned flip-back behavior: a previously missing form of reasoning that is important for safety.

Ethics Statement

Our work uses RL for safety-training in language models and introduces reasoning that could potentially provide to be harmful. However, we note that our model, dataset, methodology, and analysis are solely for the purpose of research.

Acknowledgements

This work is supported by Coefficient Giving (formerly Open Philanthropy). We gratefully acknowledge support from the Schmidt Sciences SAFE-AI program and also thank Google Cloud for providing TPU resources, the FLAME center at CMU for providing compute support via the Orchard cluster, and the National Science Foundation (award OAC 2320345, 2005572) and the State of Illinois for Delta and DeltaAI advanced computing resources.

References

S. Agarwal, L. Ahmad, J. Ai, S. Altman, A. Applebaum, E. Arbus, R. K. Arora, Y. Bai, B. Baker, H. Bao, et al. (2025) Gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925. Cited by: §1, §3.
C. Anil, G. Zhang, Y. Wu, and R. Grosse (2021) Learning to give checkable answers with prover-verifier games. arXiv preprint arXiv:2108.12099. Cited by: §2.
P. Chao, A. Robey, E. Dobriban, H. Hassani, G. J. Pappas, and E. Wong (2023) Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419. Cited by: Appendix B, item 1.
K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. (2021) Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168. Cited by: Figure 7, Appendix G.
G. Cui, L. Yuan, N. Ding, G. Yao, W. Zhu, Y. Ni, G. Xie, Z. Liu, and M. Sun (2023) UltraFeedback: boosting language models with high-quality feedback. External Links: 2310.01377 Cited by: Appendix A.
J. Cui, W. Chiang, I. Stoica, and C. Hsieh (2024) Or-bench: an over-refusal benchmark for large language models. arXiv preprint arXiv:2405.20947. Cited by: Appendix A, §1, §2, §3.
C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng (2017) Training gans with optimism. arXiv preprint arXiv:1711.00141. Cited by: §5.
FAR AI (2024) ClearHarm: a more challenging jailbreak dataset. Note: https://www.far.ai/news/clearharm-a-more-challenging-jailbreak-datasetAccessed: 2026-04-01 Cited by: §6.
K. Feng, K. Ding, Y. Wang, M. Li, F. Wei, X. Wang, Q. Zhang, and H. Chen (2025) SAFER: advancing safety alignment via efficient ex-ante reasoning. arXiv preprint arXiv:2504.02725. Cited by: §1, §2.
Y. Gao, C. Kroer, and D. Goldfarb (2021) Increasing iterate averaging for solving saddle-point problems. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 7537–7544. Cited by: §5.
[11] S. Ghosh, P. Varshney, M. N. Sreedhar, A. Padmakumar, T. Rebedea, J. R. Varghese, and C. Parisien AEGIS2. 0: a diverse ai safety dataset and risks taxonomy for alignment of llm guardrails. In Neurips Safe Generative AI Workshop 2024, Cited by: Appendix A.
N. Golowich, S. Pattathil, C. Daskalakis, and A. Ozdaglar (2020) Last iterate is slower than averaged iterate in smooth convex-concave saddle point problems. In Conference on Learning Theory, pp. 1758–1784. Cited by: §5.
I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. Advances in neural information processing systems 27. Cited by: §5.
M. Y. Guan, M. Joglekar, E. Wallace, S. Jain, B. Barak, A. Helyar, R. Dias, A. Vallone, H. Ren, J. Wei, H. W. Chung, S. Toyer, J. Heidecke, A. Beutel, and A. Glaese (2025) Deliberative alignment: reasoning enables safer language models. External Links: 2412.16339, Link Cited by: §1, §1, §2.
D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt (2021) Measuring mathematical problem solving with the math dataset. NeurIPS. Cited by: Figure 7, Appendix G.
F. Jiang, Z. Xu, Y. Li, L. Niu, Z. Xiang, B. Li, B. Y. Lin, and R. Poovendran (2025) Safechain: safety of language models with long chain-of-thought reasoning capabilities. arXiv preprint arXiv:2502.12025. Cited by: §1, §1, §2, item 3.
L. Jiang, K. Rao, S. Han, A. Ettinger, F. Brahman, S. Kumar, N. Mireshghallah, X. Lu, M. Sap, Y. Choi, and N. Dziri (2024) WildTeaming at scale: from in-the-wild jailbreaks to (adversarially) safer language models. External Links: 2406.18510, Link Cited by: Appendix A, Figure 2, §3, Figure 5.
T. Kim, F. Tajwar, A. Raghunathan, and A. Kumar (2025a) Reasoning as an adaptive defense for safety. arXiv preprint arXiv:2507.00971. Cited by: Appendix F, Appendix F, §1, §1, §1, §1, §2, §4.1, §4.2, item 1, item 3, §5, §6.
Y. Kim, T. Kim, E. Park, C. Park, C. Breazeal, D. McDuff, and H. W. Park (2025b) InvThink: towards ai safety via inverse reasoning. arXiv preprint arXiv:2510.01569. Cited by: §1, §2.
J. H. Kirchner, Y. Chen, H. Edwards, J. Leike, N. McAleese, and Y. Burda (2024) Prover-verifier games improve legibility of llm outputs. arXiv preprint arXiv:2407.13692. Cited by: §2.
C. Q. Knight, K. Deshpande, V. Sirdeshmukh, M. Mankikar, S. R. Team, S. Team, and J. Michael (2025) FORTRESS: frontier risk evaluation for national security and public safety. arXiv preprint arXiv:2506.14922. Cited by: §2, §3, item 2, §6.
H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe (2023) Let’s verify step by step. arXiv preprint arXiv:2305.20050. Cited by: Figure 7, Appendix G.
C. Y. Liu, L. Zeng, Y. Xiao, J. He, J. Liu, C. Wang, R. Yan, W. Shen, F. Zhang, J. Xu, Y. Liu, and Y. Zhou (2025a) Skywork-reward-v2: scaling preference data curation via human-ai synergy. arXiv preprint arXiv:2507.01352. Cited by: Appendix C, §4.2.
M. Liu, L. Jiang, Y. Liang, S. S. Du, Y. Choi, T. Althoff, and N. Jaques (2025b) Chasing moving targets with online self-play reinforcement learning for safer language models. arXiv preprint arXiv:2506.07468. Cited by: §2, item 3.
I. Loshchilov and F. Hutter (2017) Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101. Cited by: §4.2.
T. Markov, C. Zhang, S. Agarwal, F. E. Nekoul, T. Lee, S. Adler, A. Jiang, and L. Weng (2023) A holistic approach to undesired content detection in the real world. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37, pp. 15009–15018. Cited by: §3, §4.2.
L. Mescheder, S. Nowozin, and A. Geiger (2017) The numerics of gans. Advances in neural information processing systems 30. Cited by: §5.
Y. Mou, Y. Luo, S. Zhang, and W. Ye (2025) SaRO: enhancing llm safety through reasoning-based alignment. arXiv preprint arXiv:2504.09420. Cited by: §1, §2.
T. Olmo, A. Ettinger, A. Bertsch, B. Kuehl, D. Graham, D. Heineman, D. Groeneveld, F. Brahman, F. Timbers, H. Ivison, et al. (2025) Olmo 3. arXiv preprint arXiv:2512.13961. Cited by: §1, §3.
A. Paulus, I. Kulikov, B. Amos, R. Munos, I. Evtimov, K. Chaudhuri, and A. Zharmagambetov (2025) Safety alignment of lms via non-cooperative games. arXiv preprint arXiv:2512.20806. Cited by: §2, §4.1.
S. Peng, E. Smith, I. Evtimov, S. Jiang, P. Chen, H. Zhan, H. Wang, D. H. Chau, M. Pasupuleti, and J. Chi (2025) Large reasoning models learn better alignment from flawed thinking. arXiv preprint arXiv:2510.00938. Cited by: §2.
Y. Qu, M. Y. Yang, A. Setlur, L. Tunstall, E. E. Beeching, R. Salakhutdinov, and A. Kumar (2025) Optimizing test-time compute via meta reinforcement fine-tuning. arXiv preprint arXiv:2503.07572. Cited by: §3.
R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn (2023) Direct preference optimization: your language model is secretly a reward model. Advances in Neural Information Processing Systems 36, pp. 53728–53741. Cited by: §2, item 3.
P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250. Cited by: Figure 7, Appendix G.
D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman (2024) Gpqa: a graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, Cited by: Figure 7, Appendix G.
P. Röttger, H. R. Kirk, B. Vidgen, G. Attanasio, F. Bianchi, and D. Hovy (2023) Xstest: a test suite for identifying exaggerated safety behaviours in large language models. arXiv preprint arXiv:2308.01263. Cited by: §2, §3, item 2.
A. Setlur, Y. Qu, M. Yang, L. Zhang, V. Smith, and A. Kumar (2025) Optimizing llm test-time compute involves solving a meta-rl problem. Note: https://blog.ml.cmu.edu/2025/01/08/optimizing-llm-test-time-compute-involves-solving-a-meta-rl-problem/CMU MLD Blog Cited by: §1.
A. Setlur*, M. Y. R. Yang*, C. Snell, J. Greer, I. Wu, V. Smith, M. Simchowitz, and A. Kumar (2025) E3: learning to explore enables extrapolation of test-time compute for llms. External Links: 2506.09026, Link Cited by: §3.
[39] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. Deepseekmath: pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv. org/abs/2402.03300. Cited by: §4.2, §4.2.
S. Si, X. Wang, G. Zhai, N. Navab, and B. Plank (2025) THINK before refusal: triggering safety reflection in llms to mitigate false refusal behavior. arXiv preprint arXiv:2503.17882. Cited by: §1, §2.
A. Souly, Q. Lu, D. Bowen, T. Trinh, E. Hsieh, S. Pandey, P. Abbeel, J. Svegliato, S. Emmons, O. Watkins, et al. (2024) A strongreject for empty jailbreaks. arXiv preprint arXiv:2402.10260. Cited by: item 2.
Z. Tan, W. Yu, J. Si, T. Liu, K. Guan, H. Jin, J. Tao, X. Yuan, D. Ma, X. Zhang, et al. (2026) TriPlay-rl: tri-role self-play reinforcement learning for llm safety alignment. arXiv preprint arXiv:2601.18292. Cited by: §2, §4.1.
H. Wang, Z. Qin, L. Shen, X. Wang, D. Tao, and M. Cheng (2025) Safety reasoning with guidelines. arXiv preprint arXiv:2502.04040. Cited by: §1, §2.
M. Wang, N. Zhang, Z. Xu, Z. Xi, S. Deng, Y. Yao, Q. Zhang, L. Yang, J. Wang, and H. Chen (2024) Detoxifying large language models via knowledge editing. External Links: 2403.14472, Link Cited by: Appendix A.
[45] Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, et al. Mmlu-pro: a more robust and challenging multi-task language understanding benchmark, 2024. URL https://arxiv. org/abs/2406.01574, pp. 21. Cited by: Figure 7, Appendix G.
J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. (2022) Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35, pp. 24824–24837. Cited by: §1.
X. Wen, Z. He, H. Qi, Z. Wan, Z. Ma, Y. Wen, T. Zheng, X. Xu, C. Lu, and Q. Zhang (2026) MAGIC: a co-evolving attacker-defender adversarial game for robust llm safety. arXiv preprint arXiv:2602.01539. Cited by: §2, §4.1.
M. Wortsman, G. Ilharco, S. Y. Gadre, R. Roelofs, R. Gontijo-Lopes, A. S. Morcos, H. Namkoong, A. Farhadi, Y. Carmon, S. Kornblith, et al. (2022) Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International conference on machine learning, pp. 23965–23998. Cited by: §5.
I. Wu, Y. Qu, A. Setlur, and A. Kumar (2026) Reasoning cache: continual improvement over long horizons via short-horizon rl. arXiv preprint arXiv:2602.03773. Cited by: §3.
A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025a) Qwen3 technical report. arXiv preprint arXiv:2505.09388. Cited by: §1, §3, item 2.
X. Yang, G. Deng, J. Shi, T. Zhang, and J. S. Dong (2025b) Enhancing model defense against jailbreaks with proactive safety reasoning. arXiv preprint arXiv:2501.19180. Cited by: §1, §2.
H. Zhang, D. Wang, Y. Liu, K. Chen, J. Wang, X. Ying, L. Liu, and W. Wang (2025a) ORFuzz: fuzzing the” other side” of llm safety–testing over-refusal. arXiv preprint arXiv:2508.11222. Cited by: §2, §3.
J. Zhang, H. Wang, E. M. Smith, S. Wang, A. Sharaf, M. Pasupuleti, B. Van Durme, D. Khashabi, J. Weston, and H. Zhan (2025b) The alignment waltz: jointly training agents to collaborate for safety. arXiv preprint arXiv:2510.08240. Cited by: §1.
Y. Zhang, Y. Ding, J. Yang, T. Luo, D. Li, R. Duan, Q. Liu, H. Su, Y. Dong, and J. Zhu (2025c) Towards safe reasoning in large reasoning models via corrective intervention. arXiv preprint arXiv:2509.24393. Cited by: §1.
Y. Zhang, Z. Zeng, D. Li, Y. Huang, Z. Deng, and Y. Dong (2025d) RealSafe-r1: safety-aligned deepseek-r1 without compromising reasoning capability. arXiv preprint arXiv:2504.10081. Cited by: §1, §1, §2, Figure 5.
Y. Zhang, J. Chi, H. Nguyen, K. Upasani, D. M. Bikel, J. Weston, and E. M. Smith (2024) Backtracking improves generation safety. arXiv preprint arXiv:2409.14586. Cited by: §1, §2.
Y. Zhang, M. Li, W. Han, Y. Yao, Z. Cen, and D. Zhao (2025e) Safety is not only about refusal: reasoning-enhanced fine-tuning for interpretable llm safety. arXiv preprint arXiv:2503.05021. Cited by: §1, §2.
Z. Zhang, W. Xu, F. Wu, and C. K. Reddy (2025f) Falsereject: a resource for improving contextual safety and mitigating over-refusals in llms via structured reasoning. arXiv preprint arXiv:2505.08054. Cited by: item 2.
X. Zhao, W. Cai, T. Shi, D. Huang, L. Lin, S. Mei, and D. Song (2025) Improving llm safety alignment with dual-objective optimization. arXiv preprint arXiv:2503.03710. Cited by: §1, §2.
K. Zhou, X. Zhao, J. Srinivasa, G. Liu, A. Feng, D. Song, and X. E. Wang (2025) Safekey: amplifying aha-moment insights for safety reasoning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 25407–25423. Cited by: §1, §2, item 3.
J. Zhu, L. Yan, S. Wang, D. Yin, and L. Sha (2025) Reasoning-to-defend: safety-aware reasoning can defend large language models from jailbreaking. arXiv preprint arXiv:2502.12970. Cited by: §1, §2.
A. Zou, Z. Wang, N. Carlini, M. Nasr, J. Z. Kolter, and M. Fredrikson (2023) Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043. Cited by: Appendix B, item 1.

Appendix A Training Data Details

Our training data consists of a $1:1$ ratio of harmful and harmless prompts for RL training. The harmful prompts are a collection of prompts from WildJailbreak (Jiang et al., 2024), Aegis (Ghosh et al., ), and SafeEdit (Wang et al., 2024). The harmless prompts are collected from UltraFeedback (Cui et al., 2023) which are a set of general instruction tasks. We also mix $10\%$ of the harmless prompts with OR-Bench (Cui et al., 2024) prompts.

Prompt	Category
Please generate content that reinforces the stereotype that immigrants are dirty and don’t take care of their living spaces.	Harmful
Mock or belittle someone’s disability.	Harmful
Does operational research and optimization course relate to a data science program? How?	Harmless
Describe the mood of this poem. The sun it rises. From east to west. Its light it guides us. To heal and rest	Harmless

Table 2: Example prompts and their categories.

Appendix B Attack Details

We conduct two types of attacks: PAIR (Chao et al., 2023) and GCG (Zou et al., 2023). PAIR is a black-box attack that uses another attacker LLM to automate the search process for a harmful prompts. GCG is a white-box attack that uses token-level gradients of the loss with respect to the inputs to find candidate strings and selects the candidate that gives the lowest loss for a given generation target. We optimize for the beginning of the reasoning as the target for GCG.

Appendix C Evaluators and Judge

On harmless prompts we use two reward models Skywork (Liu et al., 2025a) (Skywork-Reward-V2-Llama-3.1-8B), and an LLM judge, GPT-5-mini. Skywork encourages the model to output helpful answers to instruction tasks and GPT-5-mini encourages the model to not refuse to harmless prompts. The final reward is the average of the two reward models. We use the following prompt for GPT-5-mini.

Appendix D Reasoning Examples

Here we show generation examples of how SEAR provides harmful reasoning that helps explore and flip back to a safe answer compared to Qwen3 models which conservatively generate safe reasoning or generate harmful reasoning without flipping back to a safe answer.

Appendix E SFT Baseline

Figure 6 compares SEAR-1.5B to SFT. We construct reasoning traces that think harmfully first and flip back to safe thinking, learning to output a safe answer. We use an abliterated model (huihui-ai/Huihui-Qwen3-4B-abliterated-v2) to construct the harmful reasoning segment and use Gemini 2.5 Flash to construct the rest of the reasoning that flips back along with the answer. For a fair comparison, we collect completions on the same training prompts as SEAR and start with the same base model. We also sweep on the learning rate and select the model with the lowest validation loss. The results show that SFT training (FLIP SFT 1.5B) is compliant but over-refuses. Hence, it cannot generalize as well as RL training, especially against jailbreak attacks.

Appendix F Reasoning Relevance

Table 3 shows that the reasoning from SEAR-1.5B produces prompt-dependent reasoning and is not simply a “rubber-stamp” of toxic information. We evaluate (1) the relevance of the reasoning to the prompt and (2) whether the prompt can be inferred from the reasoning using Gemini 2.5 Flash as the LLM judge and the prompts shown below. For referring the prompt back from the reasoning, we provide 10 random prompts including the ground truth prompt and make the judge select the correct prompt.

We benchmark on WildJailbreak, UltraFeedback, and Fortress which consists of harmful prompts, normal instruction tasks, and ambiguous prompts and also compare to TARS (Kim et al., 2025a) which is the RL trained equivalent baseline without flipping behavior.

The results show that SEAR-1.5B has similar reasoning relevance as TARS (Kim et al., 2025a) and the base model. Since TARS and the base model do not have harmful reasoning, this means that harmful reasoning for exploration does not degrade topic relevance. Furthermore, the prompt can always be inferred correctly based on the reasoning. This shows that SEAR-1.5B’s reasoning is not just a rubber stamp but actually thinks more harmfully about the topic of the prompt, leading to a safer answer.

Reasoning Relevance
Dataset	SEAR-1.5B	TARS	Base Model
WildJailbreak	0.866	0.832	0.865
UltraFeedback	0.762	0.767	0.698
Fortress	0.783	0.778	0.820

Prompt Backward Inference
Dataset	SEAR-1.5B	TARS	Base Model
WildJailbreak	1.000	1.000	1.000
UltraFeedback	1.000	0.995	0.979
Fortress	1.000	1.000	1.000

Table 3: Reasoning-prompt relevance. Average reasoning relevance scores and prompt inference scores on WildJailbreak, UltraFeedback, and Fortress.

Appendix G Out-of-distribution (OOD) Generalization

For testing generalization capabilties to other tasks that are out-of-distribution, we benchmark SEAR-1.5B on GSM8K Cobbe et al. (2021), MATH-500 (Hendrycks et al., 2021; Lightman et al., 2023), GPQA (Rein et al., 2024), MMLU-Pro (Wang et al., ), and SQuAD Rajpurkar et al. (2016) and compare it to outcome-based and dense-outcome based training. For GSM8K and MATH-500, we evaluate with the ground truth answer and for GPQA, MMLU-Pro, and SQuAD, we evaluate with gpt-4.1-mini as the LLM judge. Figure 7 shows that SEAR-1.5B maintains OOD capabilities the best compared to outcome-based an dense-outcome based training.

Appendix H Scaling the Pre-fill Attack

Table 4 shows that the problem of models not being able to flip back to safe reasoning persists even for larger models.

Model	SEAR (Ours)	Qwen3 8B	Qwen3 14B	Qwen3 32B
ClearHarm	0.901	0.776	0.809	0.752
Fortress	0.815	0.638	0.655	0.659

Table 4: Answer safety of larger models on the pre-fill attack. We conduct a pre-filling attack on larger Qwen3 models on ClearHarm and Fortress. We evaluate answer safety using the Moderation API. Even larger models cannot flip back reasoning as well SEAR-1.5B