License: CC BY 4.0
arXiv:2606.32017v1 [cs.LG] 30 Jun 2026

TRIAGE: Role-Typed Credit Assignment for Agentic Reinforcement Learning

Yuanda Xu1   Zhengze Zhou1,*   Hejian Sang1,*   Xiaomin Li2
Jiaxin Zhang3   Xinchen Du1   Zhipeng Wang1   Alborz Geramifard1
1LinkedIn Corporation  2Harvard University  3Johns Hopkins University
* Equal contribution. Hejian Sang’s work was done while at LinkedIn Corporation.
Correspondence to yuanda@math.princeton.edu
Work done during an internship at LinkedIn Corporation.
Abstract

Agentic reinforcement learning requires assigning credit to environment-facing actions such as searches, clicks, edits, navigation commands, and object interactions. Standard GRPO uses the final verifier outcome as a uniform advantage over all action tokens. This outcome signal is useful but structurally incomplete: it punishes useful exploration in failed rollouts and reinforces redundant or regressive actions in successful rollouts. We propose TRIAGE, a role-typed credit assignment framework that adds a semantic role axis to outcome credit. A structured judge classifies each segment as decisive progress, useful exploration, no-progress infrastructure, or regression, and a fixed role-conditioned rule maps these labels to bounded segment-level process rewards. This keeps verifier outcomes as the source of optimization direction while correcting the two main blind spots of outcome-only credit. We further show that role-conditioned credit is the optimal segment-level correction expressible from role labels alone (a projection of the per-segment advantage residual onto the role variable), so that the fixed role constants reduce advantage estimation error whenever the judge is reliable, and we connect this to lower-variance policy gradients. Across ALFWorld, Search-QA, and WebShop, TRIAGE improves success rates over GRPO for two policy models and outperforms both a scalar judge-derived process reward and an outcome-supervised shared-backbone value baseline. Ablations show that the gain comes from role typing rather than merely adding dense rewards: reliable detection of regression inside successful trajectories is the dominant contributor, while exploration credit provides a consistent secondary gain. On completed ALFWorld and WebShop rollouts, TRIAGE also reduces environment-facing turns by an additional 10.4% and 14.8% relative to GRPO.

1 Introduction

Reinforcement learning with verifiable rewards has become a standard recipe for improving language-model reasoning and agentic behavior (Shao et al., 2024; DeepSeek-AI et al., 2025; Trung et al., 2024; Yu et al., 2025; Xu et al., 2026a). In Group Relative Policy Optimization (GRPO), a policy samples multiple trajectories for a prompt, receives final rewards from a verifier, and assigns relative advantages to the sampled outputs. This recipe is attractive because it requires no learned value model and optimizes directly against the deployment policy. However, when the output is an agentic trajectory rather than a single answer, the central credit-assignment question changes: which environment-facing actions deserve credit when supervision arrives only as a final verifier outcome?

The unit of decision in this setting is not an arbitrary token span. It is an environment-facing segment: a search query, click, file edit, command, object interaction, or tool call that changes either the external state or the agent’s information state. Across WebShop, Search-QA, and ALFWorld, such segments range from decisive actions (final purchases, answer submissions, object placements) to information-gathering actions (searches, inspections, reads) and low-value infrastructure (repeated navigation or redundant clicks).

Figure 1: Core results. Across two policy models and three agentic benchmarks (ALFWorld, Search-QA, WebShop), TRIAGE consistently improves over the GRPO baseline (dashed line). Bar labels report mean success rate; vertical axes are truncated per panel to make differences visible.

Outcome credit is therefore useful but structurally incomplete. Standard GRPO treats all segments equally within a trajectory: if the trajectory succeeds, all action tokens are reinforced; if it fails, all are suppressed. This creates two systematic blind spots. First, failed rollouts can contain useful exploratory actions that should not inherit the full negative outcome credit. Second, successful rollouts can contain redundant or harmful actions that should not inherit positive credit merely because the agent later recovered. Final outcome tells us whether the trajectory solved the task, but it cannot say what local role each segment played.

Recent credit-assignment and on-policy supervision methods address parts of this problem. State-anchored estimators compare actions from matched states; process reward models learn dense progress signals; outcome-statistical methods estimate whether recurring segments concentrate in successful or failed rollouts; and token-importance methods reweight supervision within sampled outputs (Wang et al., 2025; Lu et al., 2026; Xu et al., 2026b). These approaches are useful, but they usually score each segment without specifying its semantic role: task progress, belief-state progress, harmless infrastructure, and regression should not receive the same credit rule. We test this distinction directly by comparing against two dense-signal controls—a scalar LLM process-reward baseline with the same judge and context window, and an outcome-supervised shared-backbone value baseline—so the empirical question is not whether dense segment rewards help, but whether role typing adds information beyond them.

Our central claim is therefore: agentic RL needs a role axis in addition to an outcome axis. The most important distinction is that exploration is not no-progress. Exploration often has zero immediate task progress and may appear in both successful and failed trajectories. A purely outcome-statistical estimator can under-credit it because exploratory actions are not always success-specific. A generic process scorer can also conflate exploration with no-progress when no subgoal is completed immediately. Yet suppressing exploration is precisely how sparse-reward agent training becomes brittle: the policy learns to avoid information-gathering actions before it has enough information to act decisively.

We propose TRIAGE, a simple framework for role-aware credit estimation. Like medical triage, which sorts patients by the kind of attention they need before allocating treatment, TRIAGE first sorts each environment-facing segment into a semantic role before deciding how much credit it should inherit from the trajectory outcome. TRIAGE uses a structured LLM judge as a role classifier, not as an unconstrained reward model. Given a bounded local context around each segment, the judge assigns one primary role: decisive progress, useful exploration, no-progress infrastructure, or regression. The RL algorithm then maps roles to different credit rules. Decisive progress receives strong outcome-aligned credit, useful exploration receives bounded positive credit, no-progress infrastructure is dampened toward zero, and regression is suppressed even when it appears in an otherwise successful trajectory.

This design deliberately separates semantic diagnosis from optimization direction. An LLM is well suited to answering local questions such as whether an action inspected a relevant file, narrowed a search, damaged state, or repeated known information. It is less suited to replacing the verifier. TRIAGE therefore keeps the GRPO outcome advantage as the base training signal and uses the role classifier only to add bounded process rewards or penalties at the segment level.

We make four contributions:

  1. We identify two structural blind spots of outcome-only segment credit (useful exploration in failed rollouts and regression inside successful rollouts) and define a four-role taxonomy that adds a semantic role axis to trajectory-level outcome credit.

  2. We introduce TRIAGE, a role-conditioned credit assignment framework that uses a structured LLM judge for semantic role typing while keeping the GRPO outcome advantage as the source of optimization direction.

  3. We give a theoretical justification: role-conditioned credit is the MSE-optimal segment correction measurable from role labels alone, the fixed role constants reduce advantage estimation error whenever they are aligned with this optimum, and this connects to unbiased, lower-variance policy gradients (Section 4.1).

  4. We empirically evaluate TRIAGE across diverse agentic tasks and show consistent gains over GRPO, scalar judge-derived process rewards, and an outcome-supervised value baseline, while using manually labeled segments and role diagnostics to explain when the improvement comes from exploration retention, infrastructure damping, or regression suppression.

Figure 2: Overview of TRIAGE. Rollouts are split into environment-facing segments, a structured judge assigns semantic roles, and role-conditioned process rewards adjust segment-level GRPO advantages.

2 Problem Setup: Segment Credit in Agentic RL

GRPO.

Given a task prompt $x$, GRPO samples $G$ trajectories, scores each with a verifier $r_i = V(\tau_i) \in \{0,1\}$, and assigns the group-normalized advantage $A_i^{\mathrm{GRPO}} = (r_i - \bar{r})/(\sigma_r + \epsilon)$ uniformly to every token in the trajectory. Some environment logs report raw success rewards on a different scale, such as 10 for success and 0 for failure; throughout training and in all equations, we binarize these raw rewards to $r_i \in \{0,1\}$.
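As a minimal sketch of the group-normalized advantage above, the following computes $A_i^{\mathrm{GRPO}}$ for one prompt's group of rollouts. The function name `grpo_advantages`, the binarization threshold, the value of `eps`, and the use of the population standard deviation for $\sigma_r$ are illustrative assumptions; the text does not specify them.

```python
from statistics import mean, pstdev

def grpo_advantages(raw_rewards, success_threshold=0.0, eps=1e-6):
    """Group-normalized outcome advantages for the G rollouts of one prompt."""
    # Binarize raw verifier rewards (e.g. 10 for success, 0 for failure) to {0, 1}.
    # Assumption: any raw reward above the threshold counts as success.
    r = [1.0 if x > success_threshold else 0.0 for x in raw_rewards]
    r_bar = mean(r)
    sigma = pstdev(r)  # assumption: population std over the group
    # Every token of rollout i would later receive the same value adv[i].
    return [(ri - r_bar) / (sigma + eps) for ri in r]

adv = grpo_advantages([10, 0, 0, 10])  # two successes, two failures
```

When every rollout in a group has the same outcome, the numerator is zero for all of them, so the group contributes no outcome gradient under this formula.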

From outcome credit to segment credit.

An agentic trajectory $\tau_i = (a_{i,1}, o_{i,1}, \ldots, a_{i,K_i}, o_{i,K_i})$ consists of environment-facing action segments $a_{i,k}$ and their resulting observations $o_{i,k}$. Broadcasting a single $A_i^{\mathrm{GRPO}}$ to all segments treats a decisive purchase click, a useful diagnostic read, a harmless no-op, and a wrong edit identically. Process reward models offer one response by learning a dense value or progress score for each step (Lightman et al., 2024), but they do not by themselves specify whether a segment is exploration, infrastructure, or regression. Our goal is a segment-level advantage $A_{i,k}$ that reflects not only how good a segment is but also what role it plays, which requires a structured label rather than a role-agnostic score.

3 Why Outcome Credit Is Structurally Incomplete

Outcome credit supplies the correct trajectory-level direction, but it is a one-axis signal. It partitions rollouts into success and failure, then assigns all local decisions the same sign within each rollout. Agentic trajectories need a second axis: the local semantic role of each segment. Table 1 shows the two conflict cells that motivate this paper. A useful segment in a failed rollout should not be fully punished, and a regressive segment in a successful rollout should not inherit positive credit.

Table 1: Outcome-only credit has two conflict cells. Final success or failure gives the optimization direction for the whole trajectory, but local segment roles determine whether a segment should inherit that direction unchanged.
| Local segment role | Successful rollout | Failed rollout |
|---|---|---|
| Useful local segment | should receive positive credit | should not be fully punished |
| Regressive local segment | should not inherit success credit | should be suppressed |

We instantiate this missing role axis with four segment types. Define a role variable

$$\rho_{i,k} \in \mathcal{R} = \{D, E, N, R\}, \tag{1}$$

where $D$ denotes decisive progress, $E$ useful exploration, $N$ no-progress infrastructure, and $R$ regression. Table 2 gives the operational definition.

Table 2: Role taxonomy
| Role | Definition | Examples | Default credit rule |
|---|---|---|---|
| Decisive ($D$) | Produces verifier-checkable progress or completes a necessary subgoal | take target; buy item; correct answer | strong outcome-aligned |
| Exploration ($E$) | Reveals relevant state without immediate completion | read test; constrained search; inspect container | conditional positive |
| No-progress ($N$) | Changes neither task state nor belief state, but is harmless | duplicate click after completion; empty traversal | slightly penalized |
| Regression ($R$) | Corrupts state or repeats without information gain | wrong edit; wrong purchase; repeated examine/click | negative |

The taxonomy is intentionally not just an ordering by amount of progress. Exploration is not merely a small amount of progress. It is a different type of progress: it improves the information state rather than the environment state. This matters because many agent tasks are partially observable. Before editing a file, the agent must inspect relevant code and tests. Before buying an item, it must search and compare. Before manipulating an object, it may need to discover where the object or receptacle is. These actions should not be treated like repeated boilerplate just because they do not immediately satisfy the final verifier.

Role boundaries.

The role boundaries are defined by what the segment changes. $D$ changes verifier-checkable task state: taking the target object, selecting the required item, submitting the correct answer, or applying the edit that makes a test pass. $E$ changes the information state without yet completing a subgoal: opening a container, reading a failing test, or running a targeted search. This boundary can be blurry in hindsight because an exploratory action may enable a later decisive one, but we reserve $D$ for direct task-state progress and use $E$ for first-time, reasonable information collection.

$N$ and $R$ cover the cases that should not receive positive progress credit. $N$ is harmless infrastructure that changes neither task state nor information state, such as an empty traversal or a generic command that does not affect the next decision. $R$ is locally harmful or redundant without information gain: a wrong edit, wrong purchase, corrupted object state, or repeated inspection/click after the relevant information is already known. Final outcome cannot resolve these distinctions. Useful exploration can appear in failed trajectories, and regression can appear in successful ones after later recovery, so role-aware credit must judge the local segment rather than only its trajectory-level success label.

What the judge must get right.

The judge does not need perfect $D$/$E$ boundary agreement. Its key capability is asymmetric error correction: in successful rollouts, find local regressions that should not inherit positive credit; in failed rollouts, find locally useful segments that should not inherit full negative credit. Operationally, regression has two subclasses: state corruption (wrong edit, wrong purchase, wrong object) and redundant-without-information-gain (repeated inspection or click after the information is already known).

Implications for diagnostics.

The taxonomy also determines what we measure experimentally. Useful exploration is outcome-mixed: it appears in both successful and failed rollouts, so outcome association can make it look neutral or negative. No-progress infrastructure receives nonzero advantage under uniform broadcasting, wasting gradient on boilerplate actions. Regression can appear inside successful trajectories after later recovery, so final outcomes hide local harm. We therefore track three diagnostics in the experiments: exploration retention, infrastructure damping, and regression suppression.

4 TRIAGE: Role-Conditioned Segment Credit

TRIAGE has two components: a structured role judge and a role-conditioned process reward. The policy update remains the standard GRPO update. Rather than using the LLM judge as an unconstrained scalar reward model, TRIAGE uses a rubric-guided judge to assign one auditable semantic role per segment, and maps those roles to fixed credit rules. The only change is the advantage assigned to each environment-facing segment: we keep the trajectory-level GRPO advantage and add a bounded process reward whose form depends on the segment role.

Role-judge context window.

The training-time role judge uses a bounded local context window around each segment; in our experiments this window includes up to five previous and five future action–observation pairs. Appendix H gives the exact window definition. The judge does not receive the final verifier outcome.
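One way to picture the bounded context is the sketch below, which slices up to five previous and five future action–observation pairs around a segment. It is only an illustration: the exact window is the one defined in Appendix H, and the function name `local_window`, the tuple representation of a pair, and the toy rollout are assumptions made here.

```python
def local_window(pairs, k, radius=5):
    """Return (previous, current, future) action-observation pairs around index k.

    pairs: list of (action, observation) tuples for one rollout.
    The final verifier outcome is deliberately not part of the returned context.
    """
    if not 0 <= k < len(pairs):
        raise IndexError("segment index out of range")
    previous = pairs[max(0, k - radius):k]   # at most `radius` earlier pairs
    future = pairs[k + 1:k + 1 + radius]     # at most `radius` later pairs
    return previous, pairs[k], future

# Hypothetical 12-step rollout; the judge is asked about step 7.
rollout = [(f"action_{t}", f"obs_{t}") for t in range(12)]
prev, cur, fut = local_window(rollout, k=7)
```

Near the start or end of a rollout the window simply shrinks, so the judge sees fewer than five pairs on that side.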

Let $A_i^{\mathrm{GRPO}}$ be the outcome advantage for trajectory $i$. For segment $a_{i,k}$, TRIAGE defines

$$A_{i,k}^{\mathrm{TRIAGE}} = A_i^{\mathrm{GRPO}} + \lambda\, c_{\hat{\rho}_{i,k}}, \tag{2}$$

where $\hat{\rho}_{i,k}$ is the role assigned by the judge, $c_{\hat{\rho}_{i,k}}$ is a fixed process reward for that role, and $\lambda$ controls how strongly this local signal is mixed into the GRPO advantage. The auxiliary judge scores are used only to help choose the role label and do not otherwise enter the training-time advantage.

A simple instantiation sets

$$(c_D, c_E, c_N, c_R) = (1, 0.5, -0.1, -0.5). \tag{3}$$

Thus decisive progress receives a unit process reward, useful exploration receives a smaller positive reward, no-progress infrastructure receives only a small step cost, and regression receives a larger local penalty even if the trajectory succeeds. This scale follows the usual agent-RL convention that task progress is around $+1$, harmless inefficiency receives a mild penalty around $-0.1$, and clearly unhelpful actions receive a stronger negative reward. This keeps the main comparison close to GRPO: the dominant signal is still the outcome advantage, while role typing adds only a bounded segment-level process reward.

Unless otherwise stated, we use $\lambda = 0.4$ for Search-QA and $\lambda = 0.2$ for the other two environments, keeping $(c_D, c_E, c_N, c_R) = (1, 0.5, -0.1, -0.5)$ fixed across tasks. The role constants are never tuned; the only tuned hyperparameter is $\lambda$, selected on the training split by training success rate with the test set held out for final evaluation. The $\lambda \times |c_R|$ grids in Appendix F are post-hoc sensitivity analyses and were not used to choose $\lambda$.
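The role-conditioned advantage of Eq. (2) with the constants of Eq. (3) can be sketched in a few lines. The names `ROLE_CONSTANTS` and `triage_advantages` are hypothetical, and the outcome advantages of $\pm 1$ and the role sequences are made-up inputs chosen to show the two conflict cells.

```python
# Fixed role constants from Eq. (3): decisive, exploration, no-progress, regression.
ROLE_CONSTANTS = {"D": 1.0, "E": 0.5, "N": -0.1, "R": -0.5}

def triage_advantages(a_grpo, roles, lam=0.2):
    """Eq. (2): add lam * c_role to the trajectory-level advantage, per segment."""
    return [a_grpo + lam * ROLE_CONSTANTS[role] for role in roles]

# Successful rollout (positive outcome advantage) that contains one regression.
success = triage_advantages(1.0, ["E", "R", "D"])
# Failed rollout (negative outcome advantage) that contains useful exploration.
failure = triage_advantages(-1.0, ["E", "N", "R"])
```

In this toy case the regressive segment in the successful rollout still has a positive advantage, but a smaller one than its neighbors, and the exploratory segment in the failed rollout is punished less than the regression. The outcome sign stays dominant because the correction is bounded by $\lambda \max_\rho |c_\rho|$.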

For stability, the resulting segment advantages are whitened within each batch before being broadcast to segment tokens:

$$\tilde{A}_{i,k}^{\mathrm{TRIAGE}} = \frac{A_{i,k}^{\mathrm{TRIAGE}} - \mu_{\mathcal{B}}}{\sigma_{\mathcal{B}} + \epsilon}, \tag{4}$$

where $\mu_{\mathcal{B}}$ and $\sigma_{\mathcal{B}}$ are the mean and standard deviation of the segment advantages in batch $\mathcal{B}$. The policy update is the usual clipped GRPO objective with $\tilde{A}_{i,k}^{\mathrm{TRIAGE}}$ assigned to tokens belonging to segment $k$. In the evaluated environments, a segment coincides with the standard environment step used in prior agent-RL work: one admissible ALFWorld command, one WebShop search[...] or click[...] action, or one Search-QA $\langle$search$\rangle$ query or final $\langle$answer$\rangle$ submission. The segment advantage is applied only to generated tokens in the corresponding environment-facing turn; prompt and observation tokens are excluded from the policy loss.
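The whitening and token-level broadcast can be sketched as follows. This is an illustration under assumptions: the helper names, the population standard deviation, the value of `eps`, and the representation of excluded tokens as a zero advantage (a real implementation would more likely use a loss mask) are choices made here, not details taken from the paper.

```python
from statistics import mean, pstdev

def whiten(values, eps=1e-6):
    """Batch whitening of segment advantages, as in Eq. (4)."""
    mu, sigma = mean(values), pstdev(values)  # assumption: population std
    return [(v - mu) / (sigma + eps) for v in values]

def broadcast_to_tokens(segment_ids, is_generated, seg_adv):
    """Per-token advantages: seg_adv[k] on generated tokens of segment k, 0.0 elsewhere."""
    return [seg_adv[k] if gen else 0.0 for k, gen in zip(segment_ids, is_generated)]

# Hypothetical batch with four segments and eight tokens.
seg_adv = whiten([1.1, 0.9, 1.2, -0.9])
segment_ids = [0, 0, 0, 1, 1, 2, 2, 3]            # segment index of each token
is_generated = [False, True, True, True, False,   # False marks prompt or
                True, True, True]                 # observation tokens
token_adv = broadcast_to_tokens(segment_ids, is_generated, seg_adv)
```

Every generated token in a segment shares that segment's whitened advantage, while prompt and observation tokens contribute nothing to the policy loss.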

Training procedure.

In each GRPO batch, we first compute the usual trajectory advantage $A_i^{\mathrm{GRPO}}$. We then split each rollout into environment-facing action segments and ask the role judge for the segment role and auxiliary scores $(q, u, h, b)$. The role-conditioned process reward is added to the GRPO advantage, the resulting segment advantages are normalized within the batch, and each normalized value is broadcast to the tokens in that segment before the standard clipped GRPO update. No judge is used at evaluation time.

4.1 Theoretical Justification: Role Conditioning as an Optimal Projection

We give a justification, not a guarantee: under a stated sufficiency assumption, role-conditioned credit is the best segment-level correction expressible from role labels alone, and the fixed constants used by TRIAGE inherit a strictly smaller estimation error than uniform broadcasting whenever aligned with this optimum. We connect this to lower-variance policy gradients and flag where the assumption fails in Appendix B; all proofs are in Appendix A.

Setup.

Let $A_{i,k}^{*}$ denote the (unobserved) oracle per-segment advantage and let $A_i^{\mathrm{GRPO}}$ be the trajectory advantage that GRPO broadcasts to every segment. Define the credit residual

$$\delta_{i,k} \triangleq A_{i,k}^{*} - A_i^{\mathrm{GRPO}}, \tag{5}$$

the within-trajectory variation in true credit that uniform broadcasting discards. A segment-level estimator that adds a correction $g$ to $A_i^{\mathrm{GRPO}}$ incurs squared error $\mathbb{E}\big[(A_i^{\mathrm{GRPO}} + g - A_{i,k}^{*})^{2}\big] = \mathbb{E}\big[(g - \delta_{i,k})^{2}\big]$.

Proposition 1 (Optimal role-measurable correction).

Among all corrections $g(\rho)$ that are measurable with respect to the segment role $\rho_{i,k}$, the minimizer of the segment-advantage MSE is the conditional expectation of the residual,

$$g^{\star}(\rho) = \mathbb{E}\big[\delta_{i,k} \,\big|\, \rho_{i,k} = \rho\big], \tag{6}$$

and the resulting MSE reduction relative to GRPO is

$$\mathrm{MSE}^{\mathrm{GRPO}} - \mathrm{MSE}^{g^{\star}} = \mathbb{E}\Big[\big(\mathbb{E}[\delta_{i,k} \mid \rho_{i,k}]\big)^{2}\Big] \;\geq\; 0. \tag{7}$$

Proposition 1 formalizes the paper’s central claim: role labels help exactly to the extent that they explain nonzero credit residual, i.e., whenever $\mathbb{E}[\delta \mid \rho] \neq 0$ for some role. The four-role taxonomy is thus an interpretable, coarse discretization of the Bayes-optimal correction $g^{\star}$, with $g^{\star}(R) < 0$ (regression is over-credited by broadcasting) and $g^{\star}(E) > 0$ in failed rollouts (exploration is over-punished). These are precisely the two conflict cells of Table 1.
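For readers who want the intermediate step, the following is a short derivation sketch of Eqs. (6) and (7) using only the tower property. It is not a reproduction of the Appendix A proofs, and it assumes that $\delta$ has a finite second moment. Here $\delta$ abbreviates $\delta_{i,k}$ and $\rho$ abbreviates $\rho_{i,k}$.

```latex
\begin{aligned}
\mathbb{E}\big[(g(\rho)-\delta)^2\big]
  &= \mathbb{E}\big[(g(\rho)-g^{\star}(\rho))^2\big]
   + 2\,\mathbb{E}\big[(g(\rho)-g^{\star}(\rho))\,(g^{\star}(\rho)-\delta)\big]
   + \mathbb{E}\big[(g^{\star}(\rho)-\delta)^2\big] \\
  &= \mathbb{E}\big[(g(\rho)-g^{\star}(\rho))^2\big] + \mathrm{MSE}^{g^{\star}},
\end{aligned}
\qquad\text{since}\qquad
\mathbb{E}\big[g^{\star}(\rho)-\delta \,\big|\, \rho\big] = 0 .
```

The first term is nonnegative and vanishes only when $g = g^{\star}$ almost surely, which gives Eq. (6). Taking $g \equiv 0$ (uniform broadcasting) gives $\mathrm{MSE}^{\mathrm{GRPO}} = \mathbb{E}\big[g^{\star}(\rho)^2\big] + \mathrm{MSE}^{g^{\star}}$, which rearranges to Eq. (7).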

TRIAGE uses fixed role constants rather than estimating $g^{\star}$. For the correction $\lambda c_{\hat{\rho}}$, the MSE change relative to GRPO is

$$\Delta_{\mathrm{MSE}} = \lambda^{2}\, \mathbb{E}[c_{\hat{\rho}}^{2}] - 2\lambda\, \mathrm{Cov}(c_{\hat{\rho}}, \delta), \tag{8}$$

so any positively aligned role signal reduces error for sufficiently small $\lambda$. This is exactly the desired sign pattern: negative for regression that GRPO over-credits and positive for exploration that GRPO over-punishes. Appendix B gives the full fixed-constant condition, connects the correction to policy-gradient variance, and states the failure modes.
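The small-$\lambda$ argument can be checked numerically on toy data. The sketch below uses made-up residuals (not values from the paper) and hypothetical function names. It evaluates the quadratic with the raw cross moment $\mathbb{E}[c_{\hat{\rho}}\,\delta]$, which coincides with the covariance in Eq. (8) when either $c_{\hat{\rho}}$ or $\delta$ has zero mean, and compares it with the MSE change computed directly.

```python
from statistics import mean

C = {"D": 1.0, "E": 0.5, "N": -0.1, "R": -0.5}  # role constants from Eq. (3)

def delta_mse(lam, roles, residuals):
    """lam^2 * E[c^2] - 2 * lam * E[c * delta], with empirical averages."""
    c = [C[r] for r in roles]
    second_moment = mean(ci * ci for ci in c)
    cross_moment = mean(ci * di for ci, di in zip(c, residuals))
    return lam ** 2 * second_moment - 2 * lam * cross_moment

def direct_mse_change(lam, roles, residuals):
    """MSE of the role-corrected estimator minus MSE of uniform broadcasting."""
    corrected = mean((lam * C[r] - d) ** 2 for r, d in zip(roles, residuals))
    baseline = mean(d ** 2 for d in residuals)
    return corrected - baseline

# Made-up residuals for illustration: over-credited regression has a negative
# residual; under-credited exploration and decisive progress have positive ones.
roles = ["D", "E", "E", "N", "R", "R"]
residuals = [0.6, 0.3, 0.2, 0.0, -0.5, -0.4]

small = delta_mse(0.2, roles, residuals)   # aligned signal, small lambda
large = delta_mse(2.0, roles, residuals)   # same signal, lambda too large
```

On this toy data the change is negative at $\lambda = 0.2$ and positive at $\lambda = 2$: an aligned role signal helps only while $\lambda$ stays below $2\,\mathbb{E}[c_{\hat{\rho}}\,\delta] / \mathbb{E}[c_{\hat{\rho}}^{2}]$, which is one reading of "sufficiently small".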

5 Experiments

We design experiments to test role-aware credit rather than merely final performance. The central empirical question is whether TRIAGE preserves useful exploration while suppressing no-progress and regression.

5.1 Experimental Setup

Environments.

We evaluate on three families of agentic tasks. ALFWorld tests embodied household planning with templated actions (Shridhar et al., 2021). Search-QA tests multi-turn retrieval and answer generation, where query formulation and evidence gathering are exploratory (Dunn et al., 2017). WebShop tests product search and purchase (Yao et al., 2022), where search/filter actions are exploratory and the purchase action is decisive.

Models and training.

We evaluate Qwen2.5-7B-Instruct and Qwen3-1.7B-Instruct as deployable student policies for all three environments (Yang et al., 2024). Training uses GRPO with $G$ rollouts per prompt, implemented on top of the verl framework (Sheng et al., 2025). TRIAGE uses the same rollouts and verifier rewards as GRPO, plus cached role labels from an LLM judge. All final evaluations use the unaided deployment policy without judge calls. For ALFWorld and WebShop, we repeat training and evaluation with ten independent runs and report mean ± sample standard deviation. Search-QA runs are substantially more expensive because each optimization step requires large-model rollout with multi-turn retrieval and verifier evaluation, so Search-QA results are reported from a single run under the same fixed training configuration; consequently, Search-QA entries in the tables do not include a standard deviation.

5.2 Main Results

Table 3: Main results: success rate (%). ALFWorld and WebShop entries are mean ± sample standard deviation over ten independent training-and-evaluation runs. Search-QA is reported as a single run because the retrieval-augmented rollout loop makes repeated full training runs substantially more expensive. The “no evidence” rows use the same Qwen3-8B-thinking judge but with a prompt that requests only the role label and does not ask for a per-segment evidence string (see Appendix H for the full default prompt, which does require evidence).
| Model | Method | ALFWorld | Search-QA | WebShop |
|---|---|---|---|---|
| Qwen2.5-7B-Instruct | GRPO | 79.6 ± 1.9 | 43.3 | 70.1 ± 2.3 |
| | TRIAGE w/ Qwen3-8B no-think judge | 76.8 ± 2.1 | 45.0 | 65.4 ± 2.8 |
| | TRIAGE w/ Qwen3-8B-thinking judge, no evidence | 83.1 ± 2.6 | 46.4 | 73.5 ± 2.5 |
| | TRIAGE w/ Qwen3-8B-thinking judge | 87.5 ± 2.4 | 48.1 | 77.2 ± 2.2 |
| Qwen3-1.7B-Instruct | GRPO | 45.2 ± 2.5 | 39.4 | 37.5 ± 2.8 |
| | TRIAGE w/ Qwen3-8B no-think judge | 40.7 ± 2.1 | 40.2 | 35.1 ± 2.2 |
| | TRIAGE w/ Qwen3-8B-thinking judge, no evidence | 51.8 ± 1.8 | 41.1 | 49.6 ± 1.9 |
| | TRIAGE w/ Qwen3-8B-thinking judge | 56.4 ± 1.2 | 42.3 | 55.9 ± 1.5 |

Figure 1 summarizes the main comparison, and Table 3 reports the underlying numbers. With the default Qwen3-8B-thinking judge, TRIAGE improves over GRPO on all three benchmarks for both policies, with the largest gains on ALFWorld and WebShop—the two audited environments with the highest regression mass (48% and 43%; Appendix G). The Search-QA gain is smaller but consistent, matching its more exploration-dominated, lower-regression profile. This pattern is what role-conditioned credit predicts: most of the benefit comes from withholding positive credit from regressive segments that vanilla GRPO reinforces whenever the trajectory happens to succeed.

The comparison also shows that the benefit depends on judge reliability rather than on simply adding a dense reward. Substituting the Qwen3-8B no-think judge, which collapses on the $R$-in-success cell (Table 4), drives TRIAGE below the GRPO baseline on ALFWorld and WebShop for both policies, confirming that the gains stem from accurate role typing and not from the extra reward term alone. Removing the evidence requirement (“no evidence” rows) keeps TRIAGE above GRPO but consistently trails the default prompt, so thinking is necessary for the hard $R$-in-success cell while structured evidence acts as a low-cost calibration knob on top of it.

5.3 Does the Judge Recover the Conflict Cells?

Because TRIAGE relies on a role judge, we audit whether the judge recovers local segment roles rather than simply echoing the final outcome. Two annotators independently label 135 environment-facing segments from 18 logged trajectories (3 ALFWorld, 3 WebShop, 12 Search-QA), reaching 88.1% raw agreement; disagreements are adjudicated by a senior annotator and used as ground truth. The prompt, labels, and examples are in Appendix H.

Table 4 reports binary F1 by role–outcome cell, focusing on the two conflict cells: $R$ inside successful rollouts and $E$ inside failed rollouts. We omit $D$ in failed rollouts because it has zero support in this labeled set.

Table 4: Qwen3 role judge F1 (%) across 135 labeled segments, split by hand-labeled role and trajectory outcome. Column counts give the number of positive examples in each cell.
| Config | $R$ in success rollouts ($n=35$) | $R$ in failed rollouts ($n=20$) | $E$ in success rollouts ($n=29$) | $E$ in failed rollouts ($n=21$) | $D$ in success rollouts ($n=25$) |
|---|---|---|---|---|---|
| 8B no-think | 29.2 | 81.1 | 56.1 | 90.0 | 55.6 |
| 8B think | 86.1 | 91.9 | 78.7 | 95.2 | 65.1 |
| 14B no-think | 5.7 | 80.0 | 54.0 | 90.9 | 62.9 |
| 14B think | 72.7 | 86.5 | 70.8 | 90.9 | 56.4 |
| 32B no-think | 35.9 | 74.3 | 56.4 | 82.6 | 73.7 |
| 32B think | 88.6 | 83.3 | 70.8 | 88.9 | 65.1 |

The result supports the two-blind-spot framing. Thinking is not uniformly useful; its large effect is concentrated in $R$-in-success, where it raises F1 from roughly 24 to 82 averaged over model sizes. The easy cell is $E$-in-failure (F1 $> 82$ even without thinking); the hard cell is finding regression exactly where the verifier says the rollout succeeded. Scaling helps less than enabling thinking: 8B-thinking is within three F1 points of 32B-thinking on $R$-in-success at substantially lower inference cost. We therefore use Qwen3-8B with thinking enabled as the default judge.
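A per-cell binary F1 of the kind reported in Table 4 could be computed as in the sketch below. The exact evaluation protocol is not spelled out in this section, so the sketch assumes one plausible reading: restrict to segments from rollouts with the given outcome, treat the target role as the positive class, and compare the judge's label with the adjudicated label. The function name `cell_f1` and the toy records are hypothetical.

```python
def cell_f1(records, role, outcome):
    """Binary F1 for `role`, restricted to rollouts with the given outcome.

    records: iterable of (gold_role, predicted_role, rollout_succeeded) tuples.
    """
    subset = [(g, p) for g, p, ok in records if ok == outcome]
    tp = sum(1 for g, p in subset if g == role and p == role)
    fp = sum(1 for g, p in subset if g != role and p == role)
    fn = sum(1 for g, p in subset if g == role and p != role)
    if tp == 0:
        return 0.0  # no true positives: F1 is zero (or undefined), report 0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels for illustration only; not the 135 audited segments.
records = [
    ("R", "R", True), ("R", "D", True), ("E", "E", True), ("D", "R", True),
    ("E", "E", False), ("R", "R", False), ("E", "N", False),
]
f1_r_success = cell_f1(records, "R", True)
f1_e_failed = cell_f1(records, "E", False)
```

Splitting by outcome before scoring is what exposes a judge that echoes the verifier: such a judge rarely predicts $R$ inside successful rollouts, so its recall in that cell collapses even if its overall accuracy looks acceptable.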

5.4 Comparisons and Ablations

All comparisons and ablations in this section use Qwen2.5-7B-Instruct. We organize the analysis around three questions: how TRIAGE compares with stronger credit-assignment baselines, whether role typing adds value beyond generic dense process rewards, and whether the trained policy exhibits the intended behavioral changes.

External credit-assignment baselines.

Table 5 situates TRIAGE against stronger credit-assignment baselines reproduced under an identical protocol: PPO with a learned critic, GiGPO, which assigns step-level credit by grouping actions from recurring states (Feng et al., 2025), and a shared-backbone value baseline that learns a dense per-segment signal from the same verifier rewards. TRIAGE improves over PPO on all three benchmarks without a separate value network. Relative to GiGPO, TRIAGE is higher on WebShop and statistically tied on ALFWorld, while GiGPO does not apply to Search-QA because its state grouping degenerates when per-step states almost never recur. Relative to the value baseline, TRIAGE tests the central claim of the paper: dense segment credit alone is not enough when productive and regressive actions have similar outcome-trained values, and the missing information is the segment’s semantic role. The key difference is signal source: GiGPO derives micro-advantages structurally from recurring states, the value baseline derives them statistically from outcome regression, and TRIAGE derives them semantically from role labels—targeting the conflict cells that role-agnostic dense signals cannot resolve.

Table 5: Comparison with stronger credit-assignment baselines on Qwen2.5-7B-Instruct: success rate (%). All methods are our own runs under an identical protocol (see Appendix D for the shared-backbone value baseline). The GRPO and TRIAGE rows are repeated from Table 3 for reference. GiGPO’s Search-QA entry is left blank because its step-level state grouping degenerates to episode-level GRPO when per-step states embed retrieved documents that almost never recur across rollouts.
| Method | ALFWorld | Search-QA | WebShop |
|---|---|---|---|
| GRPO | 79.6 ± 1.9 | 43.3 | 70.1 ± 2.3 |
| PPO | 81.7 ± 2.1 | 45.3 | 71.5 ± 2.0 |
| GiGPO | 87.8 ± 2.2 | | 74.3 ± 2.9 |
| Shared-backbone value baseline (App. D) | 85.2 ± 2.7 | 46.8 | 70.8 ± 3.7 |
| TRIAGE w/ Qwen3-8B-thinking judge | 87.5 ± 2.4 | **48.1** | **77.2 ± 2.2** |

The shared-backbone value baseline improves over GRPO on the two longer-rollout environments (ALFWorld $79.6 \rightarrow 85.2$, $+5.6$; Search-QA $43.3 \rightarrow 46.8$, $+3.5$), confirming that a learned dense per-segment baseline trained on the same verifier reward is a meaningful upgrade over uniform broadcast. On WebShop, however, it barely moves ($70.1 \rightarrow 70.8$, within run-to-run variance), while TRIAGE reaches $77.2$. The reason is structural: WebShop regressions are repeated clicks of an already-selected attribute that leave the observation almost unchanged, so an outcome-trained value head cannot separate the productive click from its redundant repeat, whereas the role classifier reads the action history and labels the repeat $R$. Appendix D gives the full analysis.

Role-reward ablations.

We also include a scalar process-reward baseline to separate the value of role typing from the value of adding any judge-derived dense reward. This baseline uses the same Qwen3-8B-thinking judge and the same local context window as TRIAGE, but asks for a single progress score $s_{i,k} \in [-1, 1]$ rather than a discrete role. We add this score to the GRPO advantage as

$$A_{i,k} = A_i^{\mathrm{GRPO}} + \lambda\, s_{i,k}, \tag{9}$$

and apply the same batch whitening as TRIAGE. This controls for judge access, local context, and dense reward shaping while removing role-conditioned credit rules. Thus the comparison isolates whether the advantage comes from a generic process reward or from the role-specific mapping that treats exploration, no-progress infrastructure, and regression differently.
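As a rough illustration of Eq. (9), the sketch below adds a judge score to a broadcast trajectory advantage and then whitens over the batch. The function name, the toy inputs, and the whitening form (subtract the batch mean, divide by the batch standard deviation plus a small `eps`) are assumptions made for illustration; the normalization actually used is the one defined in Eq. (4).

```python
from statistics import mean, pstdev


def scalar_process_advantages(grpo_adv, scores, lam=0.2, eps=1e-8):
    """Sketch of Eq. (9) followed by batch whitening (assumed form).

    grpo_adv[i]  : outcome advantage of trajectory i (shared by its segments)
    scores[i][k] : judge progress score s_{i,k}, expected to lie in [-1, 1]
    """
    raw = []
    for a_i, traj_scores in zip(grpo_adv, scores):
        for s in traj_scores:
            if not -1.0 <= s <= 1.0:
                raise ValueError("progress score outside [-1, 1]")
        raw.append([a_i + lam * s for s in traj_scores])

    # Whiten over every segment in the batch (assumption: mean/std normalization).
    flat = [x for traj in raw for x in traj]
    mu, sigma = mean(flat), pstdev(flat)
    return [[(x - mu) / (sigma + eps) for x in traj] for traj in raw]


# Toy batch: one successful and one failed trajectory, two segments each.
advantages = scalar_process_advantages([1.0, -1.0], [[0.5, -0.5], [1.0, 0.0]])
```

Within each trajectory the segment with the higher judge score keeps the higher advantage, while the outcome term still separates the successful trajectory from the failed one.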

Table 6: Ablation results on Qwen2.5-7B-Instruct: success rate (%). We test role-reward components, focusing on the frequent/high-impact failure modes: regression and exploration.
| Ablation | Change | ALFWorld | Search-QA | WebShop |
|---|---|---|---|---|
| Raw GRPO | no role judge or process reward | $79.6 \pm 1.9$ | $43.3$ | $70.1 \pm 2.3$ |
| Scalar process reward | 8B-thinking judge, no role typing | $84.8 \pm 2.8$ | $45.9$ | $72.1 \pm 2.8$ |
| No regression penalty | set $c_{R}=0$ | $81.4 \pm 2.6$ | $46.7$ | $73.1 \pm 2.3$ |
| No exploration bonus | set $c_{E}=0$ | $85.8 \pm 2.4$ | $47.5$ | $75.5 \pm 2.2$ |
| TRIAGE | none | $\mathbf{87.5 \pm 2.4}$ | $\mathbf{48.1}$ | $\mathbf{77.2 \pm 2.2}$ |

Table 6 isolates the two role-reward components and the role-typing effect itself. The scalar process-reward baseline improves over GRPO, confirming that dense segment feedback is useful, but it remains below TRIAGE on every benchmark. Removing either role component further degrades TRIAGE, so the gain is not an artifact of simply adding a dense reward from the same judge. The regression penalty ($c_{R}$) is the dominant contributor: zeroing it costs $1.4$ to $6.1$ points across benchmarks (Table 6) and leaves ALFWorld and WebShop only marginally above raw GRPO. The exploration bonus ($c_{E}$) provides a smaller but consistently positive top-up ($0.6$ to $1.7$ points). This ordering matches the role audit: ALFWorld and WebShop carry regression mass of $\approx 48\%$ and $\approx 43\%$ (Appendix G), so most of TRIAGE’s gain comes from suppressing $R$ credit inside successful trajectories. Consistent with this mechanism, TRIAGE also reduces completed-rollout length by $10.4\%$ and $14.8\%$ relative to GRPO on the two environments (Appendix E). TRIAGE is stable to the role-constant scale and $\lambda$ within a reasonable range (Appendix F).

6 Discussion and Limitations

Limitations.

Role labels are semantic estimates, not ground truth. A judge can overvalue plausible exploration, miss subtle regressions, or rely too much on final outcomes. TRIAGE mitigates this by using the judge only for structured role diagnosis and keeping verifier outcomes as the base optimization signal, but it does not remove judge error.

Role usefulness is also context-dependent. The same search, read, or test command can be informative once and redundant later, so the classifier must condition on local state and redundancy rather than action strings alone. Finally, role-aware credit is not causal identification: it improves local attribution, but counterfactual environment interventions would be needed to prove that a segment was necessary.

Future work.

This paper uses one primary role per segment to keep the signal auditable. A natural extension is a soft role distribution, e.g., $(p_{D},p_{E},p_{N},p_{R})$, with credit computed as an expectation under role-specific constants. This could better represent mixed segments, such as a search that reveals useful evidence while also introducing distractors, but it would require reliable calibration and stronger audit procedures.
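One way to read the soft-role extension is as an expectation of the fixed role constants under the judge's role distribution. The sketch below is only an illustration of that idea: the helper name and the example probabilities are hypothetical, and the constants are the $(c_{D},c_{E},c_{N},c_{R})$ values reported in the appendices.

```python
# Role constants (c_D, c_E, c_N, c_R) as reported in the appendices.
ROLE_CONSTANTS = {"D": 1.0, "E": 0.5, "N": -0.1, "R": -0.5}


def expected_role_credit(role_probs, constants=ROLE_CONSTANTS, tol=1e-6):
    """Expected credit sum_r p_r * c_r for one segment (hypothetical helper)."""
    if set(role_probs) != set(constants):
        raise ValueError("role_probs must cover exactly the roles D, E, N, R")
    if any(p < 0 for p in role_probs.values()):
        raise ValueError("probabilities must be non-negative")
    if abs(sum(role_probs.values()) - 1.0) > tol:
        raise ValueError("probabilities must sum to 1")
    return sum(role_probs[r] * constants[r] for r in constants)


# A mixed segment: mostly useful exploration, with some risk of regression.
mixed = expected_role_credit({"D": 0.1, "E": 0.6, "N": 0.1, "R": 0.2})
# A one-hot distribution recovers the hard-label constant.
hard = expected_role_credit({"D": 0.0, "E": 0.0, "N": 0.0, "R": 1.0})
```

A one-hot distribution reduces to the single-role rule, so the hard-label scheme is a special case of this expectation.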

TRIAGE is also compatible with segment bucketing and outcome-statistical estimators. Bucketing can decide which segments share statistical evidence, while role labels decide how that evidence should be interpreted. Combining the two is a promising direction for domains where exact action arguments are sparse and repeated segments are rare.

Finally, the discrete four-role label is only the first layer of role-aware judging. On harder tasks or stronger base agents, obvious loops, wrong purchases, and repeated inspections become rare, and the credit problem shifts from detecting coarse failures to estimating how much each segment advances the task or belief state. In that regime the same framework can use a stronger judge to assign finer-grained process rewards within each role rather than a single discrete label.

7 Related Work

Table 7: Where TRIAGE sits among agentic credit-assignment methods. Expl. $\neq$ no-prog.: separates useful exploration from harmless no-progress; Regr. in success: can withhold credit from regressive steps inside successful rollouts; No state match: works without recurring or matchable states.
Method family Granularity Expl. $\neq$ no-prog. Regr. in success No state match
Outcome / group RL trajectory
GiGPO (Feng et al., 2025) step partial
Step / process rewards (Wang et al., 2025; Lightman et al., 2024) step
TRIAGE (ours) step

Agentic credit assignment.

Agentic RL requires assigning credit across environment-facing decisions rather than only across tokens. Table 7 summarizes the closest design choices. State-anchored methods such as GiGPO compare actions taken from matched states (Feng et al., 2025); stepwise progress and process-reward methods learn scalar dense scores for intermediate steps (Wang et al., 2025; Lightman et al., 2024). TRIAGE is complementary: it keeps the outcome advantage but adds a semantic role label, so the update can distinguish useful exploration from no-progress behavior and regression from ordinary low progress.

Process reward models and LLM judges.

Process reward models provide dense supervision by scoring intermediate reasoning or agent steps (Lightman et al., 2024). LLM-as-judge methods can evaluate generated outputs, critique trajectories, or assign rubric scores (Shinn et al., 2023; Madaan et al., 2023; Fang et al., 2026). Unstructured process scores can be brittle: they may punish correct actions in failed trajectories, over-credit plausible narration, or conflate exploration with lack of progress. TRIAGE uses the judge more narrowly as a structured classifier over segment roles. This reduces the burden on the judge and makes the resulting signal easier to audit.

Exploration in language agents.

Language agents often rely on information-gathering actions such as search, inspect, read, and test execution (Yao et al., 2023b; Schick et al., 2023). Related prompting and self-improvement methods also exploit multiple sampled reasoning paths, search trees, or self-generated rationales to expose useful intermediate information (Wang et al., 2023; Yao et al., 2023a; Zelikman et al., 2022). These actions change the agent’s belief state rather than immediately completing the task. In sparse-reward RL, such actions are easy to misclassify as neutral or wasteful. TRIAGE makes belief-state progress an explicit credit category, allowing training to preserve useful exploration while still suppressing redundant or irrelevant exploration.

On-policy distillation and token weighting.

On-policy distillation and token-importance methods refine supervision on sampled trajectories (Xu et al., 2026b, a; Agarwal et al., 2024). These methods mostly operate at token or response granularity. TRIAGE operates at the agentic segment level and can be applied to either RL advantages or distillation losses: role labels can gate which action turns receive strong distillation or reinforcement.

8 Conclusion

We argued that agentic credit assignment requires distinguishing what role each environment-facing segment plays. The key missing distinction is that exploration is not no-progress: an action can improve the agent’s belief state without immediately completing a subgoal. TRIAGE operationalizes this idea with a structured role judge and role-conditioned credit rules, keeping the GRPO outcome advantage as the optimization direction while adding a bounded, role-typed correction. Across ALFWorld, Search-QA, and WebShop, this lifts success rates over GRPO for two policy models (by up to $7.9$ points on Qwen2.5-7B and $18.4$ on Qwen3-1.7B) and shortens completed rollouts by $10.4\%$ to $14.8\%$, with ablations and a manual role audit confirming that suppressing regression inside successful trajectories is the dominant source of the gain. Theoretically, role-conditioned credit is the MSE-optimal correction expressible from role labels alone, so the benefit is tied directly to judge reliability, which our audit measures rather than assumes. By reinforcing decisive progress, preserving useful exploration, damping no-progress infrastructure, and suppressing regression, TRIAGE offers a principled path toward sparse-reward RL for agents whose success depends on information gathering and recovery.

References

  • R. Agarwal, N. Vieillard, Y. Zhou, P. Stanczyk, S. Ramos Garea, M. Geist, and O. Bachem (2024) On-policy distillation of language models: learning from self-generated mistakes. In International Conference on Learning Representations, Cited by: §7.
  • DeepSeek-AI, D. Guo, et al. (2025) DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. Nature 645, pp. 633–638. Cited by: §1.
  • M. Dunn, L. Sagun, M. Higgins, V. U. Guney, V. Cirik, and K. Cho (2017) SearchQA: a new Q&A dataset augmented with context from a search engine. arXiv preprint arXiv:1704.05179. Cited by: §5.1.
  • J. Fang, Z. Hong, M. Zheng, M. Song, G. Li, H. Jiang, D. Zhang, H. Guo, X. Wang, and T. Chua (2026) Rubric-based on-policy distillation. arXiv preprint arXiv:2605.07396. Cited by: §7.
  • L. Feng, Z. Xue, T. Liu, and B. An (2025) Group-in-group policy optimization for LLM agent training. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §5.4, §7, Table 7.
  • E. Greensmith, P. L. Bartlett, and J. Baxter (2004) Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research 5, pp. 1471–1530. Cited by: Appendix B.
  • H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe (2024) Let’s verify step by step. In International Conference on Learning Representations, Cited by: §2, §7, §7, Table 7.
  • Z. Lu, Z. Yao, Z. Han, Z. Wang, J. Wu, Q. Gu, X. Cai, W. Lu, J. Xiao, Y. Zhuang, and Y. Shen (2026) Self-distilled agentic reinforcement learning. arXiv preprint arXiv:2605.15155. Cited by: §1.
  • A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, S. Gupta, B. P. Majumder, K. Hermann, S. Welleck, A. Yazdanbakhsh, and P. Clark (2023) Self-refine: iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651. Cited by: §7.
  • T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, and T. Scialom (2023) Toolformer: language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761. Cited by: §7.
  • J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel (2016) High-dimensional continuous control using generalized advantage estimation. In International Conference on Learning Representations (ICLR), Cited by: Appendix B.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: Appendix D.
  • Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. K. Li, Y. Wu, and D. Guo (2024) DeepSeekMath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300. Cited by: §1.
  • G. Sheng, C. Zhang, Z. Ye, X. Wu, W. Zhang, R. Zhang, Y. Peng, H. Lin, and C. Wu (2025) HybridFlow: a flexible and efficient RLHF framework. In Proceedings of the Twentieth European Conference on Computer Systems (EuroSys), Cited by: §5.1.
  • N. Shinn, F. Cassano, E. Berman, A. Gopinath, K. Narasimhan, and S. Yao (2023) Reflexion: language agents with verbal reinforcement learning. arXiv preprint arXiv:2303.11366. Cited by: §7.
  • M. Shridhar, X. Yuan, M. Côté, Y. Bisk, A. Trischler, and M. Hausknecht (2021) ALFWorld: aligning text and embodied environments for interactive learning. In International Conference on Learning Representations, Cited by: §5.1.
  • L. Trung, X. Zhang, Z. Jie, P. Sun, X. Jin, and H. Li (2024) ReFT: reasoning with reinforced fine-tuning. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, pp. 7601–7614. Cited by: §1.
  • H. Wang, C. T. Leong, J. Wang, J. Wang, and W. Li (2025) SPA-RL: reinforcing LLM agents via stepwise progress attribution. arXiv preprint arXiv:2505.20732. Cited by: §1, §7, Table 7.
  • P. Wang, L. Li, Z. Shao, R. Xu, D. Dai, Y. Li, D. Chen, Y. Wu, and Z. Sui (2024) Math-Shepherd: verify and reinforce LLMs step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), Cited by: Appendix D.
  • X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou (2023) Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations, Cited by: §7.
  • Y. Xu, H. Sang, Z. Zhou, R. He, Z. Wang, and A. Geramifard (2026a) Beyond GRPO and on-policy distillation: an empirical sparse-to-dense reward principle for language-model post-training. arXiv preprint arXiv:2605.12483. Cited by: §1, §7.
  • Y. Xu, H. Sang, Z. Zhou, R. He, Z. Wang, and A. Geramifard (2026b) TIP: token importance in on-policy distillation. arXiv preprint arXiv:2604.14084. Cited by: §1, §7.
  • A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Tang, T. Xia, X. Ren, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Wan, Y. Liu, Z. Cui, Z. Zhang, and Z. Qiu (2024) Qwen2.5 technical report. arXiv preprint arXiv:2412.15115. Cited by: §5.1.
  • S. Yao, H. Chen, J. Yang, and K. Narasimhan (2022) WebShop: towards scalable real-world web interaction with grounded language agents. In Advances in Neural Information Processing Systems, Cited by: §5.1.
  • S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan (2023a) Tree of thoughts: deliberate problem solving with large language models. In Advances in Neural Information Processing Systems, Cited by: §7.
  • S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao (2023b) ReAct: synergizing reasoning and acting in language models. In International Conference on Learning Representations, Cited by: §7.
  • Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, W. Dai, T. Fan, G. Liu, L. Liu, X. Liu, H. Lin, Z. Lin, B. Ma, G. Sheng, Y. Tong, C. Zhang, M. Zhang, W. Zhang, H. Zhu, J. Zhu, J. Chen, J. Chen, C. Wang, H. Yu, Y. Song, X. Wei, H. Zhou, J. Liu, W. Ma, Y. Zhang, L. Yan, M. Qiao, Y. Wu, and M. Wang (2025) DAPO: an open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476. Cited by: §1.
  • E. Zelikman, Y. Wu, J. Mu, and N. D. Goodman (2022) STaR: bootstrapping reasoning with reasoning. arXiv preprint arXiv:2203.14465. Cited by: §7.

Appendix A Additional Theory and Proofs

Proof of Proposition 1.

Minimizing $\mathbb{E}[(g(\rho)-\delta)^{2}]$ over all $\rho$-measurable $g$ is an $L_{2}$ projection of $\delta$ onto the subspace of $\rho$-measurable functions; the minimizer is the conditional expectation $g^{\star}(\rho)=\mathbb{E}[\delta\mid\rho]$. Uniform GRPO is the special case $g\equiv 0$, with MSE $\mathbb{E}[\delta^{2}]$. By the law of total variance, $\mathbb{E}[\delta^{2}]-\mathbb{E}[(\delta-g^{\star})^{2}]=\mathbb{E}[(\mathbb{E}[\delta\mid\rho])^{2}]\geq 0$. ∎
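The identity at the end of the proof can be checked numerically on a small batch. The sketch below uses synthetic residuals $\delta$ and role labels invented purely for illustration (they are not data from the paper), and the helper names are hypothetical; it estimates $g^{\star}$ as the per-role empirical mean and compares the two mean-squared errors.

```python
from collections import defaultdict


def role_conditional_mean(roles, deltas):
    """Empirical E[delta | role]: the role-measurable minimizer g*."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r, d in zip(roles, deltas):
        sums[r] += d
        counts[r] += 1
    return {r: sums[r] / counts[r] for r in sums}


def mse(predictions, targets):
    """Batch mean-squared error."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Synthetic toy batch (illustrative values only).
roles = ["D", "D", "E", "E", "N", "R", "R", "R"]
deltas = [0.9, 0.7, 0.4, 0.2, 0.0, -0.6, -0.8, -0.4]

g_star = role_conditional_mean(roles, deltas)
mse_uniform = mse([0.0] * len(deltas), deltas)      # g = 0, uniform GRPO
mse_role = mse([g_star[r] for r in roles], deltas)  # g = g*
energy = sum(g_star[r] ** 2 for r in roles) / len(roles)  # E[(E[delta | rho])^2]
```

On this batch the gap `mse_uniform - mse_role` equals `energy` up to floating-point error, which is the non-negative reduction stated in the proof.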

Proposition 2 (MSE reduction under fixed constants).

With the fixed role correction $A_{i,k}^{\mathrm{TRIAGE}}=A_{i}^{\mathrm{GRPO}}+\lambda c_{\hat{\rho}_{i,k}}$, the batch MSE satisfies

$$\mathrm{MSE}^{\mathrm{TRIAGE}}=\mathrm{MSE}^{\mathrm{GRPO}}+\lambda^{2}\sigma_{c}^{2}-2\lambda\,\mathrm{Cov}(c_{\hat{\rho}},\,\delta), \tag{10}$$

with $\sigma_{c}^{2}=\mathbb{E}[c_{\hat{\rho}}^{2}]$. TRIAGE strictly reduces MSE iff $\mathrm{Cov}(c_{\hat{\rho}},\delta)>0$ and $0<\lambda<2\,\mathrm{Cov}(c_{\hat{\rho}},\delta)/\sigma_{c}^{2}$, with optimum $\lambda^{\star}=\mathrm{Cov}(c_{\hat{\rho}},\delta)/\sigma_{c}^{2}$ and maximal reduction $\mathrm{Cov}^{2}(c_{\hat{\rho}},\delta)/\sigma_{c}^{2}$.

Proof of Proposition 2.

Expand $(A_{i}^{\mathrm{GRPO}}+\lambda c_{\hat{\rho}}-A^{*})^{2}=(\lambda c_{\hat{\rho}}-\delta)^{2}=\delta^{2}+\lambda^{2}c_{\hat{\rho}}^{2}-2\lambda c_{\hat{\rho}}\delta$ and average over the batch; the correction $\lambda^{2}\sigma_{c}^{2}-2\lambda\,\mathrm{Cov}$ is a convex quadratic in $\lambda$, minimized at $\lambda^{\star}$. ∎
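The last step of the proof can be spelled out by completing the square. In the sketch below, $C$ is shorthand (introduced here, not in the original) for the batch average of $c_{\hat{\rho}}\,\delta$ that the expansion produces, i.e., the term Eq. (10) writes as $\mathrm{Cov}(c_{\hat{\rho}},\delta)$; it coincides with a centered covariance when $c_{\hat{\rho}}$ or $\delta$ has zero batch mean. The derivation assumes $\sigma_{c}^{2}>0$ and $\lambda>0$.

```latex
\begin{align*}
\mathrm{MSE}^{\mathrm{TRIAGE}} - \mathrm{MSE}^{\mathrm{GRPO}}
  &= \lambda^{2}\sigma_{c}^{2} - 2\lambda C \\
  &= \sigma_{c}^{2}\Big(\lambda - \frac{C}{\sigma_{c}^{2}}\Big)^{2}
     - \frac{C^{2}}{\sigma_{c}^{2}}, \\[4pt]
\lambda^{\star} &= \frac{C}{\sigma_{c}^{2}}, \qquad
\min_{\lambda}\big(\lambda^{2}\sigma_{c}^{2} - 2\lambda C\big)
  = -\frac{C^{2}}{\sigma_{c}^{2}}, \\[4pt]
\lambda\big(\lambda\sigma_{c}^{2} - 2C\big) < 0
  &\iff 0 < \lambda < \frac{2C}{\sigma_{c}^{2}}
  \quad (\text{possible only if } C > 0).
\end{align*}
```

The first two lines give the optimum and the maximal reduction; the last line gives the admissible range of $\lambda$ in the proposition.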

Appendix B Extended Theoretical Discussion

This appendix expands the short discussion following Proposition 2: why the fixed constants should align with the residual, how the correction connects to policy-gradient variance, and when the argument fails.

Alignment of fixed constants.

The covariance $\mathrm{Cov}(c_{\hat{\rho}},\delta)$ is maximized when the role constants match the sign pattern of the optimal correction $g^{\star}(\rho)=\mathbb{E}[\delta\mid\rho]$. In the two conflict cells, this means assigning negative credit to $R$ segments inside successful trajectories, which GRPO would otherwise over-credit, and positive credit to useful $E$ segments inside failed trajectories, which GRPO would otherwise over-punish. The constants $(c_{D},c_{E},c_{N},c_{R})=(1,0.5,-0.1,-0.5)$ implement this ordering without per-environment tuning.
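A minimal sketch of how the fixed constants act on the two conflict cells, before any batch whitening. The trajectory advantages of $+1$ and $-1$ are toy values chosen for illustration, and the function name is hypothetical.

```python
# Role constants (c_D, c_E, c_N, c_R) from the text.
ROLE_CONSTANTS = {"D": 1.0, "E": 0.5, "N": -0.1, "R": -0.5}


def role_corrected_advantage(grpo_adv, role, lam=0.2):
    """Raw combined advantage A^GRPO + lambda * c_role (no whitening)."""
    return grpo_adv + lam * ROLE_CONSTANTS[role]


# Conflict cell 1: regression inside a successful rollout is credited less
# than decisive progress in the same rollout.
success_adv = 1.0  # toy value
r_in_success = role_corrected_advantage(success_adv, "R")
d_in_success = role_corrected_advantage(success_adv, "D")

# Conflict cell 2: useful exploration inside a failed rollout is punished less
# than no-progress infrastructure in the same rollout.
failure_adv = -1.0  # toy value
e_in_failure = role_corrected_advantage(failure_adv, "E")
n_in_failure = role_corrected_advantage(failure_adv, "N")
```

With a small $\lambda$ the outcome sign still dominates: the regressive segment in a success keeps a positive raw advantage, only a smaller one.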

From estimation error to policy-gradient variance.

The target of training is policy improvement, not estimation accuracy per se. The bridge is standard: in policy-gradient estimators, adding any action-history–measurable baseline to the advantage leaves the gradient unbiased while changing its variance, and the variance-minimizing baseline is the conditional expectation of the return [Greensmith et al., 2004, Schulman et al., 2016]. Role labels are functions of the local action–observation window, hence admissible baselines; Proposition 1 identifies the role-measurable correction that minimizes residual energy, and Proposition 2 shows the fixed-constant surrogate reduces it whenever aligned. Because TRIAGE additionally whitens within the batch (Eq. 4), only the sign and relative ordering of the correction must be correct—an order-preserving transform of an aligned correction remains aligned (Appendix F).

Remark 3 (Where the assumption fails).

The benefit hinges on role labels capturing a nontrivial share of the credit residual ($\mathbb{E}[\delta\mid\rho]\neq 0$) and on the judge recovering $\rho$ accurately enough to keep $\mathrm{Cov}(c_{\hat{\rho}},\delta)>0$. Both can fail: (i) role is only a partial summary of local credit, so $g^{\star}$ leaves residual error, e.g., the $D/E$ boundary is genuinely ambiguous (Table 4, $D$-in-success F1 $\approx 65$); and (ii) an unreliable judge can drive $\mathrm{Cov}(c_{\hat{\rho}},\delta)\leq 0$, in which case no $\lambda>0$ helps. This is the theoretical counterpart of the no-think judge degrading TRIAGE below GRPO (Table 3) and of the degradation at large $\lambda$ and $|c_{R}|$ (Table 10). We therefore present these results as a justification conditional on judge reliability, which our audit (Section 5, Appendix H) measures directly rather than assumes.

Appendix C Training Hyperparameters

Table 8: Training hyperparameters. Here $\eta$ is the learning rate, $G$ is the number of rollouts per prompt, Steps is the number of optimization steps, $\epsilon$ is the GRPO clip ratio, $\lambda$ is the role-reward mixing coefficient, $\beta$ is unused by TRIAGE, $\alpha_{\mathrm{KL}}$ is the KL coefficient, $c_{\rho}$ denotes role-reward constants, $L_{\mathrm{p}}$ is the maximum prompt length, $L_{\mathrm{r}}$ is the maximum response length, and $B$ is the PPO mini-batch size.
| Method | $\eta$ | $G$ | Steps | $\epsilon$ | $\lambda$ | $\beta$ | $\alpha_{\mathrm{KL}}$ | $c_{\rho}$ | $L_{\mathrm{p}}$ | $L_{\mathrm{r}}$ | $B$ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GRPO | $10^{-6}$ | 8 | 150 | 0.2 | – | – | 0.01 | – | 4096 | 512 | 64 |
| TRIAGE | $10^{-6}$ | 8 | 150 | 0.2 | 0.2-0.4 | – | 0.01 | $(1,0.5,-0.1,-0.5)$ | 4096 | 512 | 64 |

Computational overhead.

TRIAGE adds an LLM judge call per segment during training, which increases per-batch wall-clock time. However, the relevant comparison is not raw compute parity but whether the same compute spent on additional GRPO training yields equivalent gains. In our experiments, the GRPO baseline is already near saturation at 150 steps: extending training to 300 steps yields ALFWorld success below 85% and WebShop below 75%, still short of the TRIAGE results ($87.5$ and $77.2$, respectively). The performance plateau is expected because the credit-assignment bottleneck is structural: broadcasting a single trajectory advantage over 10–30 segments dilutes gradient regardless of how many optimization steps are taken, and more steps cannot fix a noisy per-segment signal.

From a long-rollout perspective, the LLM judge is also structurally advantageous in several respects: (i) credit dilution worsens with trajectory length, so the marginal value of correct per-segment attribution grows with the number of segments; (ii) unlike a learned value critic (as in PPO), the LLM judge generalizes zero-shot across environments without requiring environment-specific training data or reward-model fitting; and (iii) the judge leverages semantic reasoning about task goals, information gain, and state corruption that a scalar critic trained on sparse binary rewards cannot easily acquire. Thus, while the judge adds inference cost, it addresses a qualitatively different bottleneck than the one more training steps would solve.

Appendix D Shared-Backbone Value Baseline

To isolate the contribution of role typing from the contribution of any dense per-segment signal, we compare TRIAGE against a shared-backbone value baseline. This baseline keeps the GRPO policy update but attaches a learned scalar value head to the same policy backbone and trains it on the same on-policy rollouts. The recipe follows the standard actor–critic instantiation used in PPO-style RLHF [Schulman et al., 2017] and the outcome-supervised value learning popularized by Wang et al. [2024], adapted to the agentic segment setting.

Architecture.

The value head $V_{\phi}:\mathbb{R}^{d_{\mathrm{model}}}\to\mathbb{R}$ is a single linear projection on top of the final-layer hidden state of the policy backbone, evaluated at the last token of each segment’s observation. The backbone is shared with the policy and kept frozen throughout training, so only $\phi$ (a few thousand parameters) receives gradients. This avoids a separate critic network and keeps the additional wall-clock cost negligible relative to GRPO.

Labels: no extra annotation required.

We do not collect any process-level labels and do not call an external judge. The value head is supervised on per-segment discounted Monte-Carlo returns derived from the same binary verifier reward GRPO already computes,

$$y_{i,k}=\gamma^{T_{i}-k}\,r_{i},\qquad r_{i}=V(\tau_{i})\in\{0,1\}, \tag{11}$$

where $T_{i}$ is the number of environment-facing segments in trajectory $i$. The head is trained by mean-squared regression $\mathcal{L}_{V}(\phi)=\tfrac{1}{N}\sum_{i,k}\big(V_{\phi}(s_{i,k})-y_{i,k}\big)^{2}$ jointly with each GRPO step on the freshly collected rollouts. This is the same outcome-only supervision Math-Shepherd-style PRMs use, but with the policy backbone shared rather than a separate model fitted on logged data.
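To make Eq. (11) and the regression loss concrete, the following sketch computes the per-segment targets for one trajectory and the mean-squared error against some head outputs. The function names and the toy inputs (including $\gamma=0.5$, picked only so the numbers are easy to read) are illustrative assumptions, not the authors' code.

```python
def segment_return_targets(num_segments, reward, gamma=0.95):
    """Targets y_{i,k} = gamma^(T_i - k) * r_i for k = 1..T_i (Eq. 11)."""
    if reward not in (0, 1):
        raise ValueError("verifier reward is expected to be binary")
    return [gamma ** (num_segments - k) * reward
            for k in range(1, num_segments + 1)]


def value_regression_loss(predictions, targets):
    """Mean-squared error between head outputs and Monte-Carlo targets."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("predictions and targets must be non-empty and aligned")
    return sum((v - y) ** 2 for v, y in zip(predictions, targets)) / len(targets)


# A successful 3-segment trajectory: later segments get targets closer to 1.
targets = segment_return_targets(3, reward=1, gamma=0.5)
# Hypothetical head outputs for the same three segments.
loss = value_regression_loss([0.0, 0.5, 1.0], targets)
```

For a failed trajectory ($r_{i}=0$) every target is zero, so the head receives no within-trajectory contrast from failures; all per-segment variation in the targets comes from the discount on successful rollouts.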

Mixing into GRPO.

At credit-assignment time the head’s per-segment value increment is added to the trajectory advantage and whitened with the same batch statistics as TRIAGE before broadcasting to segment tokens:

$$A_{i,k}=A_{i}^{\mathrm{GRPO}}+\lambda\big(V_{\bar{\phi}}(s_{i,k})-V_{\bar{\phi}}(s_{i,k-1})\big), \tag{12}$$

where $\bar{\phi}$ is an exponential-moving-average copy of $\phi$ used to decouple value updates from policy updates.

Hyperparameters.

GRPO parameters ($\eta$, $G$, optimization steps, clip ratio $\epsilon$, KL coefficient $\alpha_{\mathrm{KL}}$, $L_{\mathrm{p}}$, $L_{\mathrm{r}}$, batch size $B$) are shared with TRIAGE (Table 8). Value-head–specific settings: discount $\gamma=0.95$ for ALFWorld and WebShop and $\gamma=0.9$ for Search-QA (reflecting its shorter answer-terminating rollouts); head learning rate $\eta_{V}=10^{-4}$; 10-step head warmup at $\lambda=0$ so $\phi$ converges to a reasonable baseline before being injected into the policy update; EMA target update rate $\tau=0.99$; per-segment value increment clipped to $[-0.5,0.5]$ to bound early-training noise; mixing coefficient $\lambda$ matched to TRIAGE’s value per benchmark ($\lambda=0.4$ on Search-QA, $\lambda=0.2$ on ALFWorld and WebShop), so any performance difference reflects the source of the dense signal rather than its scale.
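The mixing rule of Eq. (12), together with the increment clipping and EMA target listed above, can be sketched as follows. Several details are assumptions made for illustration: the convention that `values[0]` holds the target head's value before the first segment, the EMA form (target kept with weight $\tau$), and all function names and toy numbers. Batch whitening is omitted here.

```python
def clip(x, lo=-0.5, hi=0.5):
    """Bound a value increment to [lo, hi]."""
    return max(lo, min(hi, x))


def value_mixed_advantages(grpo_adv, values, lam=0.2):
    """Eq. (12) for one trajectory, with the increment clipped to [-0.5, 0.5].

    values[0] is the target head's value before the first segment (an assumed
    convention for s_{i,0}); values[k] is its value after segment k.
    """
    return [grpo_adv + lam * clip(values[k] - values[k - 1])
            for k in range(1, len(values))]


def ema_update(target_params, online_params, tau=0.99):
    """Assumed EMA convention: target <- tau * target + (1 - tau) * online."""
    return [tau * t + (1.0 - tau) * o
            for t, o in zip(target_params, online_params)]


# Toy values: a small gain, a small drop, then a large jump that gets clipped.
mixed = value_mixed_advantages(1.0, [0.1, 0.3, 0.2, 0.9])
```

In this toy trajectory the third increment ($0.7$) is clipped to $0.5$, so the largest per-segment correction is bounded by $0.5\lambda$.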

What this baseline isolates.

Both TRIAGE and the shared-backbone value baseline add a bounded, $\lambda$-scaled dense per-segment correction on top of the same GRPO advantage; both whiten within the batch; both use only labels that the GRPO loop already produces (verifier rewards alone for the value baseline, verifier rewards plus role labels from a small judge for TRIAGE). The remaining methodological difference is the source of the per-segment signal: a learned scalar critic regressing trajectory-level outcomes, versus a semantic role classifier with role-conditioned credit rules. Table 5 shows that the value baseline improves over GRPO on the two longer-rollout environments (ALFWorld $79.6 \rightarrow 85.2$, $+5.6$; Search-QA $43.3 \rightarrow 46.8$, $+3.5$) but barely moves WebShop ($70.1 \rightarrow 70.8$, well inside run-to-run variance), while TRIAGE reaches $87.5$/$48.1$/$77.2$. The per-benchmark gap to TRIAGE ($-2.3$/$-1.3$/$-6.4$) is largest precisely on WebShop, where regressions take the form of re-clicks of an already-selected attribute that leave the visible observation almost unchanged; the value head therefore receives near-identical Monte-Carlo targets for the productive click and its redundant repeat and credits them near-identically, while the role classifier reads the action history and labels the repeat $R$. The pattern is consistent with the intended interpretation: outcome-trained scalar critics capture coarse per-segment progress when the observation actually evolves, but cannot supply role-level distinctions in action spaces where harmful repetitions leave the local state intact.

Appendix E Rollout Efficiency

Because TRIAGE suppresses no-progress infrastructure and regression, trained policies should complete tasks with fewer environment-facing actions than GRPO. Table 9 measures rollout length as the number of action–observation segments per completed evaluation trajectory.

Table 9: Post-training rollout length on Qwen2.5-7B-Instruct.
| Environment | Starting policy | GRPO length | TRIAGE length |
|---|---|---|---|
| ALFWorld | $43.9$ | $24.45 \pm 1.86$ | $21.90 \pm 2.03$ |
| WebShop | $14.80 \pm 0.18$ | $8.00 \pm 0.45$ | $6.82 \pm 0.24$ |

The length results show that both RL methods learn shorter trajectories than the starting policy, but TRIAGE removes more redundant interaction steps than GRPO. On ALFWorld, GRPO reduces the average completed-trajectory length from $43.9$ to $24.45$ segments, while TRIAGE further reduces it to $21.90$, an additional $10.4\%$ reduction relative to GRPO. On WebShop, GRPO reduces rollout length from $14.80$ to $8.00$ segments, while TRIAGE reaches $6.82$, an additional $14.8\%$ reduction. This matches the intended mechanism of role-conditioned credit: suppressing repeated inspections, redundant attribute clicks, and other no-progress or regressive segments improves not only success rate but also interaction efficiency. The effect is especially important for long-horizon agents, where every unnecessary environment-facing step compounds inference cost and increases the opportunity for later mistakes.
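The relative reductions quoted above follow from the Table 9 means as (GRPO length − TRIAGE length) / GRPO length. A small sketch of that arithmetic, with a hypothetical helper name:

```python
def relative_reduction(baseline_length, new_length):
    """Fraction by which new_length is shorter than baseline_length."""
    if baseline_length <= 0:
        raise ValueError("baseline length must be positive")
    return (baseline_length - new_length) / baseline_length


# Mean completed-rollout lengths from Table 9 (GRPO, TRIAGE).
alfworld = relative_reduction(24.45, 21.90)
webshop = relative_reduction(8.00, 6.82)
```

From the rounded table means this gives roughly $0.104$ for ALFWorld and $0.1475$ for WebShop, in line with the reported $10.4\%$ and $14.8\%$.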

Appendix F Sensitivity to Role Constants and $\lambda$

The main text fixes the role constants $(c_{D},c_{E},c_{N},c_{R})=(1,0.5,-0.1,-0.5)$ and tunes only the mixing coefficient $\lambda$ per environment, with $\lambda$ selected on the training split alone (Section 4). This appendix probes how sensitive TRIAGE is to these choices along the two axes that matter most for the conflict cells: the magnitude of the regression penalty $|c_{R}|$ and the overall mixing strength $\lambda$. The sweeps below are post-hoc diagnostics computed on the test set after $\lambda$ was already fixed; they characterize robustness and were not used to select any reported hyperparameter.

All runs use Qwen2.5-7B-Instruct with the default Qwen3-8B-thinking judge; every other hyperparameter is held at its main-text value.

Joint $\lambda\times|c_{R}|$ sweep.

Table 10 sweeps $\lambda\in\{0.1,0.2,0.4\}$ against $|c_{R}|\in\{0.25,0.5,1.0\}$ on WebShop, keeping $(c_{D},c_{E},c_{N})=(1,0.5,-0.1)$ fixed. The default configuration ($\lambda=0.2$, $|c_{R}|=0.5$) is highlighted.

Success rate is stable across the interior of the grid and degrades only at the corners, where either an overly large penalty ($|c_{R}|=1.0$) or an overly strong mixing ($\lambda=0.4$) begins to over-punish segments the judge mislabels as $R$.

Table 10: WebShop success rate (%) on Qwen2.5-7B-Instruct under a joint sweep of the mixing coefficient $\lambda$ and the regression-penalty magnitude $|c_{R}|$. Entries are mean $\pm$ sample standard deviation over ten runs. The default TRIAGE configuration ($\lambda=0.2$, $|c_{R}|=0.5$) is shown in bold. GRPO baseline: $70.1 \pm 2.3$.
| | $\lvert c_{R}\rvert=0.25$ | $\lvert c_{R}\rvert=0.5$ | $\lvert c_{R}\rvert=1.0$ |
|---|---|---|---|
| $\lambda=0.1$ | $74.8 \pm 2.6$ | $75.6 \pm 2.4$ | $74.1 \pm 2.7$ |
| $\lambda=0.2$ | $76.0 \pm 2.5$ | $\mathbf{77.2 \pm 2.2}$ | $74.9 \pm 2.8$ |
| $\lambda=0.4$ | $74.3 \pm 2.7$ | $74.6 \pm 2.9$ | $71.8 \pm 3.1$ |

Varying $|c_{R}|$ at the default $\lambda$.

Isolating $|c_{R}|$ at the per-environment default $\lambda$ confirms the same robustness on the two environments not covered by the WebShop grid above. Extending the zero-penalty ablation of Table 6 to halved, default, and doubled penalties, ALFWorld success for $|c_{R}|\in\{0,0.25,0.5,1.0\}$ is $81.4$/$85.9$/$87.5$/$85.1$ and Search-QA is $46.7$/$47.6$/$48.1$/$46.9$, where $|c_{R}|=0$ reproduces the “no regression penalty” row of Table 6 and $|c_{R}|=0.5$ is the TRIAGE default. The corresponding WebShop trend is the $\lambda=0.2$ row of Table 10 ($76.0$/$77.2$/$74.9$ for $|c_{R}|\in\{0.25,0.5,1.0\}$). In all three environments, halving $|c_{R}|$ retains most of the gain while doubling it stays above GRPO but begins to erode performance, consistent with heavier punishment of misjudged exploration in the more under-explored Search-QA setting.

Takeaway.

The sensitivity results support two conclusions. First, TRIAGE does not rely on a knife-edge choice of $c_{R}$: both the half-penalty and default settings remain well above GRPO and the $c_{R}=0$ ablation.

Second, performance degrades when the role correction becomes too aggressive, especially at larger $\lambda$ and doubled $|c_{R}|$, matching the expected failure mode of over-penalizing judge false positives for $R$. We therefore use the default constants as a conservative operating point rather than as a heavily tuned optimum.

Interaction with batch whitening.

Equation (4) whitens the combined advantage $A_{i,k}^{\mathrm{TRIAGE}}=A_{i}^{\mathrm{GRPO}}+\lambda c_{\hat{\rho}_{i,k}}$ within each batch before broadcasting it to tokens. A natural concern is that a batch containing many large negative $R$ corrections could shift $\mu_{\mathcal{B}}$ and inflate $\sigma_{\mathcal{B}}$ enough to undo the intended penalty.

Two properties bound this effect. First, whitening is an order-preserving affine map: subtracting a common $\mu_{\mathcal{B}}$ and dividing by a positive $\sigma_{\mathcal{B}}$ cannot reverse the relative ordering of two segments, so a segment that received a lower combined advantage because it was labeled $R$ still receives a lower normalized advantage than its non-$R$ peers in the same outcome group. Whitening rescales the magnitude of the correction but never flips the sign of the gap between two segments.

Second, the correction is deliberately small relative to the outcome advantage: with $\lambda\leq 0.4$ and the audited role distribution, the role term contributes a raw standard deviation of only 0.09 to 0.28 (Section 4), so it perturbs rather than dominates $\mu_{\mathcal{B}}$ and $\sigma_{\mathcal{B}}$.

Empirically, the interior stability of Table 10 confirms that whitening does not cancel the role signal across the operating range we use; degradation appears only when $\lambda$ or $|c_{R}|$ is pushed to the grid corners, exactly where the unnormalized correction grows large enough to compete with the outcome advantage.
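The order-preservation argument can be checked numerically on a toy batch. The sketch below is illustrative only: $\lambda=0.2$ and $|c_{R}|=0.5$ are the defaults quoted above, but the other role constants, the toy outcome advantages, the use of the population standard deviation, and the `eps` term are all assumptions, since Equation (4) is not reproduced in this appendix.

```python
from statistics import fmean, pstdev

LAMBDA = 0.2  # default mixing coefficient (Table 10)
# Only |c_R| = 0.5 comes from the text; c_D, c_E, c_N are hypothetical values.
ROLE_CONST = {"D": 0.5, "E": 0.25, "N": -0.1, "R": -0.5}


def combined_advantage(a_grpo, roles, lam=LAMBDA, consts=ROLE_CONST):
    """Per-segment A^GRPO_i + lam * c_role, before any normalization."""
    return [a + lam * consts[r] for a, r in zip(a_grpo, roles)]


def whiten(values, eps=1e-8):
    """Subtract the batch mean and divide by the batch std (assumed form)."""
    mu, sigma = fmean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def same_ordering(xs, ys):
    """True if every pair of positions is ordered identically in xs and ys."""
    def sign(d):
        return (d > 0) - (d < 0)
    n = len(xs)
    return all(sign(xs[j] - xs[k]) == sign(ys[j] - ys[k])
               for j in range(n) for k in range(n))


# Toy batch: a successful rollout (A^GRPO = +1) dominated by R segments,
# followed by a failed rollout (A^GRPO = -1). Values are illustrative.
roles = ["E", "R", "R", "R", "R", "D", "E", "R", "E"]
a_grpo = [1.0] * 6 + [-1.0] * 3

raw = combined_advantage(a_grpo, roles)
white = whiten(raw)

print("ordering preserved:", same_ordering(raw, white))
print("R below D within the successful rollout:", white[1] < white[5])
```

Even with four of six successful-rollout segments labeled $R$, the batch statistics shift, but every pairwise ordering in the toy batch is unchanged after whitening.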

Appendix G Empirical Role Distribution Audit on Logged Trajectories

Setup.

We selected six trajectories from production GRPO baseline runs of Qwen2.5-7B-Instruct: three from ALFWorld and three from WebShop. Trajectories were chosen to span the observed outcome distribution rather than drawn at random: a clean efficient success, a long success containing redundant action repeats, and (where available) a failure where the agent committed early to an incorrect product or container. These six trajectories are a subset of the hand-labeled set in Appendix H, and we reuse its adjudicated per-segment role labels. The labels were produced by two annotators who did not participate in defining the four-role taxonomy of Section 3 ($D$ = decisive progress, $E$ = useful exploration, $N$ = no-progress infrastructure, $R$ = regression) and were adjudicated by a senior annotator; the taxonomy is applied to every environment-facing segment. The audit below focuses on ALFWorld and WebShop trajectories with complete per-segment logs; Search-QA examples are audited separately in Appendix H.

G.1 ALFWorld Trajectories

A1. Clean optimal trajectory.

Task: “put a clean butterknife in diningtable”. Outcome: success, 6 steps, raw environment reward 10. Role distribution: $5D+1E+0N+0R$. Table 11 shows the per-segment role assignment. This trajectory contains a single $E$ segment (the initial location guess) and five $D$ segments completing the task.

Vanilla GRPO: broadcasts $A^{\mathrm{GRPO}}=+(r-\bar{r})/\sigma_{r}$ uniformly to all six segments. With no redundant or regressive segments to absorb credit, this is essentially the right behavior. TRIAGE: under the hand-audited roles, the role-conditioned rule adds $\lambda c_{D}$ to the five $D$ segments and $\lambda c_{E}$ to the initial $E$ segment. The net effect is a slight concentration of credit onto the decisive segments. This is the regime in which TRIAGE and vanilla GRPO behave nearly identically; the point of including this trajectory is to confirm that role-conditioning does not hurt when the trajectory is already efficient.

Table 11: Trajectory A1 (ALFWorld, clean success): per-segment hand role, justification, and Qwen3-8B-thinking judge label (judge agreement 4/6). Roles: D = decisive progress, E = useful exploration, N = no-progress, R = regression.
| $t$ | Action | Hand | Justification | Judge | Agree |
|---|---|---|---|---|---|
| 0 | go to countertop 1 | E | First location, no prior evidence of butterknife position | E | ✓ |
| 1 | take butterknife 2 from countertop 1 | D | Target object acquired | D | ✓ |
| 2 | go to sinkbasin 1 | D | Navigate to required clean facility | E | × |
| 3 | clean butterknife 2 with sinkbasin 1 | D | Required transformation | D | ✓ |
| 4 | go to diningtable 1 | D | Navigate to destination | E | × |
| 5 | move butterknife 2 to diningtable 1 | D | Final placement, reward triggers | D | ✓ |

A2. Lucky-recovery success.

Task: “put a toiletpaper in toiletpaperhanger”. Outcome: success in 22 steps, raw environment reward 10. Role distribution: $4D+7E+1N+10R$ (Table 12). The agent does not head for the cabinet holding the target until step 17 (the toiletpaper is revealed at step 18) and spends steps 0–16 re-examining the same toilet, returning to already-visited locations, and repeating inventory checks. Ten segments are clearly redundant repeats meeting the operational definition of $R$; the final four are $D$ segments completing the task; seven are $E$ (genuine first-time inspections that yielded information); one is $N$ (an empty-handed traversal).

Vanilla GRPO: applies positive $A^{\mathrm{GRPO}}$ uniformly to all 22 segments because the trajectory eventually succeeded. The 10 $R$ segments (repeated examine toilet 1, inventory, and back-and-forth moves between two locations) all receive the same positive reinforcement as the four decisive $D$ segments at the end. This is exactly the failure mode above: success masks regression in hindsight credit. TRIAGE: under the hand-audited roles, steps 4, 6, and 9–16 are $R$ and receive the negative process reward $\lambda c_{R}$ from Section 4, which lowers their segment advantage even though $r=10$. The preserved positive credit concentrates on the four closing $D$ segments and the genuine $E$ segments earlier in the trajectory. Net effect: the trajectory contributes the same outcome signal, but roughly $4/22\approx 18\%$ of its segment positions carry the bulk of the gradient, against $22/22=100\%$ under vanilla GRPO.

Table 12: Trajectory A2 (ALFWorld, success with extensive redundancy), all 22 steps; hand role, justification, and Qwen3-8B-thinking judge label (judge agreement 16/22).
| $t$ | Action | Hand | Justification | Judge | Agree |
|---|---|---|---|---|---|
| 0 | go to toilet 1 | E | Semantic guess, no prior info | E | ✓ |
| 1 | examine toilet 1 | E | First inspection, info gain | E | ✓ |
| 2 | inventory | E | Initial hand check | E | ✓ |
| 3 | go to toiletpaperhanger 1 | N | Empty-handed traversal | E | × |
| 4 | examine toilet 1 | R | Repeat of step 1 | R | ✓ |
| 5 | examine toiletpaperhanger 1 | E | First inspection of hanger | E | ✓ |
| 6 | examine toilet 1 | R | Repeat | R | ✓ |
| 7 | go to countertop 1 | E | New location attempt | E | ✓ |
| 8 | examine countertop 1 | E | First inspection | E | ✓ |
| 9 | go to toilet 1 | R | Repeated return, no new info | R | ✓ |
| 10 | examine toilet 1 | R | Repeat | R | ✓ |
| 11 | go to countertop 1 | R | Repeat | N | × |
| 12 | go to toilet 1 | R | Repeat | N | × |
| 13 | examine toilet 1 | R | Repeat | R | ✓ |
| 14 | inventory | R | Repeat, hand still empty | N | × |
| 15 | examine toilet 1 | R | Repeat | R | ✓ |
| 16 | examine toilet 1 | R | Repeat | R | ✓ |
| 17 | go to cabinet 1 | E | First container attempt | E | ✓ |
| 18 | open cabinet 1 | D | Reveals toiletpaper | E | × |
| 19 | take toiletpaper 1 from cabinet 1 | D | Target acquired | D | ✓ |
| 20 | go to toiletpaperhanger 1 | D | Navigate to destination | N | × |
| 21 | move toiletpaper 1 to toiletpaperhanger 1 | D | Final placement, reward triggers | D | ✓ |

A3. Pathological loop with lucky recovery.

Task: “put a cool apple in garbagecan”. Outcome: success in 34 steps, raw environment reward 10. Role distribution: $5D+8E+1N+20R$ (Table 13). The agent enters a tight loop of 15 consecutive examine fridge 1 actions (steps 2–16) without any state change, then explores other containers for another 12 steps before acquiring the target apple at step 29 and completing the task at step 33.

Vanilla GRPO: a single positive trajectory advantage is broadcast to all 34 segments, including the 15-step examine fridge 1 loop, providing direct gradient encouragement for the policy to repeat no-op observations. This is the most acute illustration in our sample of success masking regression in hindsight credit. After thousands of such trajectories, the resulting policy would be biased toward repeating idle inspections at the start of every task. TRIAGE: under the hand-audited roles, steps 2–16 (the entire loop), steps 21–24 (alternating cabinet re-examines), and step 27 (countertop re-examine) are $R$ and receive lower segment advantages through $\lambda c_{R}$. The remaining positive role-reward mass concentrates on the genuine first-time exploration (steps 0, 1, 17, 18, 20, 25, 26, 28) and the five decisive segments at the end (29–33). Net effect: of 34 segments, 5 carry strong positive credit and 8 carry moderate information-gain credit, against 34 carrying uniform positive credit under vanilla GRPO. Under the hand-audited roles, the trajectory contributes the same outcome signal but stops teaching the policy to enter the examine fridge 1 loop.

Table 13: Trajectory A3 (ALFWorld, pathological loop followed by lucky recovery), all 34 steps; hand role, justification, and Qwen3-8B-thinking judge label (judge agreement 27/34).
| $t$ | Action | Hand | Justification | Judge | Agree |
|---|---|---|---|---|---|
| 0 | go to fridge 1 | E | Semantic guess (fridge is the cool facility) | E | ✓ |
| 1 | open fridge 1 | E | Reveals contents; apple not inside | D | × |
| 2 | examine fridge 1 | R | Repeat of step 1 inspection | E | × |
| 3 | examine fridge 1 | R | Repeat | R | ✓ |
| 4 | examine fridge 1 | R | Repeat | R | ✓ |
| 5 | examine fridge 1 | R | Repeat | R | ✓ |
| 6 | examine fridge 1 | R | Repeat | R | ✓ |
| 7 | examine fridge 1 | R | Repeat | R | ✓ |
| 8 | examine fridge 1 | R | Repeat | R | ✓ |
| 9 | examine fridge 1 | R | Repeat | R | ✓ |
| 10 | examine fridge 1 | R | Repeat | R | ✓ |
| 11 | examine fridge 1 | R | Repeat | R | ✓ |
| 12 | examine fridge 1 | R | Repeat | R | ✓ |
| 13 | examine fridge 1 | R | Repeat | R | ✓ |
| 14 | examine fridge 1 | R | Repeat | R | ✓ |
| 15 | examine fridge 1 | R | Repeat | R | ✓ |
| 16 | examine fridge 1 | R | Repeat (15th consecutive examine fridge) | R | ✓ |
| 17 | go to cabinet 1 | E | First container switch, info gain | E | ✓ |
| 18 | open cabinet 1 | E | First inspection of new container | D | × |
| 19 | examine cabinet 2 | N | Inspect without arriving at the cabinet | E | × |
| 20 | open cabinet 2 | E | First inspection of cabinet 2 | D | × |
| 21 | examine cabinet 1 | R | Repeat of step 18 | R | ✓ |
| 22 | examine cabinet 2 | R | Repeat of step 20 | R | ✓ |
| 23 | examine cabinet 1 | R | Repeat | R | ✓ |
| 24 | examine cabinet 2 | R | Repeat | R | ✓ |
| 25 | go to countertop 1 | E | New location | E | ✓ |
| 26 | examine countertop 1 | E | First inspection | E | ✓ |
| 27 | examine countertop 1 | R | Repeat | R | ✓ |
| 28 | go to countertop 2 | E | New location | E | ✓ |
| 29 | take apple 1 from countertop 2 | D | Target acquired | D | ✓ |
| 30 | go to fridge 1 | D | Navigate to cool facility | N | × |
| 31 | cool apple 1 with fridge 1 | D | Required transformation | D | ✓ |
| 32 | go to garbagecan 1 | D | Navigate to destination | E | × |
| 33 | move apple 1 to garbagecan 1 | D | Final placement, reward triggers | D | ✓ |

G.2 WebShop Trajectories

W1. Clean optimal trajectory.

Task: “Find me hand wash men’s sleep & lounge with long sleeve, elastic waistband, color: multi 9, size: medium, price < \$80”. Outcome: success in 6 steps. Role distribution: $3D+2E+1N+0R$ (Table 14). The agent issues a well-formed search query containing all task constraints, clicks the first returned product for inspection, selects the matching color and size attributes, and clicks buy now. A duplicate buy now after task completion is the only no-progress ($N$) segment.

Vanilla GRPO: applies positive credit uniformly to all six segments; the duplicate buy now receives the same reinforcement as the three genuine decisive clicks. TRIAGE: under the hand-audited roles, the role-conditioned rule adds $\lambda c_{D}$ on the three verifier-facing $D$ segments, $\lambda c_{E}$ on the initial search and product inspection, and a small negative local correction on the post-completion duplicate; the net effect is a slight credit concentration with no behavior change at this trajectory’s outcome level. As with A1, this trajectory exists to confirm that TRIAGE does not degrade efficient short rollouts.

Table 14: Trajectory W1 (WebShop, clean success): hand role, justification, and Qwen3-8B-thinking judge label (judge agreement 3/6).
| $s$ | Action | Hand | Justification | Judge | Agree |
|---|---|---|---|---|---|
| 0 | search[hand wash men’s sleep & lounge …multi] | E | Spec-aligned initial search | E | ✓ |
| 1 | click[b09nd8p2qr] | E | Initial product inspection | R | × |
| 2 | click[multi 9] | D | Color attribute match | D | ✓ |
| 3 | click[medium] | D | Size attribute match | D | ✓ |
| 4 | click[buy now] | D | Reward triggers | R | × |
| 5 | click[buy now] | N | Post-completion duplicate, harmless | R | × |

W2. Long success with redundant attribute clicks.

Task: “Find me home office furniture sets, color: navy | red, shape: round, size: 3’7” x 5’2”, price < \$70”. Outcome: success in 13 steps, raw environment reward 10. Role distribution: $4D+2E+2N+5R$ (Table 15). After all attributes are selected by step 4, the agent re-clicks already-selected attributes four more times (size, shape, color, then size again, at steps 5–8) before finally clicking buy now at step 9. After the purchase is recorded, it re-clicks the color attribute once (step 10) and clicks buy now two more times (steps 11–12).

Vanilla GRPO: applies positive credit to all 13 segments. The five redundant attribute re-clicks at steps 5–8 and 10 receive the same reinforcement as the genuine attribute selection at steps 2–4 and the buy now at step 9. Training on many such trajectories teaches the policy a wrong lesson: that re-clicking already-selected attributes is part of the successful template. TRIAGE: under the hand-audited roles, steps 5, 6, 7, 8, and 10 are $R$ and receive lower segment advantages through the bounded correction $\lambda c_{R}$. Net effect: instead of 13 segments sharing the outcome credit equally, the four $D$ segments (containing the actual purchase logic) receive relatively higher segment advantages, while the redundant re-clicks receive lower relative credit. This trajectory is the most concrete WebShop instance of success masking regression because the wrong-lesson risk is quantitatively measurable: each redundant attribute re-click under vanilla GRPO contributes the same positive log-likelihood gradient as a legitimate $D$ action.

Table 15: Trajectory W2 (WebShop, success with redundant attribute clicks): hand role, justification, and Qwen3-8B-thinking judge label (judge agreement 6/13).
| $s$ | Action | Hand | Justification | Judge | Agree |
|---|---|---|---|---|---|
| 0 | search[home office furniture sets navy red round …] | E | Spec-aligned initial search | D | × |
| 1 | click[b07fkgqkz1] | E | Initial product inspection | E | ✓ |
| 2 | click[3 ft 7 in x 5 ft 2 in] | D | Size selected | E | × |
| 3 | click[round] | D | Shape selected | E | × |
| 4 | click[navy \| red] | D | Color selected; all attributes set | E | × |
| 5 | click[3 ft 7 in x 5 ft 2 in] | R | Redundant size re-click | R | ✓ |
| 6 | click[round] | R | Redundant shape re-click | R | ✓ |
| 7 | click[navy \| red] | R | Redundant color re-click | R | ✓ |
| 8 | click[3 ft 7 in x 5 ft 2 in] | R | Second redundant size re-click | R | ✓ |
| 9 | click[buy now] | D | Reward triggers | R | × |
| 10 | click[navy \| red] | R | Post-purchase attribute re-click | R | ✓ |
| 11 | click[buy now] | N | Post-completion duplicate | R | × |
| 12 | click[buy now] | N | Post-completion duplicate | R | × |

W3. Failure from early commit to wrong product.

Task: “Find me non slip desks for living room, color: christmasgoo3302, size: 19.7x31.5in+19.7x63in, price < \$50”. Outcome: failure in 11 steps, raw environment reward 0. Role distribution: $0D+3E+0N+8R$ (Table 16). The initial search returns a Christmas kitchen mat (B09CQ45ZRB) as the top result. The agent clicks it at step 1, incorrectly committing to a non-desk product. Subsequent steps issue two reformulated searches that re-rank the same item to the top, and the agent clicks the same wrong product again at step 6. Steps 7–10 attempt attribute clicks and a purchase against the wrong product. The bottleneck error is step 1; the second-chance failure is step 6.

Vanilla GRPO: applies negative credit uniformly to all 11 segments because $r=0$. This includes the two legitimate recovery search attempts at steps 4 and 5, which the agent should be encouraged to take after recognizing the wrong commitment. Uniform negative reinforcement teaches the policy to avoid recovery search-after-mistake, the exact opposite of the desired behavior. TRIAGE: under the hand-audited roles, steps 0, 4, and 5 are $E$ (legitimate exploration: an initial good-faith search and two recovery attempts). Under the rule in Section 4, $E$ in a failed trajectory receives the bounded process reward $\lambda c_{E}$ rather than only the negative outcome credit. Steps 1 and 6 (both clicks of the wrong product) are $R$ and receive strong negative credit from $\lambda c_{R}$. This illustrates outcome-mixed exploration: the recovery searches at steps 4–5 are useful exploration appearing inside a failure trajectory, and outcome-only credit assigns them the same negative sign as the wrong-product clicks. Net effect: the policy learns “do not click the wrong product twice” (the steps 1 and 6 lesson) without also learning “do not re-search after a mistake” (the spurious lesson vanilla GRPO would teach).

Table 16: Trajectory W3 (WebShop, failure via early wrong commitment): hand role, justification, and Qwen3-8B-thinking judge label (judge agreement 8/11).
| $s$ | Action | Hand | Justification | Judge | Agree |
|---|---|---|---|---|---|
| 0 | search[non-slip desk Christmasgoo3302 …] | E | Good-faith spec-aligned search | E | ✓ |
| 1 | click[b09cq45zrb] | R | Wrong product type (Christmas kitchen mat, not a desk) | R | ✓ |
| 2 | click[19.7x31.5in+19.7x63in] | R | Attribute click on wrong product | D | × |
| 3 | click[19.7x31.5in+19.7x63in] | R | Redundant repeat | R | ✓ |
| 4 | search[non slip desk christmasgoo3302 …] | E | Recovery attempt: re-search | E | ✓ |
| 5 | search[non slip desk color: christmasgoo3302 …] | E | Recovery attempt: refined search | D | × |
| 6 | click[b09cq45zrb] | R | Re-clicks same wrong product | R | ✓ |
| 7 | click[christmasgoo3302] | R | Attribute click on wrong product | D | × |
| 8 | click[christmasgoo3302] | R | Redundant | R | ✓ |
| 9 | click[19.7x31.5in+19.7x63in] | R | Attribute click on wrong product | R | ✓ |
| 10 | click[buy now] | R | Purchases wrong product, reward stays 0 | R | ✓ |

G.3 Aggregate Observations

Table 17 summarizes the role distribution in the six audited trajectories.

Table 17: Role distribution in logged GRPO rollouts (length-weighted mean over the audited ALFWorld and WebShop trajectories). Regression mass is high in both environments—exactly where TRIAGE yields its largest gains—and much of it sits inside successful trajectories that vanilla GRPO still reinforces.
| Environment | D | E | N | R |
|---|---|---|---|---|
| ALFWorld | 23% | 26% | 3% | 48% |
| WebShop | 23% | 23% | 10% | 43% |

The main takeaway is that regression is common in these logged rollouts, especially as redundant repetition rather than irreversible state corruption. Several successful trajectories contain substantial $R$ mass, so vanilla GRPO would still broadcast positive credit to repeated inspections or repeated attribute clicks. This makes $R$-in-success the most important diagnostic cell for TRIAGE and motivates calibrating the role-conditioned mixing coefficient $\lambda$ on a small per-environment annotated sample.
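The percentages in Table 17 can be reproduced from the per-trajectory role counts stated for A1–A3 and W1–W3. The sketch below assumes that "length-weighted mean" means pooling all segments within an environment before normalizing, which is equivalent to weighting each trajectory's role shares by its length; the function name is hypothetical.

```python
from collections import Counter

# Hand-audited role counts per trajectory, copied from the role
# distributions reported for A1-A3 and W1-W3 (Tables 11-16).
AUDIT = {
    "ALFWorld": {
        "A1": {"D": 5, "E": 1, "N": 0, "R": 0},
        "A2": {"D": 4, "E": 7, "N": 1, "R": 10},
        "A3": {"D": 5, "E": 8, "N": 1, "R": 20},
    },
    "WebShop": {
        "W1": {"D": 3, "E": 2, "N": 1, "R": 0},
        "W2": {"D": 4, "E": 2, "N": 2, "R": 5},
        "W3": {"D": 0, "E": 3, "N": 0, "R": 8},
    },
}


def length_weighted_shares(trajectories):
    """Pool segments across trajectories, then normalize by the pooled total."""
    pooled = Counter()
    for counts in trajectories.values():
        pooled.update(counts)  # Counter.update adds counts from a mapping
    total = sum(pooled.values())
    return {role: pooled[role] / total for role in "DENR"}


for env, trajs in AUDIT.items():
    shares = length_weighted_shares(trajs)
    print(env, {role: f"{share:.0%}" for role, share in shares.items()})
```

The pooled totals are 62 ALFWorld segments and 30 WebShop segments; the WebShop row sums to 99% only because each share is rounded independently.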

Appendix H Judge Model Audit: Side-by-Side Hand vs Qwen3-8B-Thinking Labels

Role-judge context window.

For a segment $k$, the training-time role judge sees a bounded local window around that segment. In our implementation, the window contains the task goal, up to five previous action–observation pairs $(a_{i,k-5},o_{i,k-5},\ldots,a_{i,k-1},o_{i,k-1})$, the current action $a_{i,k}$, the immediate resulting observation $o_{i,k}$, and up to five future action–observation pairs $(a_{i,k+1},o_{i,k+1},\ldots,a_{i,k+5},o_{i,k+5})$ when they exist. Boundary cases use the available prefix or suffix.

The short future window helps identify whether an exploratory segment enabled later progress or whether an apparently harmless step was redundant. We do not feed the entire trajectory to every segment-level judge call because long inputs make repeated high-quality judging expensive and empirically make the classifier less focused on the local causal role.

Controlling the input length keeps the role classifier usable at segment scale and reduces the chance that it relies on distant recovery patterns instead of the current action. The judge still does not receive the final verifier outcome or an unbounded future trajectory, so the role label diagnoses local causal behavior rather than copying the trajectory-level reward that GRPO already supplies.
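The window construction described above can be sketched as a simple slice over the logged steps. This is a minimal illustration, not the training code: the `JudgeWindow` container, the `build_window` helper, and the representation of a step as an (action, observation) pair are assumptions. The structure deliberately has no field for the final verifier outcome.

```python
from dataclasses import dataclass
from typing import List, Tuple

Step = Tuple[str, str]  # (action, observation); representation is assumed


@dataclass
class JudgeWindow:
    task: str
    previous: List[Step]  # up to `radius` pairs before segment k
    current: Step         # a_{i,k} and o_{i,k}
    future: List[Step]    # up to `radius` pairs after segment k


def build_window(task: str, steps: List[Step], k: int, radius: int = 5) -> JudgeWindow:
    """Bounded window around segment k; edges use the available prefix/suffix."""
    if not 0 <= k < len(steps):
        raise IndexError(f"segment index {k} out of range")
    return JudgeWindow(
        task=task,
        previous=steps[max(0, k - radius):k],
        current=steps[k],
        future=steps[k + 1:k + 1 + radius],
    )


# Usage on a dummy 22-step rollout (placeholder strings, not logged data).
steps = [(f"action {t}", f"observation {t}") for t in range(22)]
first = build_window("dummy task", steps, k=0)
middle = build_window("dummy task", steps, k=10)
print(len(first.previous), len(first.future))    # window at the start
print(len(middle.previous), len(middle.future))  # interior window
```

Python slicing handles the boundary cases without special-casing: a slice that runs past either end of the list simply returns the shorter prefix or suffix.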

Setup.

We audit a Qwen3-8B judge with thinking mode enabled on 18 logged trajectories (9 success, 9 failure) across three environments: 3 ALFWorld (captured from the trained GRPO policy), 3 WebShop (trained policy), and 12 Search-QA (base-model rollouts to obtain failure-rich data). To keep the ground truth independent of the rubric design, two annotators who did not participate in defining the role taxonomy of Section 3 each labeled all 135 segments independently. The two annotators reached 88.1% raw label agreement (119 of 135 segments); segments on which they disagreed were adjudicated by a senior annotator, and the adjudicated labels are used as ground truth. For each audited segment, the judge was given the same bounded window used during training: the task, up to five previous action–observation pairs, the current action and immediate observation, and up to five future action–observation pairs when available. The judge was not given the final verifier outcome or the unbounded full trajectory. It was asked to output one role for the current segment using the Qwen3 chat-template enable_thinking=True flag. All inference used temperature 0. Together with the merged ALFWorld and WebShop tables in Appendix G, this appendix reports every trajectory with both hand and judge labels per step. Aggregate judge metrics are reported in Table 4.

Judge prompt.

The audit used the following role-classification prompt. We require a short evidence string for every segment, which forces the judge to ground each label in the local action–observation context rather than emitting only a free-floating role tag; in practice this makes label audits easier and improves judge consistency.

You are an expert evaluator of multi-turn agent trajectories.

You will see a local window around one target segment: the task, up to
five previous action-observation pairs, the CURRENT action and observation,
and up to five future action-observation pairs. Classify only the CURRENT
action into ONE of four roles:

    D (DECISIVE)    The action completes a required sub-goal or makes a
                                    verifier-checkable state change directly required by
                                    the task (e.g. takes the target object, performs a
                                    required transformation like cool/heat/clean, places
                                    the target in the destination, executes the final
                                    purchase, selects a task-mandated attribute).

    E (EXPLORATION) The action gathers information or visits a new state
                                    for the first time without completing a sub-goal.
                                    First-time inspection of a container, first navigation
                                    to a candidate location, an initial search query,
                                    a refined search after recognizing a wrong commitment.

    N (NO-PROGRESS) The action neither changes the task state nor reveals
                                    new information. Empty-handed traversal, harmless
                                    duplicate after task completion, generic navigation
                                    through an irrelevant location with no investigation.

    R (REGRESSION)  Clear setback: the action either corrupts state,
                                    picks the wrong object, commits to a non-matching
                                    product, performs the wrong transformation, OR is a
                                    redundant repeat of an already-completed action that
                                    yields no new information ("examine X" when X was just
                                    examined; re-click of an already-selected attribute;
                                    re-purchase after success).

CALIBRATION RULES
    - Judge LOCAL causal role using only the supplied window. Do not infer
        credit from distant recovery or distant failure outside the window.
    - For the current step, provide brief evidence grounded in the local
        action/observation, e.g. "first inspection reveals new object",
        "repeat with no new information", or "correct target acquired".
    - First-time examine/inspect = E. Second-time examine of the same
        target without state change = R.
    - "Nothing happens." in observation means the action was invalid;
        if action repeats, label R.
    - A buy/place/take/heat/cool of the correct target = D.
    - Re-click of already-selected attribute = R, even if the local observation
        reports success.

OUTPUT FORMAT
After your reasoning, output ONLY a JSON object on a single line at
the very end:
{"labels": ["D"|"E"|"N"|"R", ...], "evidence": ["short reason per step", ...]}
Both lists must have length equal to the number of steps shown.
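A consumer of this prompt has to recover the trailing JSON object from the judge's free-form reasoning. The sketch below shows one way to do that with the standard `json` module. It is an assumption about post-processing, not the authors' pipeline: the helper name `parse_judge_output` is hypothetical, the expected number of entries is supplied by the caller, and the source does not say how malformed outputs were handled.

```python
import json

VALID_ROLES = {"D", "E", "N", "R"}


def parse_judge_output(text: str, n_steps: int):
    """Read the last non-empty line as JSON and validate it against the
    OUTPUT FORMAT section of the prompt. Returns (label, evidence) pairs."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty judge output")
    # json.JSONDecodeError is a subclass of ValueError.
    payload = json.loads(lines[-1])
    labels, evidence = payload["labels"], payload["evidence"]
    if len(labels) != n_steps or len(evidence) != n_steps:
        raise ValueError("label/evidence lists must match the number of steps")
    unknown = [lab for lab in labels if lab not in VALID_ROLES]
    if unknown:
        raise ValueError(f"unknown role labels: {unknown}")
    return list(zip(labels, evidence))


# Usage with a made-up judge response (not a logged output).
sample = (
    "The first examine reveals new objects; the second repeats it.\n"
    '{"labels": ["E", "R"], "evidence": ["first inspection", "repeat"]}'
)
print(parse_judge_output(sample, n_steps=2))
```

Rejecting outputs with unknown labels or mismatched list lengths, rather than silently truncating them, keeps a malformed judge response from being misaligned with the segments it is meant to label.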

ALFWorld and WebShop trajectories.

The six ALFWorld and WebShop trajectories audited here (A1–A3, W1–W3) are the same rollouts analyzed in Appendix G. To avoid duplicating their per-step action listings, their hand labels, Qwen3-8B-thinking judge labels, and per-step agreement are reported together with the role-distribution analysis in Tables 11–16 (judge agreement per trajectory is stated in each caption). The Search-QA trajectories below are audited only here.

Search-QA trajectory summary.

Table 18 summarizes all 12 Search-QA audit trajectories. The table keeps the outcome, question, number of search turns, final answer, hand-label sequence, judge-label sequence, and agreement count; Table 19 then gives the only Search-QA disagreement case step by step.

Table 18: Search-QA audit summary. Label sequences are ordered by environment-facing segment; $S^{m}\to A$ denotes $m$ search turns followed by one answer turn.
| ID | Outcome | Question / final answer | Pattern | Hand labels | Judge labels | Agreement |
|---|---|---|---|---|---|---|
| SQ-F1 | fail | first Nobel Prize in Physics / Wilhelm Röntgen | $S^{2}\to A$ | E,E,R | E,E,R | 3/3 |
| SQ-F2 | fail | next Deadpool movie release / Deadpool 3 | $S^{3}\to A$ | E,E,E,R | E,E,E,R | 4/4 |
| SQ-F3 | fail | short-wave broadcast mode / AM | $S^{3}\to A$ | E,E,E,R | E,E,E,R | 4/4 |
| SQ-F4 | fail | southwest wind across Nigeria / February and June | $S^{3}\to A$ | E,E,E,R | E,E,E,R | 4/4 |
| SQ-F5 | fail | first declaration of human rights / John Peters Humphrey | $S^{3}\to A$ | E,E,R,R | E,E,E,R | 3/4 |
| SQ-F6 | fail | next Scandal episode / April 19, 2018 | $S^{3}\to A$ | E,E,R,R | E,E,R,R | 4/4 |
| SQ-F7 | fail | Philadelphia last Super Bowl win / 2018 | $S^{2}\to A$ | E,E,R | E,E,R | 3/3 |
| SQ-F8 | fail | first lady nominated to Rajya Sabha / Rajvanshi Devi | $S^{3}\to A$ | E,R,R,R | E,R,R,R | 4/4 |
| SQ-S1 | success | Swan Lake, Sleeping Beauty, Nutcracker composer / Pyotr Ilyich Tchaikovsky | $S^{2}\to A$ | E,E,D | E,E,D | 3/3 |
| SQ-S2 | success | Dragon Ball Z episode count / 291 | $S^{2}\to A$ | E,E,D | E,E,D | 3/3 |
| SQ-S3 | success | garden city of New Earswick designer / Raymond Unwin | $S^{3}\to A$ | E,E,E,D | E,E,E,D | 4/4 |
| SQ-S4 | success | filming location of The Curse of Oak Island / Oak Island | $S^{2}\to A$ | E,E,D | E,E,D | 3/3 |
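The per-trajectory agreement counts in Table 18 follow directly from comparing the two label sequences position by position. The sketch below recomputes them from the table's hand and judge columns; the `agreement` helper is hypothetical, and the label strings are transcribed from the table with commas removed.

```python
# (hand labels, judge labels) per trajectory, transcribed from Table 18.
SEARCH_QA = {
    "SQ-F1": ("EER", "EER"),
    "SQ-F2": ("EEER", "EEER"),
    "SQ-F3": ("EEER", "EEER"),
    "SQ-F4": ("EEER", "EEER"),
    "SQ-F5": ("EERR", "EEER"),
    "SQ-F6": ("EERR", "EERR"),
    "SQ-F7": ("EER", "EER"),
    "SQ-F8": ("ERRR", "ERRR"),
    "SQ-S1": ("EED", "EED"),
    "SQ-S2": ("EED", "EED"),
    "SQ-S3": ("EEED", "EEED"),
    "SQ-S4": ("EED", "EED"),
}


def agreement(hand: str, judge: str):
    """Return (matching segments, total segments) for two label strings."""
    if len(hand) != len(judge):
        raise ValueError("label sequences must have the same length")
    return sum(h == j for h, j in zip(hand, judge)), len(hand)


per_traj = {tid: agreement(h, j) for tid, (h, j) in SEARCH_QA.items()}
matches = sum(m for m, _ in per_traj.values())
segments = sum(n for _, n in per_traj.values())

for tid, (m, n) in per_traj.items():
    print(f"{tid}: {m}/{n}")
print(f"Search-QA total: {matches}/{segments}")
```

Summed over the twelve trajectories this gives 42 matching segments out of 43, with the single mismatch at SQ-F5.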

Representative Search-QA disagreement.

SQ-F5 is the only Search-QA trajectory in this audit where Qwen3-8B-thinking disagrees with the adjudicated hand labels. The disagreement is instructive: the third search shifts from the human-rights question to a declaration-of-independence query, so annotators mark it as regression, while the judge still treats it as exploration.

Table 19: Trajectory SQ-F5: representative Search-QA disagreement.
| $t$ | Action | Hand | Qwen3-8B-think | Agree |
|---|---|---|---|---|
| 0 | `<search>who wrote the first declaration of human rights</sea` | E | E | ✓ |
| 1 | `<search>who wrote the first declaration of human rights decl` | E | E | ✓ |
| 2 | `<search>who wrote the declaration of independence</search>` | R | E | × |
| 3 | `<answer>John Peters Humphrey</answer>` | R | R | ✓ |

Audit blind spot.

None of the nine failure trajectories in this set contains a hand-labeled $D$ segment. This is a structural property of the calibration set rather than a sampling artifact: WebShop W3 commits to the wrong product at step 1 and thereafter records only $R$ segments and recovery-search $E$ segments, and the eight Search-QA failures all terminate with a wrong $\langle\text{answer}\rangle$ ($R$) after a sequence of $\langle\text{search}\rangle$ queries labeled $E$ or $R$. The cell “$D$ in failed rollouts” is therefore not measurable on this calibration set. Verifying that the judge correctly identifies decisive intermediate progress within failed trajectories (for example, an ALFWorld agent that correctly heats the target object but then places it in the wrong receptacle, or a Search-QA agent that correctly identifies the bridge entity but issues a malformed final answer) is the principal extension required of a larger follow-up audit.
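The blind spot can be made explicit by tabulating hand labels by outcome and role over all 18 audited trajectories. The sketch below is a bookkeeping illustration built from the role counts for A1–A3 and W1–W3 and the hand-label column of Table 18. The ALFWorld and WebShop label strings are expanded from counts, so their internal order is not meaningful, and the variable names are hypothetical.

```python
from collections import Counter

# (outcome, hand labels) for each audited trajectory. ALFWorld/WebShop
# strings are expanded from reported role counts; Search-QA strings are
# the hand-label sequences of Table 18.
TRAJECTORIES = {
    "A1": ("success", "D" * 5 + "E" * 1),
    "A2": ("success", "D" * 4 + "E" * 7 + "N" * 1 + "R" * 10),
    "A3": ("success", "D" * 5 + "E" * 8 + "N" * 1 + "R" * 20),
    "W1": ("success", "D" * 3 + "E" * 2 + "N" * 1),
    "W2": ("success", "D" * 4 + "E" * 2 + "N" * 2 + "R" * 5),
    "W3": ("fail", "E" * 3 + "R" * 8),
    "SQ-F1": ("fail", "EER"), "SQ-F2": ("fail", "EEER"),
    "SQ-F3": ("fail", "EEER"), "SQ-F4": ("fail", "EEER"),
    "SQ-F5": ("fail", "EERR"), "SQ-F6": ("fail", "EERR"),
    "SQ-F7": ("fail", "EER"), "SQ-F8": ("fail", "ERRR"),
    "SQ-S1": ("success", "EED"), "SQ-S2": ("success", "EED"),
    "SQ-S3": ("success", "EEED"), "SQ-S4": ("success", "EED"),
}

# Count segments in every (outcome, role) cell.
cells = Counter()
for outcome, labels in TRAJECTORIES.values():
    for role in labels:
        cells[(outcome, role)] += 1

empty_cells = [(o, r) for o in ("success", "fail") for r in "DENR"
               if cells[(o, r)] == 0]

print("total segments:", sum(cells.values()))
print("empty cells:", empty_cells)
```

The tabulation recovers the 135 audited segments and confirms that the failed-rollout $D$ cell is empty. On these counts the failed-rollout $N$ cell is empty as well, so judge behavior on that cell is likewise not measured by this set.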

Appendix I Verbatim Search-QA Query Reformulations

For completeness, this section restores the exact per-step text behind the compressed label sequences of Table 18. For each of the twelve audited Search-QA trajectories we list every environment-facing segment in order: search-turn queries followed by the final answer turn, reproduced verbatim from the logged rollout (a few strings were truncated at capture time and are marked with “…”). Each segment is tagged with its adjudicated hand role $[D/E/N/R]$; the single judge disagreement in this set is SQ-F5 segment 2, where the judge predicts $E$ instead of the adjudicated $R$, marked $[R\,|\,\text{judge }E]$. Segments are separated by “$\mid$”; the answer turn is prefixed by $a$.

Failure trajectories.

SQ-F1

(answer “Wilhelm Röntgen”) $s_{0}$ [E] first nobel prize in physics $\mid$ $s_{1}$ [E] Wilhelm Röntgen first Nobel Prize in Physics $\mid$ $a$ [R] Wilhelm Röntgen.

SQ-F2

(answer “Deadpool 3”) $s_{0}$ [E] next Deadpool movie release $\mid$ $s_{1}$ [E] next Deadpool movie release date $\mid$ $s_{2}$ [E] Deadpool 3 movie release date $\mid$ $a$ [R] Deadpool 3.

SQ-F3

(answer “AM”) $s_{0}$ [E] short wave broadcast service $\mid$ $s_{1}$ [E] short wave broadcast service am $\mid$ $s_{2}$ [E] short wave broadcast service mode $\mid$ $a$ [R] AM.

SQ-F4

(answer “February and June”) $s_{0}$ [E] south west wind blows across Nigeria $\mid$ $s_{1}$ [E] south west wind blows across Nigeria between $\mid$ $s_{2}$ [E] south west wind blows across Nigeria between Februar… $\mid$ $a$ [R] February and June.

SQ-F5

(answer “John Peters Humphrey”) $s_{0}$ [E] who wrote the first declaration of human rights $\mid$ $s_{1}$ [E] who wrote the first declaration of human rights decl… $\mid$ $s_{2}$ $[R\,|\,\text{judge }E]$ who wrote the declaration of independence $\mid$ $a$ [R] John Peters Humphrey.

SQ-F6

(answer “April 19, 2018”) $s_{0}$ [E] “Scandal” TV show $\mid$ $s_{1}$ [E] Scandal next episode $\mid$ $s_{2}$ [R] Scandal next episode (verbatim repeat of $s_{1}$) $\mid$ $a$ [R] April 19, 2018.

SQ-F7

(answer “2018”) $s_{0}$ [E] Philadelphia last Super Bowl win $\mid$ $s_{1}$ [E] Philadelphia Eagles last Super Bowl win $\mid$ $a$ [R] 2018.

SQ-F8

(answer “Rajvanshi Devi”) $s_{0}$ [E] first lady nominated member of the rajya sabha $\mid$ $s_{1}$ [R] first lady nominated member of the rajya sabha (repeat) $\mid$ $s_{2}$ [R] first lady nominated member of the rajya sabha (repeat) $\mid$ $a$ [R] Rajvanshi Devi.

Success trajectories.

SQ-S1

(answer “Pyotr Ilyich Tchaikovsky”) $s_{0}$ [E] “Swan Lake the Sleeping Beauty and the Nutcracker fa… $\mid$ $s_{1}$ [E] Swan Lake the Sleeping Beauty and the Nutcracker fam… $\mid$ $a$ [D] Pyotr Ilyich Tchaikovsky.

SQ-S2

(answer “291”) $s_{0}$ [E] Dragon Ball Z episodes $\mid$ $s_{1}$ [E] Dragon Ball Z episodes 291 $\mid$ $a$ [D] 291.

SQ-S3

(answer “Raymond Unwin”) $s_{0}$ [E] garden city of new earswick $\mid$ $s_{1}$ [E] who designed the model village of new earswick $\mid$ $s_{2}$ [E] Raymond Unwin designed the garden city of new earswi… $\mid$ $a$ [D] Raymond Unwin.

SQ-S4

(answer “Oak Island”) $s_{0}$ [E] “The Curse of Oak Island filmed” $\mid$ $s_{1}$ [E] where is the tv show the curse of oak island filmed $\mid$ $a$ [D] Oak Island.