TRIAGE: Role-Typed Credit Assignment for Agentic Reinforcement Learning
Abstract
Agentic reinforcement learning requires assigning credit to environment-facing actions such as searches, clicks, edits, navigation commands, and object interactions. Standard GRPO uses the final verifier outcome as a uniform advantage over all action tokens. This outcome signal is useful but structurally incomplete: it punishes useful exploration in failed rollouts and reinforces redundant or regressive actions in successful rollouts. We propose TRIAGE, a role-typed credit assignment framework that adds a semantic role axis to outcome credit. A structured judge classifies each segment as decisive progress, useful exploration, no-progress infrastructure, or regression, and a fixed role-conditioned rule maps these labels to bounded segment-level process rewards. This keeps verifier outcomes as the source of optimization direction while correcting the two main blind spots of outcome-only credit. We further show that role-conditioned credit is the optimal segment-level correction expressible from role labels alone (a projection of the per-segment advantage residual onto the role variable), so that the fixed role constants reduce advantage estimation error whenever the judge is reliable, and we connect this to lower-variance policy gradients. Across ALFWorld, Search-QA, and WebShop, TRIAGE improves success rates over GRPO for two policy models and outperforms both a scalar judge-derived process reward and an outcome-supervised shared-backbone value baseline. Ablations show that the gain comes from role typing rather than merely adding dense rewards: reliable detection of regression inside successful trajectories is the dominant contributor, while exploration credit provides a consistent secondary gain. On completed ALFWorld and WebShop rollouts, TRIAGE also reduces the number of environment-facing turns relative to GRPO.
1 Introduction
Reinforcement learning with verifiable rewards has become a standard recipe for improving language-model reasoning and agentic behavior (Shao et al., 2024; DeepSeek-AI et al., 2025; Trung et al., 2024; Yu et al., 2025; Xu et al., 2026a). In Group Relative Policy Optimization (GRPO), a policy samples multiple trajectories for a prompt, receives final rewards from a verifier, and assigns relative advantages to the sampled outputs. This recipe is attractive because it requires no learned value model and optimizes directly against the deployment policy. However, when the output is an agentic trajectory rather than a single answer, the central credit-assignment question changes: which environment-facing actions deserve credit when supervision arrives only as a final verifier outcome?
The unit of decision in this setting is not an arbitrary token span. It is an environment-facing segment: a search query, click, file edit, command, object interaction, or tool call that changes either the external state or the agent’s information state. Across WebShop, Search-QA, and ALFWorld, such segments range from decisive actions (final purchases, answer submissions, object placements) to information-gathering actions (searches, inspections, reads) and low-value infrastructure (repeated navigation or redundant clicks).
Outcome credit is therefore useful but structurally incomplete. Standard GRPO treats all segments equally within a trajectory: if the trajectory succeeds, all action tokens are reinforced; if it fails, all are suppressed. This creates two systematic blind spots. First, failed rollouts can contain useful exploratory actions that should not inherit the full negative outcome credit. Second, successful rollouts can contain redundant or harmful actions that should not inherit positive credit merely because the agent later recovered. Final outcome tells us whether the trajectory solved the task, but it cannot say what local role each segment played.
Recent credit-assignment and on-policy supervision methods address parts of this problem. State-anchored estimators compare actions from matched states; process reward models learn dense progress signals; outcome-statistical methods estimate whether recurring segments concentrate in successful or failed rollouts; and token-importance methods reweight supervision within sampled outputs (Wang et al., 2025; Lu et al., 2026; Xu et al., 2026b). These approaches are useful, but they usually score each segment without specifying its semantic role: task progress, belief-state progress, harmless infrastructure, and regression should not receive the same credit rule. We test this distinction directly by comparing against two dense-signal controls—a scalar LLM process-reward baseline with the same judge and context window, and an outcome-supervised shared-backbone value baseline—so the empirical question is not whether dense segment rewards help, but whether role typing adds information beyond them.
Our central claim is therefore: agentic RL needs a role axis in addition to an outcome axis. The most important distinction is that exploration is not no-progress. Exploration often has zero immediate task progress and may appear in both successful and failed trajectories. A purely outcome-statistical estimator can under-credit it because exploratory actions are not always success-specific. A generic process scorer can also conflate exploration with no-progress when no subgoal is completed immediately. Yet suppressing exploration is precisely how sparse-reward agent training becomes brittle: the policy learns to avoid information-gathering actions before it has enough information to act decisively.
We propose TRIAGE, a simple framework for role-aware credit estimation. Like medical triage, which sorts patients by the kind of attention they need before allocating treatment, TRIAGE first sorts each environment-facing segment into a semantic role before deciding how much credit it should inherit from the trajectory outcome. TRIAGE uses a structured LLM judge as a role classifier, not as an unconstrained reward model. Given a bounded local context around each segment, the judge assigns one primary role: decisive progress, useful exploration, no-progress infrastructure, or regression. The RL algorithm then maps roles to different credit rules. Decisive progress receives strong outcome-aligned credit, useful exploration receives bounded positive credit, no-progress infrastructure is dampened toward zero, and regression is suppressed even when it appears in an otherwise successful trajectory.
This design deliberately separates semantic diagnosis from optimization direction. An LLM is well suited to answering local questions such as whether an action inspected a relevant file, narrowed a search, damaged state, or repeated known information. It is less suited to replacing the verifier. TRIAGE therefore keeps the GRPO outcome advantage as the base training signal and uses the role classifier only to add bounded process rewards or penalties at the segment level.
We make four contributions:
1. We identify two structural blind spots of outcome-only segment credit (useful exploration in failed rollouts and regression inside successful rollouts) and define a four-role taxonomy that adds a semantic role axis to trajectory-level outcome credit.
2. We introduce TRIAGE, a role-conditioned credit assignment framework that uses a structured LLM judge for semantic role typing while keeping the GRPO outcome advantage as the source of optimization direction.
3. We give a theoretical justification: role-conditioned credit is the MSE-optimal segment correction measurable from role labels alone, the fixed role constants reduce advantage estimation error whenever they are aligned with this optimum, and this connects to unbiased, lower-variance policy gradients (Section 4.1).
4. We empirically evaluate TRIAGE across diverse agentic tasks and show consistent gains over GRPO, scalar judge-derived process rewards, and an outcome-supervised value baseline, while using manually labeled segments and role diagnostics to explain when the improvement comes from exploration retention, infrastructure damping, or regression suppression.
2 Problem Setup: Segment Credit in Agentic RL
GRPO.
Given a task prompt $x$, GRPO samples a group of $G$ trajectories $\tau_1, \dots, \tau_G$, scores each with a verifier $R(\tau_i)$, and assigns the group-normalized advantage $A_i = \big(R(\tau_i) - \operatorname{mean}_j R(\tau_j)\big) / \operatorname{std}_j R(\tau_j)$ uniformly to every token in the trajectory. Some environment logs report raw success rewards on a different scale, such as 10 for success and 0 for failure; throughout training and in all equations, we binarize these raw rewards to $\{0, 1\}$.
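As a minimal sketch of the group-normalized advantage described above, the following assumes binarized rewards, a population standard deviation over the group, and a small stabilizer `eps`. The function names, the choice of `pstdev`, and `eps` are illustrative assumptions, not details taken from the paper or from any particular GRPO implementation.

```python
from statistics import mean, pstdev


def binarize(raw_reward: float) -> float:
    """Map a raw environment reward (e.g. 10 for success, 0 for failure) to {0, 1}."""
    return 1.0 if raw_reward > 0 else 0.0


def group_normalized_advantages(raw_rewards, eps=1e-6):
    """Return one advantage per trajectory: (reward - group mean) / (group std + eps).

    eps is an assumed stabilizer so that a group with identical rewards
    yields zero advantages instead of a division by zero.
    """
    rewards = [binarize(r) for r in raw_rewards]
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the sampled group (assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one prompt: two successes and two failures on the raw 10/0 scale.
advantages = group_normalized_advantages([10, 0, 0, 10])
# A group where every rollout fails carries no learning signal.
flat_group = group_normalized_advantages([0, 0, 0])
```

Every token of a trajectory then inherits that trajectory's single value, which is the uniform broadcast that the rest of the paper refines.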
From outcome credit to segment credit.
An agentic trajectory $\tau_i$ consists of environment-facing action segments $a_{i,1}, \dots, a_{i,K_i}$ and their resulting observations $o_{i,1}, \dots, o_{i,K_i}$. Broadcasting a single trajectory-level advantage $A_i$ to all segments treats a decisive purchase click, a useful diagnostic read, a harmless no-op, and a wrong edit identically. Process reward models offer one response by learning a dense value or progress score for each step (Lightman et al., 2024), but they do not by themselves specify whether a segment is exploration, infrastructure, or regression. Our goal is a segment-level advantage that reflects not only how good a segment is, but also what role it plays, which requires a structured label rather than a role-agnostic score.
3 Why Outcome Credit Is Structurally Incomplete
Outcome credit supplies the correct trajectory-level direction, but it is a one-axis signal. It partitions rollouts into success and failure, then assigns all local decisions the same sign within each rollout. Agentic trajectories need a second axis: the local semantic role of each segment. Table 1 shows the two conflict cells that motivate this paper. A useful segment in a failed rollout should not be fully punished, and a regressive segment in a successful rollout should not inherit positive credit.
| Local segment role | Successful rollout | Failed rollout |
|---|---|---|
| Useful local segment | should receive positive credit | should not be fully punished |
| Regressive local segment | should not inherit success credit | should be suppressed |
We instantiate this missing role axis with four segment types. Define a role variable
| $z_{i,k} \in \{\mathrm{D}, \mathrm{E}, \mathrm{N}, \mathrm{R}\}$ | (1) |
where $\mathrm{D}$, $\mathrm{E}$, $\mathrm{N}$, and $\mathrm{R}$ denote decisive progress, useful exploration, no-progress infrastructure, and regression, respectively. Table 2 gives the operational definition.
| Role | Definition | Examples | Default credit rule |
|---|---|---|---|
| Decisive ($\mathrm{D}$) | Produces verifier-checkable progress or completes a necessary subgoal | take target; buy item; correct answer | strong outcome-aligned |
| Exploration ($\mathrm{E}$) | Reveals relevant state without immediate completion | read test; constrained search; inspect container | conditional positive |
| No-progress ($\mathrm{N}$) | Changes neither task state nor belief state, but is harmless | duplicate click after completion; empty traversal | slightly penalized |
| Regression ($\mathrm{R}$) | Corrupts state or repeats without information gain | wrong edit; wrong purchase; repeated examine/click | negative |
The taxonomy is intentionally not just an ordering by amount of progress. Exploration is not merely a small amount of progress. It is a different type of progress: it improves the information state rather than the environment state. This matters because many agent tasks are partially observable. Before editing a file, the agent must inspect relevant code and tests. Before buying an item, it must search and compare. Before manipulating an object, it may need to discover where the object or receptacle is. These actions should not be treated like repeated boilerplate just because they do not immediately satisfy the final verifier.
Role boundaries.
The role boundaries are defined by what the segment changes. Decisive progress changes verifier-checkable task state: taking the target object, selecting the required item, submitting the correct answer, or applying the edit that makes a test pass. Exploration changes the information state without yet completing a subgoal: opening a container, reading a failing test, or running a targeted search. This boundary can be blurry in hindsight because an exploratory action may enable a later decisive one, but we reserve the decisive role for direct task-state progress and use the exploration role for first-time, reasonable information collection.
No-progress and regression cover the cases that should not receive positive progress credit. No-progress is harmless infrastructure that changes neither task state nor information state, such as an empty traversal or a generic command that does not affect the next decision. Regression is locally harmful or redundant without information gain: a wrong edit, wrong purchase, corrupted object state, or repeated inspection/click after the relevant information is already known. Final outcome cannot resolve these distinctions. Useful exploration can appear in failed trajectories, and regression can appear in successful ones after later recovery, so role-aware credit must judge the local segment rather than only its trajectory-level success label.
What the judge must get right.
The judge does not need perfect boundary agreement. Its key capability is asymmetric error correction: in successful rollouts, find local regressions that should not inherit positive credit; in failed rollouts, find locally useful segments that should not inherit full negative credit. Operationally, regression has two subclasses: state corruption (wrong edit, wrong purchase, wrong object) and redundant-without-information-gain (repeated inspection or click after the information is already known).
Implications for diagnostics.
The taxonomy also determines what we measure experimentally. Useful exploration is outcome-mixed: it appears in both successful and failed rollouts, so outcome association can make it look neutral or negative. No-progress infrastructure receives nonzero advantage under uniform broadcasting, wasting gradient on boilerplate actions. Regression can appear inside successful trajectories after later recovery, so final outcomes hide local harm. We therefore track three diagnostics in the experiments: exploration retention, infrastructure damping, and regression suppression.
4 TRIAGE: Role-Conditioned Segment Credit
TRIAGE has two components: a structured role judge and a role-conditioned process reward. The policy update remains the standard GRPO update. Rather than using the LLM judge as an unconstrained scalar reward model, TRIAGE uses a rubric-guided judge to assign one auditable semantic role per segment, and maps those roles to fixed credit rules. The only change is the advantage assigned to each environment-facing segment: we keep the trajectory-level GRPO advantage and add a bounded process reward whose form depends on the segment role.
Role-judge context window.
The training-time role judge uses a bounded local context window around each segment; in our experiments this window includes up to five previous and five future action–observation pairs. Appendix H gives the exact window definition. The judge does not receive the final verifier outcome.
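The bounded window can be sketched as a simple slice over the logged steps. This is only an illustration under assumptions: the `(action, observation)` tuple layout and the `judge_context` name are hypothetical, and Appendix H remains the authoritative definition of the window.

```python
from typing import List, Tuple

Step = Tuple[str, str]  # (action, observation); a hypothetical log layout


def judge_context(steps: List[Step], k: int, before: int = 5, after: int = 5):
    """Return (previous, target, future) steps for segment k.

    The window is clipped at the trajectory boundaries, and the final
    verifier outcome is deliberately not part of the returned context.
    """
    if not 0 <= k < len(steps):
        raise IndexError("segment index out of range")
    previous = steps[max(0, k - before):k]
    future = steps[k + 1:k + 1 + after]
    return previous, steps[k], future


# A toy 12-step trajectory with placeholder action/observation strings.
steps = [(f"a{i}", f"o{i}") for i in range(12)]
prev_mid, target_mid, future_mid = judge_context(steps, 6)
prev_early, _, _ = judge_context(steps, 1)
_, _, future_last = judge_context(steps, 11)
```

Segments near the start or end of a rollout simply receive a shorter context on the clipped side.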
Let $A_i$ be the outcome advantage for trajectory $\tau_i$. For segment $k$ of trajectory $i$, with judged role $z_{i,k}$, TRIAGE defines
| $\tilde{A}_{i,k} = A_i + \lambda\, \rho(z_{i,k})$ | (2) |
where $\rho(z_{i,k})$ is a fixed process reward for the assigned role and $\lambda$ controls how strongly this local signal is mixed into the GRPO advantage. The auxiliary judge scores are used only to help choose the role label; they do not enter the training signal.
A simple instantiation sets
| (3) |
Thus decisive progress receives a unit process reward, useful exploration receives a smaller positive reward, no-progress infrastructure receives only a small step cost, and regression receives a larger local penalty even if the trajectory succeeds. This scale follows the usual agent-RL convention that task progress is around , harmless inefficiency receives a mild penalty around , and clearly unhelpful actions receive a stronger negative reward. This keeps the main comparison close to GRPO: the dominant signal is still the outcome advantage, while role typing adds only a bounded segment-level process reward.
Unless otherwise stated, we use for Search-QA and for the other two environments, keeping fixed across tasks. The role constants are never tuned; the only tuned hyperparameter is , selected on the training split by training success rate with the test set held out for final evaluation. The grids in Appendix F are post-hoc sensitivity analyses and were not used to choose .
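For stability, the resulting segment advantages are whitened within each batch before being broadcast to segment tokens:
| $\hat{A}_{i,k} = \dfrac{\tilde{A}_{i,k} - \mu_{\mathcal{B}}}{\sigma_{\mathcal{B}}}$ | (4) |
where $\tilde{A}_{i,k}$ is the segment advantage before whitening and $\mu_{\mathcal{B}}$ and $\sigma_{\mathcal{B}}$ are the mean and standard deviation of the segment advantages in batch $\mathcal{B}$. The policy update is the usual clipped GRPO objective with $\hat{A}_{i,k}$ assigned to tokens belonging to segment $k$ of trajectory $i$. In the evaluated environments, a segment coincides with the standard environment step used in prior agent-RL work: one admissible ALFWorld command, one WebShop search[...] or click[...] action, or one Search-QA search query or final answer submission. The segment advantage is applied only to generated tokens in the corresponding environment-facing turn; prompt and observation tokens are excluded from the policy loss.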
Training procedure.
In each GRPO batch, we first compute the usual trajectory advantage $A_i$ for every rollout. We then split each rollout into environment-facing action segments and ask the role judge for the role $z_{i,k}$ of each segment, together with its auxiliary scores. The role-conditioned process reward is added to the GRPO advantage, the resulting segment advantages are normalized within the batch, and each normalized value is broadcast to the tokens in that segment before the standard clipped GRPO update. No judge is used at evaluation time.
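The procedure above can be sketched end to end as follows. This is a hedged illustration, not the authors' implementation: only the unit reward for decisive progress is stated in the text, so the other three role constants, the mixing weight `lam`, the stabilizer `eps`, and the convention that a segment id of `-1` marks prompt or observation tokens are all placeholder assumptions.

```python
from statistics import mean, pstdev

# Placeholder role constants. Only the unit reward for decisive progress comes
# from the text; the remaining values are illustrative assumptions.
ROLE_REWARD = {
    "decisive": 1.0,
    "exploration": 0.5,
    "no_progress": -0.1,
    "regression": -0.5,
}


def segment_advantages(traj_advantages, roles, lam=0.5, eps=1e-6):
    """Add lam * role reward to each trajectory advantage, then whiten over the batch.

    traj_advantages[i] is the GRPO advantage of trajectory i.
    roles[i][k] is the judged role of segment k in trajectory i.
    """
    raw = [
        [adv + lam * ROLE_REWARD[role] for role in seg_roles]
        for adv, seg_roles in zip(traj_advantages, roles)
    ]
    flat = [v for traj in raw for v in traj]
    mu, sigma = mean(flat), pstdev(flat)
    return [[(v - mu) / (sigma + eps) for v in traj] for traj in raw]


def broadcast_to_tokens(seg_adv, token_segment_ids):
    """Give each generated token its segment's advantage.

    A segment id of -1 marks prompt or observation tokens (assumed convention);
    they get a zero advantage and a zero loss mask.
    """
    adv = [seg_adv[s] if s >= 0 else 0.0 for s in token_segment_ids]
    mask = [1.0 if s >= 0 else 0.0 for s in token_segment_ids]
    return adv, mask


# One successful and one failed rollout with judged segment roles.
adv = segment_advantages(
    [1.0, -1.0],
    [["exploration", "regression", "decisive"], ["exploration", "no_progress"]],
)
tok_adv, tok_mask = broadcast_to_tokens(adv[0], [-1, 0, 0, -1, 1, 2, 2])
```

In this toy batch the regression segment inside the successful rollout ends up with less credit than its neighbours, and the exploration segment inside the failed rollout is punished less than the no-progress one, which is the intended correction of the two conflict cells.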
4.1 Theoretical Justification: Role Conditioning as an Optimal Projection
We give a justification, not a guarantee: under a stated sufficiency assumption, role-conditioned credit is the best segment-level correction expressible from role labels alone, and the fixed constants used by TRIAGE inherit a strictly smaller estimation error than uniform broadcasting whenever aligned with this optimum. We connect this to lower-variance policy gradients and flag where the assumption fails in Appendix B; all proofs are in Appendix A.
Setup.
Let $A^\star_{i,k}$ denote the (unobserved) oracle per-segment advantage and let $A_i$ be the trajectory advantage that GRPO broadcasts to every segment. Define the credit residual
| $\delta_{i,k} = A^\star_{i,k} - A_i$ | (5) |
the within-trajectory variation in true credit that uniform broadcasting discards. A segment-level estimator that adds a correction $c_{i,k}$ to $A_i$ incurs squared error $\mathbb{E}\big[(\delta_{i,k} - c_{i,k})^2\big]$.
Proposition 1 (Optimal role-measurable correction).
Among all corrections $c(z_{i,k})$ that are measurable with respect to the segment role $z_{i,k}$, the minimizer of the segment-advantage MSE is the conditional expectation of the residual,
| $c^\star(z) = \mathbb{E}\big[\delta_{i,k} \mid z_{i,k} = z\big]$ | (6) |
and the resulting MSE reduction relative to GRPO is
| $\mathbb{E}\big[\delta_{i,k}^2\big] - \mathbb{E}\big[(\delta_{i,k} - c^\star(z_{i,k}))^2\big] = \mathbb{E}\big[c^\star(z_{i,k})^2\big] \ge 0$ | (7) |
Proposition 1 formalizes the paper’s central claim: role labels help exactly to the extent that they explain nonzero credit residual, i.e. whenever $c^\star(z) \neq 0$ for some role $z$. The four-role taxonomy is thus an interpretable, coarse discretization of the Bayes-optimal correction $c^\star$, which is negative for regression (over-credited by broadcasting) and positive for exploration in failed rollouts (over-punished by broadcasting). These are precisely the two conflict cells of Table 1.
TRIAGE uses fixed role constants $\rho(z)$ rather than estimating $c^\star$. For the correction $\lambda\,\rho(z_{i,k})$ with mixing weight $\lambda$, the MSE change relative to GRPO is
| $\Delta_{\mathrm{MSE}}(\lambda) = -2\lambda\, \mathbb{E}\big[\rho(z_{i,k})\, c^\star(z_{i,k})\big] + \lambda^2\, \mathbb{E}\big[\rho(z_{i,k})^2\big]$ | (8) |
so any positively aligned role signal, $\mathbb{E}[\rho\, c^\star] > 0$, reduces error for sufficiently small $\lambda > 0$. This is exactly the desired sign pattern: negative for regression that GRPO over-credits and positive for exploration that GRPO over-punishes. Appendix B gives the full fixed-constant condition, connects the correction to policy-gradient variance, and states the failure modes.
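For readers who want the intermediate step, the fixed-constant statement can be derived in a few lines. The sketch below uses assumed notation: $\delta$ for the credit residual, $z$ for the segment role, $c^\star(z) = \mathbb{E}[\delta \mid z]$ for the optimal role-measurable correction, $\rho(z)$ for the fixed role constant, and $\lambda > 0$ for the mixing weight. It is a standard expansion, not a reproduction of the proofs in Appendix A.

```latex
\begin{aligned}
\mathbb{E}\big[(\delta - \lambda\rho(z))^2\big] - \mathbb{E}\big[\delta^2\big]
  &= -2\lambda\,\mathbb{E}\big[\rho(z)\,\delta\big] + \lambda^2\,\mathbb{E}\big[\rho(z)^2\big] \\
  &= -2\lambda\,\mathbb{E}\big[\rho(z)\,\mathbb{E}[\delta \mid z]\big] + \lambda^2\,\mathbb{E}\big[\rho(z)^2\big]
     && \text{(tower property)} \\
  &= -2\lambda\,\mathbb{E}\big[\rho(z)\,c^\star(z)\big] + \lambda^2\,\mathbb{E}\big[\rho(z)^2\big].
\end{aligned}
% The change is negative exactly when
%   0 < \lambda < 2\,\mathbb{E}[\rho\,c^\star] / \mathbb{E}[\rho^2],
% which requires the alignment condition \mathbb{E}[\rho\,c^\star] > 0.
% The best fixed-constant weight is
%   \lambda^\star = \mathbb{E}[\rho\,c^\star] / \mathbb{E}[\rho^2],
% with reduction (\mathbb{E}[\rho\,c^\star])^2 / \mathbb{E}[\rho^2]
%   \le \mathbb{E}[(c^\star)^2] by Cauchy--Schwarz.
```

The last inequality says that fixed constants can at best match the reduction achieved by the optimal role-measurable correction, and they match it only when $\rho$ is proportional to $c^\star$.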
5 Experiments
We design experiments to test role-aware credit rather than merely final performance. The central empirical question is whether TRIAGE preserves useful exploration while suppressing no-progress and regression.
5.1 Experimental Setup
Environments.
We evaluate on three families of agentic tasks. ALFWorld tests embodied household planning with templated actions (Shridhar et al., 2021). Search-QA tests multi-turn retrieval and answer generation, where query formulation and evidence gathering are exploratory (Dunn et al., 2017). WebShop tests product search and purchase (Yao et al., 2022), where search/filter actions are exploratory and the purchase action is decisive.
Models and training.
We evaluate Qwen2.5-7B-Instruct and Qwen3-1.7B-Instruct as deployable student policies for all three environments (Yang et al., 2024). Training uses GRPO with rollouts per prompt, implemented on top of the verl framework (Sheng et al., 2025). TRIAGE uses the same rollouts and verifier rewards as GRPO, plus cached role labels from an LLM judge. All final evaluations use the unaided deployment policy without judge calls. For ALFWorld and WebShop, we repeat training and evaluation with ten independent runs and report the mean ± sample standard deviation. Search-QA runs are substantially more expensive because each optimization step requires large-model rollout with multi-turn retrieval and verifier evaluation, so Search-QA results are reported from a single run under the same fixed training configuration; consequently, Search-QA entries in the tables do not include a standard deviation.
5.2 Main Results
| Model | Method | ALFWorld | Search-QA | WebShop |
|---|---|---|---|---|
| Qwen2.5-7B-Instruct | GRPO | | 43.3 | |
| | TRIAGE w/ Qwen3-8B no-think judge | | 45.0 | |
| | TRIAGE w/ Qwen3-8B-thinking judge, no evidence | | 46.4 | |
| | TRIAGE w/ Qwen3-8B-thinking judge | | | |
| Qwen3-1.7B-Instruct | GRPO | | 39.4 | |
| | TRIAGE w/ Qwen3-8B no-think judge | | 40.2 | |
| | TRIAGE w/ Qwen3-8B-thinking judge, no evidence | | 41.1 | |
| | TRIAGE w/ Qwen3-8B-thinking judge | | 42.3 | |
Figure 1 summarizes the main comparison, and Table 3 reports the underlying numbers. With the default Qwen3-8B-thinking judge, TRIAGE improves over GRPO on all three benchmarks for both policies, with the largest gains on ALFWorld and WebShop—the two audited environments with the highest regression mass (48% and 43%; Appendix G). The Search-QA gain is smaller but consistent, matching its more exploration-dominated, lower-regression profile. This pattern is what role-conditioned credit predicts: most of the benefit comes from withholding positive credit from regressive segments that vanilla GRPO reinforces whenever the trajectory happens to succeed.
The comparison also shows that the benefit depends on judge reliability rather than on simply adding a dense reward. Substituting the Qwen3-8B no-think judge, which collapses on the regression-in-success cell (Table 4), drives TRIAGE below the GRPO baseline on ALFWorld and WebShop for both policies, confirming that the gains stem from accurate role typing and not from the extra reward term alone. Removing the evidence requirement (“no evidence” rows) keeps TRIAGE above GRPO but consistently trails the default prompt, so thinking is necessary for the hard regression-in-success cell while structured evidence acts as a low-cost calibration knob on top of it.
5.3 Does the Judge Recover the Conflict Cells?
Because TRIAGE relies on a role judge, we audit whether the judge recovers local segment roles rather than simply echoing the final outcome. Two annotators independently label 135 environment-facing segments from 18 logged trajectories (3 ALFWorld, 3 WebShop, 12 Search-QA), reaching 88.1% raw agreement; disagreements are adjudicated by a senior annotator and used as ground truth. The prompt, labels, and examples are in Appendix H.
Table 4 reports binary F1 by role–outcome cell, focusing on the two conflict cells: regression inside successful rollouts and useful exploration inside failed rollouts. One role's failed-rollout cell is omitted because it has zero support in this labeled set.
| Config | in success rollouts | in failed rollouts | in success rollouts | in failed rollouts | in success rollouts |
|---|---|---|---|---|---|
| 8B no-think | 29.2 | 81.1 | 56.1 | 90.0 | 55.6 |
| 8B think | 86.1 | 91.9 | 78.7 | 95.2 | 65.1 |
| 14B no-think | 5.7 | 80.0 | 54.0 | 90.9 | 62.9 |
| 14B think | 72.7 | 86.5 | 70.8 | 90.9 | 56.4 |
| 32B no-think | 35.9 | 74.3 | 56.4 | 82.6 | 73.7 |
| 32B think | 88.6 | 83.3 | 70.8 | 88.9 | 65.1 |
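One plausible way to compute a per-cell binary F1 is to restrict the audit to segments from rollouts with a given outcome and treat one role as the positive class. The sketch below follows that reading, which is an assumption about the audit protocol; the function name, the single-letter role labels, and the six toy segments are hypothetical and are not taken from the labeled set.

```python
def cell_f1(gold, pred, outcomes, role, outcome):
    """Binary F1 for `role`, restricted to segments whose rollout had `outcome`."""
    idx = [i for i, o in enumerate(outcomes) if o == outcome]
    tp = sum(gold[i] == role and pred[i] == role for i in idx)
    fp = sum(gold[i] != role and pred[i] == role for i in idx)
    fn = sum(gold[i] == role and pred[i] != role for i in idx)
    if tp == 0:
        return 0.0  # no true positives, so precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels only: D = decisive, E = exploration, N = no-progress, R = regression.
gold = ["R", "E", "R", "D", "E", "R"]
pred = ["R", "E", "E", "D", "E", "N"]
outcomes = ["success", "success", "success", "success", "failure", "failure"]

f1_regression_in_success = cell_f1(gold, pred, outcomes, "R", "success")
f1_exploration_in_failure = cell_f1(gold, pred, outcomes, "E", "failure")
```

In the toy data the judge finds one of the two regressions inside successful rollouts, which gives perfect precision and a recall of one half for that cell.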
The result supports the two-blind-spot framing. Thinking is not uniformly useful; its large effect is concentrated in the regression-in-success cell, where it raises F1 from roughly 24 to 82 averaged over model sizes. The easy cell is exploration-in-failure, where F1 is high even without thinking; the hard cell is finding regression exactly where the verifier says the rollout succeeded. Scaling helps less than enabling thinking: 8B-thinking is within three F1 points of 32B-thinking on regression-in-success at substantially lower inference cost. We therefore use Qwen3-8B with thinking enabled as the default judge.
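The averages quoted above can be checked directly against Table 4. The snippet assumes that the first numeric column of Table 4 is the regression-in-success cell, which is consistent with the quoted figures; the variable names are illustrative.

```python
# First numeric column of Table 4, assumed to be the regression-in-success cell.
no_think = {"8B": 29.2, "14B": 5.7, "32B": 35.9}
think = {"8B": 86.1, "14B": 72.7, "32B": 88.6}

avg_no_think = sum(no_think.values()) / len(no_think)  # about 23.6
avg_think = sum(think.values()) / len(think)           # about 82.5
gap_32b_vs_8b = think["32B"] - think["8B"]              # about 2.5 F1 points
```

The gain from enabling thinking (roughly 59 points on average) is far larger than the gain from moving the thinking judge from 8B to 32B.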
5.4 Comparisons and Ablations
All comparisons and ablations in this section use Qwen2.5-7B-Instruct. We organize the analysis around three questions: how TRIAGE compares with stronger credit-assignment baselines, whether role typing adds value beyond generic dense process rewards, and whether the trained policy exhibits the intended behavioral changes.
External credit-assignment baselines.
Table 5 situates TRIAGE against stronger credit-assignment baselines reproduced under an identical protocol: PPO with a learned critic, GiGPO, which assigns step-level credit by grouping actions from recurring states (Feng et al., 2025), and a shared-backbone value baseline that learns a dense per-segment signal from the same verifier rewards. TRIAGE improves over PPO on all three benchmarks without a separate value network. Relative to GiGPO, TRIAGE is higher on WebShop and statistically tied on ALFWorld, while GiGPO does not apply to Search-QA because its state grouping degenerates when per-step states almost never recur. Relative to the value baseline, TRIAGE tests the central claim of the paper: dense segment credit alone is not enough when productive and regressive actions have similar outcome-trained values, and the missing information is the segment’s semantic role. The key difference is signal source: GiGPO derives micro-advantages structurally from recurring states, the value baseline derives them statistically from outcome regression, and TRIAGE derives them semantically from role labels—targeting the conflict cells that role-agnostic dense signals cannot resolve.
| Method | ALFWorld | Search-QA | WebShop |
|---|---|---|---|
| GRPO | | 43.3 | |
| PPO | | 45.3 | |
| GiGPO | | – | |
| Shared-backbone value baseline (App. D) | | 46.8 | |
| TRIAGE w/ Qwen3-8B-thinking judge | | | |
The shared-backbone value baseline improves over GRPO on the two longer-rollout environments, ALFWorld and Search-QA (43.3 to 46.8 on Search-QA; Table 5), confirming that a learned dense per-segment baseline trained on the same verifier reward is a meaningful upgrade over uniform broadcast. On WebShop, however, it barely moves (within run-to-run variance), while TRIAGE improves clearly over GRPO. The reason is structural: WebShop regressions are repeated clicks of an already-selected attribute that leave the observation almost unchanged, so an outcome-trained value head cannot separate the productive click from its redundant repeat, whereas the role classifier reads the action history and labels the repeat as regression. Appendix D gives the full analysis.
Role-reward ablations.
We also include a scalar process-reward baseline to separate the value of role typing from the value of adding any judge-derived dense reward. This baseline uses the same Qwen3-8B-thinking judge and the same local context window as TRIAGE, but asks for a single progress score $s_{i,k}$ rather than a discrete role. We add this score to the GRPO advantage $A_i$ with a mixing weight $\lambda$ as
| $\tilde{A}^{\mathrm{scalar}}_{i,k} = A_i + \lambda\, s_{i,k}$ | (9) |
and apply the same batch whitening as TRIAGE. This controls for judge access, local context, and dense reward shaping while removing role-conditioned credit rules. Thus the comparison isolates whether the advantage comes from a generic process reward or from the role-specific mapping that treats exploration, no-progress infrastructure, and regression differently.
| Ablation | Change | ALFWorld | Search-QA | WebShop |
|---|---|---|---|---|
| Raw GRPO | no role judge or process reward | | 43.3 | |
| Scalar process reward | 8B-thinking judge, no role typing | | 45.9 | |
| No regression penalty | set regression reward to 0 | | 46.7 | |
| No exploration bonus | set exploration reward to 0 | | 47.5 | |
| TRIAGE | none | | | |
Table 6 isolates the two role-reward components and the role-typing effect itself. The scalar process-reward baseline improves over GRPO, confirming that dense segment feedback is useful, but it remains below TRIAGE on every benchmark. Removing either role component degrades TRIAGE, so the gain is not an artifact of simply adding a dense reward from the same judge. The regression penalty is the dominant contributor: zeroing it causes the larger drop and leaves ALFWorld and WebShop only marginally above raw GRPO. The exploration bonus provides a smaller but consistently positive top-up. This ordering matches the role audit: ALFWorld and WebShop carry regression mass of 48% and 43% (Appendix G), so most of TRIAGE’s gain comes from suppressing credit inside successful trajectories. Consistent with this mechanism, TRIAGE also reduces completed-rollout length relative to GRPO on the two environments (Appendix E). TRIAGE is stable to the role-constant scale and to the mixing weight within a reasonable range (Appendix F).
6 Discussion and Limitations
Limitations.
Role labels are semantic estimates, not ground truth. A judge can overvalue plausible exploration, miss subtle regressions, or rely too much on final outcomes. TRIAGE mitigates this by using the judge only for structured role diagnosis and keeping verifier outcomes as the base optimization signal, but it does not remove judge error.
Role usefulness is also context-dependent. The same search, read, or test command can be informative once and redundant later, so the classifier must condition on local state and redundancy rather than action strings alone. Finally, role-aware credit is not causal identification: it improves local attribution, but counterfactual environment interventions would be needed to prove that a segment was necessary.
Future work.
This paper uses one primary role per segment to keep the signal auditable. A natural extension is a soft role distribution over the four roles, with credit computed as an expectation under role-specific constants. This could better represent mixed segments, such as a search that reveals useful evidence while also introducing distractors, but it would require reliable calibration and stronger audit procedures.
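A minimal sketch of this soft-role extension is an expectation of the role constants under the judge's role probabilities. The paper does not implement this variant, so the function name and the constants below are purely illustrative placeholders.

```python
def expected_role_reward(role_probs, role_reward):
    """Expected process reward under a soft role distribution.

    role_probs maps role name -> probability and must sum to 1.
    role_reward maps role name -> fixed role constant.
    """
    total = sum(role_probs.values())
    if abs(total - 1.0) > 1e-6:
        raise ValueError("role probabilities must sum to 1")
    return sum(p * role_reward[role] for role, p in role_probs.items())


# Placeholder constants for illustration only.
role_reward = {"decisive": 1.0, "exploration": 0.5, "no_progress": -0.1, "regression": -0.5}

# A mixed segment: mostly useful evidence, partly a distractor.
mixed = expected_role_reward({"exploration": 0.7, "regression": 0.3}, role_reward)
# A one-hot distribution recovers the discrete rule used in this paper.
hard = expected_role_reward({"regression": 1.0}, role_reward)
```

Because a one-hot distribution reduces to the discrete rule, the hard-label setting is a special case of the soft one, and the calibration concern applies only to the intermediate probabilities.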
TRIAGE is also compatible with segment bucketing and outcome-statistical estimators. Bucketing can decide which segments share statistical evidence, while role labels decide how that evidence should be interpreted. Combining the two is a promising direction for domains where exact action arguments are sparse and repeated segments are rare.
Finally, the discrete four-role label is only the first layer of role-aware judging. On harder tasks or stronger base agents, obvious loops, wrong purchases, and repeated inspections become rare, and the credit problem shifts from detecting coarse failures to estimating how much each segment advances the task or belief state. In that regime the same framework can use a stronger judge to assign finer-grained process rewards within each role rather than a single discrete label.
7 Related Work
| Method family | Granularity | Expl. ≠ no-prog. | Regr. in success | No state match |
|---|---|---|---|---|
| Outcome / group RL | trajectory | ✗ | ✗ | ✓ |
| GiGPO (Feng et al., 2025) | step | ✗ | partial | ✗ |
| Step / process rewards (Wang et al., 2025; Lightman et al., 2024) | step | ✗ | ✗ | ✓ |
| TRIAGE (ours) | step | ✓ | ✓ | ✓ |
Agentic credit assignment.
Agentic RL requires assigning credit across environment-facing decisions rather than only across tokens. Table 7 summarizes the closest design choices. State-anchored methods such as GiGPO compare actions taken from matched states (Feng et al., 2025); stepwise progress and process-reward methods learn scalar dense scores for intermediate steps (Wang et al., 2025; Lightman et al., 2024). TRIAGE is complementary: it keeps the outcome advantage but adds a semantic role label, so the update can distinguish useful exploration from no-progress behavior and regression from ordinary low progress.
Process reward models and LLM judges.
Process reward models provide dense supervision by scoring intermediate reasoning or agent steps (Lightman et al., 2024). LLM-as-judge methods can evaluate generated outputs, critique trajectories, or assign rubric scores (Shinn et al., 2023; Madaan et al., 2023; Fang et al., 2026). Unstructured process scores can be brittle: they may punish correct actions in failed trajectories, over-credit plausible narration, or conflate exploration with lack of progress. TRIAGE uses the judge more narrowly as a structured classifier over segment roles. This reduces the burden on the judge and makes the resulting signal easier to audit.
Exploration in language agents.
Language agents often rely on information-gathering actions such as search, inspect, read, and test execution (Yao et al., 2023b; Schick et al., 2023). Related prompting and self-improvement methods also exploit multiple sampled reasoning paths, search trees, or self-generated rationales to expose useful intermediate information (Wang et al., 2023; Yao et al., 2023a; Zelikman et al., 2022). These actions change the agent’s belief state rather than immediately completing the task. In sparse-reward RL, such actions are easy to misclassify as neutral or wasteful. TRIAGE makes belief-state progress an explicit credit category, allowing training to preserve useful exploration while still suppressing redundant or irrelevant exploration.
On-policy distillation and token weighting.
On-policy distillation and token-importance methods refine supervision on sampled trajectories (Xu et al., 2026b, a; Agarwal et al., 2024). These methods mostly operate at token or response granularity. TRIAGE operates at the agentic segment level and can be applied to either RL advantages or distillation losses: role labels can gate which action turns receive strong distillation or reinforcement.
8 Conclusion
We argued that agentic credit assignment requires distinguishing what role each environment-facing segment plays. The key missing distinction is that exploration is not no-progress: an action can improve the agent’s belief state without immediately completing a subgoal. TRIAGE operationalizes this idea with a structured role judge and role-conditioned credit rules, keeping the GRPO outcome advantage as the optimization direction while adding a bounded, role-typed correction. Across ALFWorld, Search-QA, and WebShop, this lifts success rates over GRPO for two policy models—by up to points on Qwen2.5-7B and on Qwen3-1.7B—and shortens completed rollouts by –, with ablations and a manual role audit confirming that suppressing regression inside successful trajectories is the dominant source of the gain. Theoretically, role-conditioned credit is the MSE-optimal correction expressible from role labels alone, so the benefit is tied directly to judge reliability, which our audit measures rather than assumes. By reinforcing decisive progress, preserving useful exploration, damping no-progress infrastructure, and suppressing regression, TRIAGE offers a principled path toward sparse-reward RL for agents whose success depends on information gathering and recovery.
References
- On-policy distillation of language models: learning from self-generated mistakes. In International Conference on Learning Representations, Cited by: §7.
- DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. Nature 645, pp. 633–638. Cited by: §1.
- SearchQA: a new Q&A dataset augmented with context from a search engine. arXiv preprint arXiv:1704.05179. Cited by: §5.1.
- Rubric-based on-policy distillation. arXiv preprint arXiv:2605.07396. Cited by: §7.
- Group-in-group policy optimization for LLM agent training. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §5.4, §7, Table 7.
- Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research 5, pp. 1471–1530. Cited by: Appendix B.
- Let’s verify step by step. In International Conference on Learning Representations, Cited by: §2, §7, §7, Table 7.
- Self-distilled agentic reinforcement learning. arXiv preprint arXiv:2605.15155. Cited by: §1.
- Self-refine: iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651. Cited by: §7.
- Toolformer: language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761. Cited by: §7.
- High-dimensional continuous control using generalized advantage estimation. In International Conference on Learning Representations (ICLR), Cited by: Appendix B.
- Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: Appendix D.
- DeepSeekMath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300. Cited by: §1.
- HybridFlow: a flexible and efficient RLHF framework. In Proceedings of the Twentieth European Conference on Computer Systems (EuroSys), Cited by: §5.1.
- Reflexion: language agents with verbal reinforcement learning. arXiv preprint arXiv:2303.11366. Cited by: §7.
- ALFWorld: aligning text and embodied environments for interactive learning. In International Conference on Learning Representations, Cited by: §5.1.
- ReFT: reasoning with reinforced fine-tuning. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, pp. 7601–7614. Cited by: §1.
- SPA-RL: reinforcing LLM agents via stepwise progress attribution. arXiv preprint arXiv:2505.20732. Cited by: §1, §7, Table 7.
- Math-Shepherd: verify and reinforce LLMs step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), Cited by: Appendix D.
- Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations, Cited by: §7.
- Beyond GRPO and on-policy distillation: an empirical sparse-to-dense reward principle for language-model post-training. arXiv preprint arXiv:2605.12483. Cited by: §1, §7.
- TIP: token importance in on-policy distillation. arXiv preprint arXiv:2604.14084. Cited by: §1, §7.
- Qwen2.5 technical report. arXiv preprint arXiv:2412.15115. Cited by: §5.1.
- WebShop: towards scalable real-world web interaction with grounded language agents. In Advances in Neural Information Processing Systems, Cited by: §5.1.
- Tree of thoughts: deliberate problem solving with large language models. In Advances in Neural Information Processing Systems, Cited by: §7.
- ReAct: synergizing reasoning and acting in language models. In International Conference on Learning Representations, Cited by: §7.
- DAPO: an open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476. Cited by: §1.
- STaR: bootstrapping reasoning with reasoning. arXiv preprint arXiv:2203.14465. Cited by: §7.
Appendix A Additional Theory and Proofs
Proof of Proposition 1.
Minimizing over all -measurable is an projection of onto the subspace of -measurable functions; the minimizer is the conditional expectation . Uniform GRPO is the special case , with MSE . By the law of total variance, . ∎
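The projection argument can be checked numerically on toy data. The sketch below assumes a per-segment residual and a discrete role label per segment; the role codes and the residual values are hypothetical and serve only to show that the per-role mean minimizes the squared error among role-only corrections, and that the uniform (zero-correction) error splits into a within-role part and a part explained by roles.

```python
from collections import defaultdict
from statistics import fmean


def role_conditional_means(residuals, roles):
    """Mean residual per role: the least-squares minimizer among role-only corrections."""
    groups = defaultdict(list)
    for delta, role in zip(residuals, roles):
        groups[role].append(delta)
    return {role: fmean(values) for role, values in groups.items()}


def mse(residuals, roles, correction):
    """Mean squared error left after subtracting a per-role correction."""
    return fmean([(delta - correction[role]) ** 2
                  for delta, role in zip(residuals, roles)])


# Toy residuals (hypothetical numbers) with hypothetical role codes.
roles = ["D", "D", "E", "E", "R", "R"]
residuals = [0.9, 1.1, 0.3, 0.5, -1.2, -0.8]

best = role_conditional_means(residuals, roles)
mse_uniform = mse(residuals, roles, {role: 0.0 for role in best})  # no correction
mse_role = mse(residuals, roles, best)                             # per-role mean
explained = fmean([best[role] ** 2 for role in roles])             # energy captured by roles
```

On this toy batch the uniform error equals `mse_role + explained`, which is the decomposition the proof invokes.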
Proposition 2 (MSE reduction under fixed constants).
With the fixed role correction , the batch MSE satisfies
| (10) |
with . TRIAGE strictly reduces MSE iff and , with optimum and maximal reduction .
Proof of Proposition 2.
Expand and average over the batch; the correction is a convex quadratic in , minimized at . ∎
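For readers who want the algebra spelled out, the expansion can be sketched as follows. The notation is assumed here rather than taken from the paper: $\delta_k$ denotes the per-segment advantage residual, $r_k$ the fixed role constant assigned to segment $k$, $\lambda$ the mixing coefficient, and $n$ the number of segments in the batch. $C$ is written as an uncentered cross-moment, which coincides with the covariance when either factor has zero batch mean.

```latex
\begin{aligned}
\mathrm{MSE}(\lambda)
  &= \frac{1}{n}\sum_{k=1}^{n}\bigl(\delta_k - \lambda r_k\bigr)^2
   = \mathrm{MSE}(0) - 2\lambda C + \lambda^2 S, \\
C &= \frac{1}{n}\sum_{k=1}^{n}\delta_k r_k,
\qquad
S = \frac{1}{n}\sum_{k=1}^{n} r_k^2, \\
\mathrm{MSE}(\lambda) < \mathrm{MSE}(0)
  &\iff C > 0 \ \text{and}\ 0 < \lambda < \frac{2C}{S}
  \quad (\text{for } \lambda > 0,\ S > 0), \\
\lambda^\star &= \frac{C}{S},
\qquad
\mathrm{MSE}(0) - \mathrm{MSE}(\lambda^\star) = \frac{C^2}{S}.
\end{aligned}
```

Under these assumptions the reduction depends on the judge only through $C$: a judge whose labels are uncorrelated or anti-correlated with the residual gives $C \le 0$, and no positive $\lambda$ lowers the error.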
Appendix B Extended Theoretical Discussion
This appendix expands the short discussion following Proposition 2: why the fixed constants should align with the residual, how the correction connects to policy-gradient variance, and when the argument fails.
Alignment of fixed constants.
The covariance is maximized when the role constants match the sign pattern of the optimal correction . In the two conflict cells, this means assigning negative credit to segments inside successful trajectories, which GRPO would otherwise over-credit, and positive credit to useful segments inside failed trajectories, which GRPO would otherwise over-punish. The constants implement this ordering without per-environment tuning.
From estimation error to policy-gradient variance.
The target of training is policy improvement, not estimation accuracy per se. The bridge is standard: in policy-gradient estimators, adding any action-history–measurable baseline to the advantage leaves the gradient unbiased while changing its variance, and the variance-minimizing baseline is the conditional expectation of the return [Greensmith et al., 2004, Schulman et al., 2016]. Role labels are functions of the local action–observation window, hence admissible baselines; Proposition 1 identifies the role-measurable correction that minimizes residual energy, and Proposition 2 shows the fixed-constant surrogate reduces it whenever aligned. Because TRIAGE additionally whitens within the batch (Eq. 4), only the sign and relative ordering of the correction must be correct—an order-preserving transform of an aligned correction remains aligned (Appendix F).
Remark 3 (Where the assumption fails).
The benefit hinges on role labels capturing a nontrivial share of the credit residual () and on the judge recovering accurately enough to keep . Both can fail: (i) role is only a partial summary of local credit, so leaves residual error—e.g. the boundary is genuinely ambiguous (Table 4, -in-success F1 ); and (ii) an unreliable judge can drive , in which case no helps. This is the theoretical counterpart of the no-think judge degrading TRIAGE below GRPO (Table 3) and of the degradation at large and (Table 10). We therefore present these results as a justification conditional on judge reliability, which our audit (Section 5, Appendix H) measures directly rather than assumes.
Appendix C Training Hyperparameters
| Method | Steps | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| GRPO | 8 | 150 | 0.2 | – | – | 0.01 | – | 4096 | 512 | 64 | |
| TRIAGE | 8 | 150 | 0.2 | 0.2-0.4 | – | 0.01 | 4096 | 512 | 64 |
Computational overhead.
TRIAGE adds an LLM judge call per segment during training, which increases per-batch wall-clock time. However, the relevant comparison is not raw compute parity but whether the same compute spent on additional GRPO training yields equivalent gains. In our experiments, the GRPO baseline is already near saturation at 150 steps: extending training to 300 steps yields ALFWorld success below 85% and WebShop below 75%, still short of the TRIAGE results ( and respectively). The performance plateau is expected because the credit-assignment bottleneck is structural—broadcasting a single trajectory advantage over 10–30 segments dilutes gradient regardless of how many optimization steps are taken—and more steps cannot fix a noisy per-segment signal.
From a long-rollout perspective, the LLM judge is also structurally advantageous in several respects: (i) credit dilution worsens with trajectory length, so the marginal value of correct per-segment attribution grows with the number of segments; (ii) unlike a learned value critic (as in PPO), the LLM judge generalizes zero-shot across environments without requiring environment-specific training data or reward-model fitting; and (iii) the judge leverages semantic reasoning about task goals, information gain, and state corruption that a scalar critic trained on sparse binary rewards cannot easily acquire. Thus, while the judge adds inference cost, it addresses a qualitatively different bottleneck than the one more training steps would solve.
Appendix D Shared-Backbone Value Baseline
To isolate the contribution of role typing from the contribution of any dense per-segment signal, we compare TRIAGE against a shared-backbone value baseline. This baseline keeps the GRPO policy update but attaches a learned scalar value head to the same policy backbone and trains it on the same on-policy rollouts. The recipe follows the standard actor–critic instantiation used in PPO-style RLHF [Schulman et al., 2017] and the outcome-supervised value learning popularized by Wang et al. [2024], adapted to the agentic segment setting.
Architecture.
The value head is a single linear projection on top of the final-layer hidden state of the policy backbone, evaluated at the last token of each segment’s observation. The backbone is shared with the policy and kept frozen throughout training, so only (a few thousand parameters) receives gradients. This avoids a separate critic network and keeps the additional wall-clock cost negligible relative to GRPO.
Labels: no extra annotation required.
We do not collect any process-level labels and do not call an external judge. The value head is supervised on per-segment discounted Monte-Carlo returns derived from the same binary verifier reward GRPO already computes,
| (11) |
where is the number of environment-facing segments in trajectory . The head is trained by mean-squared regression jointly with each GRPO step on the freshly collected rollouts. This is the same outcome-only supervision Math-Shepherd-style PRMs use, but with the policy backbone shared rather than a separate model fitted on logged data.
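A minimal sketch of how such regression targets could be built from the verifier outcome alone is shown below. It assumes the only reward arrives after the last segment and is discounted back by one factor of the discount per remaining segment; the function name, the argument names, and the exact exponent convention are assumptions, not the paper's definition.

```python
from typing import List


def segment_return_targets(final_reward: float, num_segments: int,
                           gamma: float) -> List[float]:
    """Discounted Monte-Carlo target for each segment k = 0..K-1.

    Assumes a single terminal verifier reward; segment k sees it
    discounted by gamma ** (K - 1 - k).
    """
    return [final_reward * gamma ** (num_segments - 1 - k)
            for k in range(num_segments)]


success_targets = segment_return_targets(1.0, 4, 0.5)
failure_targets = segment_return_targets(0.0, 4, 0.5)
```

With a binary reward, every segment of a failed rollout receives the same zero target, so the value head can only separate segments by their position in successful rollouts and by what the observation reveals.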
Mixing into GRPO.
At credit-assignment time the head’s per-segment value increment is added to the trajectory advantage and whitened with the same batch statistics as TRIAGE before broadcasting to segment tokens:
| (12) |
where is an exponential-moving-average copy of used to decouple value updates from policy updates.
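The mixing step can be sketched as below, before batch whitening is applied. The sketch assumes the value increment of a segment is the difference between consecutive value estimates, clipped to a symmetric range and scaled by the mixing coefficient; the function names, the `values` layout (one estimate before the first segment and one after each segment), and the example numbers are hypothetical.

```python
from typing import List


def clip(x: float, lo: float, hi: float) -> float:
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def value_mixed_advantages(traj_advantage: float, values: List[float],
                           lam: float, max_increment: float) -> List[float]:
    """Trajectory advantage plus a clipped, lam-scaled value increment per segment.

    `values` holds K + 1 estimates for a K-segment trajectory (assumed layout).
    Whitening with batch statistics is intentionally left out of this sketch.
    """
    increments = [
        clip(values[k + 1] - values[k], -max_increment, max_increment)
        for k in range(len(values) - 1)
    ]
    return [traj_advantage + lam * inc for inc in increments]


mixed = value_mixed_advantages(
    traj_advantage=1.0, values=[0.2, 0.5, 0.4, 1.0], lam=0.5, max_increment=0.25
)
```

Clipping keeps a single noisy value jump from dominating the trajectory-level term, which matches the stated purpose of bounding early-training noise.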
Hyperparameters.
GRPO parameters (, , optimization steps, clip ratio , KL coefficient , , , batch size ) are shared with TRIAGE (Table 8). Value-head–specific settings: discount for ALFWorld and WebShop and for Search-QA (reflecting its shorter answer-terminating rollouts); head learning rate ; 10-step head warmup at so converges to a reasonable baseline before being injected into the policy update; EMA target update rate ; per-segment value increment clipped to to bound early-training noise; mixing coefficient matched to TRIAGE’s value per benchmark ( on Search-QA, on ALFWorld and WebShop), so any performance difference reflects the source of the dense signal rather than its scale.
What this baseline isolates.
Both TRIAGE and the shared-backbone value baseline add a bounded, -scaled dense per-segment correction on top of the same GRPO advantage; both whiten within the batch; both use only labels that the GRPO loop already produces (verifier rewards alone for the value baseline, verifier rewards plus role labels from a small judge for TRIAGE). The remaining methodological difference is the source of the per-segment signal: a learned scalar critic regressing trajectory-level outcomes, versus a semantic role classifier with role-conditioned credit rules. Table 5 shows that the value baseline improves over GRPO on two of the three environments (ALFWorld , ; Search-QA , ) but barely moves WebShop (, well inside run-to-run variance), while TRIAGE reaches //. The per-benchmark gap to TRIAGE (//) is largest precisely on WebShop, where regressions take the form of re-clicks of an already-selected attribute that leave the visible observation almost unchanged; the value head therefore receives near-identical Monte-Carlo targets for the productive click and its redundant repeat and credits them near-identically, while the role classifier reads the action history and labels the repeat . The pattern is consistent with the intended interpretation: outcome-trained scalar critics capture coarse per-segment progress when the observation actually evolves, but cannot supply role-level distinctions in action spaces where harmful repetitions leave the local state intact.
Appendix E Rollout Efficiency
Because TRIAGE suppresses no-progress infrastructure and regression, trained policies should complete tasks with fewer environment-facing actions than GRPO. Table 9 measures rollout length as the number of action–observation segments per completed evaluation trajectory.
| Environment | Starting policy | GRPO length | TRIAGE length |
|---|---|---|---|
| ALFWorld | 43.9 | ||
| WebShop |
The length results show that both RL methods learn shorter trajectories than the starting policy, but TRIAGE removes more redundant interaction steps than GRPO. On ALFWorld, GRPO reduces the average completed-trajectory length from to segments, while TRIAGE further reduces it to , an additional reduction relative to GRPO. On WebShop, GRPO reduces rollout length from to segments, while TRIAGE reaches , an additional reduction. This matches the intended mechanism of role-conditioned credit: suppressing repeated inspections, redundant attribute clicks, and other no-progress or regressive segments improves not only success rate but also interaction efficiency. The effect is especially important for long-horizon agents, where every unnecessary environment-facing step compounds inference cost and increases the opportunity for later mistakes.
Appendix F Sensitivity to Role Constants and
The main text fixes the role constants and tunes only the mixing coefficient per environment, with selected on the training split alone (Section 4). This appendix probes how sensitive TRIAGE is to these choices along the two axes that matter most for the conflict cells: the magnitude of the regression penalty and the overall mixing strength . The sweeps below are post-hoc diagnostics computed on the test set after was already fixed; they characterize robustness and were not used to select any reported hyperparameter.
All runs use Qwen2.5-7B-Instruct with the default Qwen3-8B-thinking judge; every other hyperparameter is held at its main-text value.
Joint sweep.
Table 10 sweeps against on WebShop, keeping fixed. The default configuration (, ) is highlighted.
Success rate is stable across the interior of the grid and degrades only at the corners, where either an overly large penalty () or an overly strong mixing () begins to over-punish segments the judge mislabels as .
Varying at the default .
Isolating at the per-environment default confirms the same robustness on the two environments not covered by the WebShop grid above. Extending the zero-penalty ablation of Table 6 to halved, default, and doubled penalties, ALFWorld success for is /// and Search-QA is ///, where reproduces the “no regression penalty” row of Table 6 and is the TRIAGE default. The corresponding WebShop trend is the row of Table 10 (// for ). In all three environments, halving retains most of the gain while doubling it stays above GRPO but begins to erode performance, consistent with heavier punishment of misjudged exploration in the more under-explored Search-QA setting.
Takeaway.
The sensitivity results support two conclusions. First, TRIAGE does not rely on a knife-edge choice of : both the half-penalty and default settings remain well above GRPO and the ablation.
Second, performance degrades when the role correction becomes too aggressive, especially at larger and doubled , matching the expected failure mode of over-penalizing judge false positives for . We therefore use the default constants as a conservative operating point rather than as a heavily tuned optimum.
Interaction with batch whitening.
Equation (4) whitens the combined advantage within each batch before broadcasting it to tokens. A natural concern is that a batch containing many large negative corrections could shift and inflate enough to undo the intended penalty.
Two properties bound this effect. First, whitening is an order-preserving affine map: subtracting a common and dividing by a positive cannot reverse the relative ordering of two segments, so a segment that received a lower combined advantage because it was labeled still receives a lower normalized advantage than its non- peers in the same outcome group. The whitening rescales the magnitude of the correction but never flips its sign.
Second, the correction is deliberately small relative to the outcome advantage: with and the audited role distribution, the role term contributes a raw standard deviation of only – (Section 4), so it perturbs rather than dominates and .
Empirically, the interior stability of Table 10 confirms that whitening does not cancel the role signal across the operating range we use; degradation appears only when or is pushed to the grid corners, exactly where the unnormalized correction grows large enough to compete with the outcome advantage.
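The order-preservation property is easy to verify on a toy batch. The sketch below assumes whitening subtracts the batch mean and divides by the batch standard deviation plus a small constant; the function names and the combined-advantage values are hypothetical.

```python
from statistics import fmean, pstdev


def whiten(xs, eps=1e-8):
    """Affine normalization: subtract the batch mean, divide by the batch std."""
    mu = fmean(xs)
    sigma = pstdev(xs)
    return [(x - mu) / (sigma + eps) for x in xs]


def rank_order(xs):
    """Indices sorted by value; ties keep their original order."""
    return sorted(range(len(xs)), key=lambda i: xs[i])


# Toy batch: four segments from a success (one penalized to 0.7)
# and three from a failure (one penalized to -1.3).
combined = [1.0, 1.0, 0.7, 1.0, -1.0, -0.8, -1.3]
normalized = whiten(combined)
```

The check covers relative ordering only. The absolute sign of an individual normalized value still depends on the batch mean, which is why the argument is stated in terms of a segment's position relative to its peers.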
Appendix G Empirical Role Distribution Audit on Logged Trajectories
Setup.
We sampled six trajectories from production GRPO baseline runs of Qwen2.5-7B-Instruct: three from ALFWorld and three from WebShop. Trajectories were chosen to span the observed outcome distribution rather than randomly: a clean efficient success, a long success containing redundant action repeats, and (where available) a failure where the agent committed early to an incorrect product or container. These six trajectories are a subset of the hand-labeled set in Appendix H, and we reuse its adjudicated per-segment role labels. Two annotators who did not participate in defining the four-role taxonomy of Section 3 ( = decisive progress, = useful exploration, = no-progress infrastructure, = regression) produced the labels, and a senior annotator adjudicated them. We apply that taxonomy to every environment-facing segment. The audit below focuses on ALFWorld and WebShop trajectories with complete per-segment logs; Search-QA examples are audited separately in Appendix H.
G.1 ALFWorld Trajectories
A1. Clean optimal trajectory.
Task: “put a clean butterknife in diningtable”. Outcome: success, 6 steps, raw environment reward 10. Role distribution: . Table 11 shows the per-segment role assignment. This trajectory contains a single segment (the initial location guess) and five segments completing the task.
Vanilla GRPO: broadcasts uniformly to all six segments. With no redundant or regressive segments to absorb credit, this is essentially the right behavior. TRIAGE: under the hand-audited roles, the role-conditioned rule adds to the five segments and to the initial . Net effect is a slight concentration of credit onto the decisive segments. This is the regime in which TRIAGE and vanilla GRPO behave nearly identically; the point of including this trajectory is to confirm that role-conditioning does not hurt when the trajectory is already efficient.
| Step | Action | Justification | Hand | Judge | Agree |
|---|---|---|---|---|---|
| 0 | go to countertop 1 | First location, no prior evidence of butterknife position | |||
| 1 | take butterknife 2 from countertop 1 | Target object acquired | |||
| 2 | go to sinkbasin 1 | Navigate to required clean facility | |||
| 3 | clean butterknife 2 with sinkbasin 1 | Required transformation | |||
| 4 | go to diningtable 1 | Navigate to destination | |||
| 5 | move butterknife 2 to diningtable 1 | Final placement, reward triggers |
A2. Lucky-recovery success.
Task: “put a toiletpaper in toiletpaperhanger”. Outcome: success in 22 steps, raw environment reward 10. Role distribution: (Table 12). The agent does not find the target until step 17 and spends the prior 16 steps re-examining the same toilet, returning to already-visited locations, and repeating inventory checks. Ten segments are clearly redundant repeats meeting the operational definition of ; the final four are completing the task; seven are (genuine first-time inspections that yielded information); one is (an empty-handed traversal).
Vanilla GRPO: applies positive uniformly to all 22 segments because the trajectory eventually succeeded. The 10 segments—repeated examine toilet 1, inventory, back-and-forth between two locations—all receive the same positive reinforcement as the four decisive segments at the end. This is exactly the failure mode above: success masks regression in hindsight credit. TRIAGE: under the hand-audited roles, steps 4, 6, 9–16 are and receive the negative process reward from Section 4, which lowers their segment advantage even though . The preserved positive credit concentrates on the four closing segments and the genuine segments earlier in the trajectory. Net effect: the trajectory contributes the same outcome signal but roughly of its segment positions carry the bulk of the gradient, against under vanilla GRPO.
| Step | Action | Justification | Hand | Judge | Agree |
|---|---|---|---|---|---|
| 0 | go to toilet 1 | Semantic guess, no prior info | |||
| 1 | examine toilet 1 | First inspection, info gain | |||
| 2 | inventory | Initial hand check | |||
| 3 | go to toiletpaperhanger 1 | Empty-handed traversal | |||
| 4 | examine toilet 1 | Repeat of step 1 | |||
| 5 | examine toiletpaperhanger 1 | First inspection of hanger | |||
| 6 | examine toilet 1 | Repeat | |||
| 7 | go to countertop 1 | New location attempt | |||
| 8 | examine countertop 1 | First inspection | |||
| 9 | go to toilet 1 | Repeated return, no new info | |||
| 10 | examine toilet 1 | Repeat | |||
| 11 | go to countertop 1 | Repeat | |||
| 12 | go to toilet 1 | Repeat | |||
| 13 | examine toilet 1 | Repeat | |||
| 14 | inventory | Repeat, hand still empty | |||
| 15 | examine toilet 1 | Repeat | |||
| 16 | examine toilet 1 | Repeat | |||
| 17 | go to cabinet 1 | First container attempt | |||
| 18 | open cabinet 1 | Reveals toiletpaper | |||
| 19 | take toiletpaper 1 from cabinet 1 | Target acquired | |||
| 20 | go to toiletpaperhanger 1 | Navigate to destination | |||
| 21 | move toiletpaper 1 to toiletpaperhanger 1 | Final placement, reward triggers |
A3. Pathological loop with lucky recovery.
Task: “put a cool apple in garbagecan”. Outcome: success in 34 steps, raw environment reward 10. Role distribution: (Table 13). The agent enters a tight loop of 15 consecutive examine fridge 1 actions (steps 2–16) without any state change, then explores other containers for another 12 steps before acquiring the target apple at step 29 and completing the task at step 33.
Vanilla GRPO: a single positive trajectory advantage is broadcast to all 34 segments, including the 15-step examine fridge 1 loop, providing direct gradient encouragement for the policy to repeat no-op observations. This is the most acute illustration in our sample of success masking regression in hindsight credit. After thousands of such trajectories, the resulting policy would be biased toward repeating idle inspections at the start of every task. TRIAGE: under the hand-audited roles, steps 2–16 (the entire loop), steps 21–24 (alternating cabinet re-examines), and step 27 (countertop re-examine) are and receive lower segment advantages with . The remaining positive role-reward mass concentrates on the genuine first-time exploration (steps 0, 1, 17, 18, 20, 25, 26, 28) and the five decisive segments at the end (29–33). Net effect: of 34 segments, 5 carry strong positive credit and 8 carry moderate information-gain credit, against 34 carrying uniform positive credit under vanilla GRPO. Under the hand-audited roles, the trajectory contributes the same outcome signal but stops teaching the policy to enter the examine fridge 1 loop.
| Step | Action | Justification | Hand | Judge | Agree |
|---|---|---|---|---|---|
| 0 | go to fridge 1 | Semantic guess (fridge is the cool facility) | |||
| 1 | open fridge 1 | Reveals contents; apple not inside | |||
| 2 | examine fridge 1 | Repeat of step 1 inspection | |||
| 3 | examine fridge 1 | Repeat | |||
| 4 | examine fridge 1 | Repeat | |||
| 5 | examine fridge 1 | Repeat | |||
| 6 | examine fridge 1 | Repeat | |||
| 7 | examine fridge 1 | Repeat | |||
| 8 | examine fridge 1 | Repeat | |||
| 9 | examine fridge 1 | Repeat | |||
| 10 | examine fridge 1 | Repeat | |||
| 11 | examine fridge 1 | Repeat | |||
| 12 | examine fridge 1 | Repeat | |||
| 13 | examine fridge 1 | Repeat | |||
| 14 | examine fridge 1 | Repeat | |||
| 15 | examine fridge 1 | Repeat | |||
| 16 | examine fridge 1 | Repeat (15th consecutive examine fridge) | |||
| 17 | go to cabinet 1 | First container switch, info gain | |||
| 18 | open cabinet 1 | First inspection of new container | |||
| 19 | examine cabinet 2 | Inspect without arriving at the cabinet | |||
| 20 | open cabinet 2 | First inspection of cabinet 2 | |||
| 21 | examine cabinet 1 | Repeat of step 18 | |||
| 22 | examine cabinet 2 | Repeat of step 20 | |||
| 23 | examine cabinet 1 | Repeat | |||
| 24 | examine cabinet 2 | Repeat | |||
| 25 | go to countertop 1 | New location | |||
| 26 | examine countertop 1 | First inspection | |||
| 27 | examine countertop 1 | Repeat | |||
| 28 | go to countertop 2 | New location | |||
| 29 | take apple 1 from countertop 2 | Target acquired | |||
| 30 | go to fridge 1 | Navigate to cool facility | |||
| 31 | cool apple 1 with fridge 1 | Required transformation | |||
| 32 | go to garbagecan 1 | Navigate to destination | |||
| 33 | move apple 1 to garbagecan 1 | Final placement, reward triggers |
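The role tally for trajectory A3 can be reproduced from the step lists given in the discussion above. In the sketch below the single-letter role codes are hypothetical stand-ins, and the role of step 19 is an assumption, since the text lists it under neither the exploration, regression, nor decisive steps.

```python
from collections import Counter

# Hypothetical role codes: D = decisive progress, E = useful exploration,
# N = no-progress infrastructure, R = regression.
A3_LENGTH = 34


def a3_hand_roles():
    """Per-step roles for trajectory A3, following the step lists in the text."""
    roles = {}
    for step in (0, 1, 17, 18, 20, 25, 26, 28):
        roles[step] = "E"   # genuine first-time exploration
    for step in list(range(2, 17)) + list(range(21, 25)) + [27]:
        roles[step] = "R"   # fridge loop, cabinet and countertop re-examines
    for step in range(29, 34):
        roles[step] = "D"   # closing decisive segments
    roles[19] = "N"         # assumption: step 19's role is not stated in the text
    return [roles[step] for step in range(A3_LENGTH)]


roles = a3_hand_roles()
counts = Counter(roles)
# Share of segments that keep positive role credit (decisive or exploration).
positive_share = (counts["D"] + counts["E"]) / len(roles)
```

Under these labels 13 of the 34 segments keep positive role credit, compared with all 34 under a uniform broadcast.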
G.2 WebShop Trajectories
W1. Clean optimal trajectory.
Task: “Find me hand wash men’s sleep & lounge with long sleeve, elastic waistband, color: multi 9, size: medium, price $80”. Outcome: success in 6 steps. Role distribution: (Table 14). The agent issues a well-formed search query containing all task constraints, clicks the first returned product for inspection, selects the matching color and size attributes, and clicks buy now. A duplicate buy now after task completion is the only no-progress () segment.
Vanilla GRPO: applies positive credit uniformly to all six segments; the duplicate buy now receives the same reinforcement as the three genuine decisive clicks. TRIAGE: under the hand-audited roles, the role-conditioned rule adds on the three verifier-facing segments, on the initial search and product inspection, and a small negative local correction on the post-completion duplicate; net effect is a slight credit concentration with no behavior change at this trajectory’s outcome level. As with A1, this trajectory exists to confirm that TRIAGE does not degrade efficient short rollouts.
| Step | Action | Justification | Hand | Judge | Agree |
|---|---|---|---|---|---|
| 0 | search[hand wash men’s sleep & lounge …multi] | Spec-aligned initial search | |||
| 1 | click[b09nd8p2qr] | Initial product inspection | |||
| 2 | click[multi 9] | Color attribute match | |||
| 3 | click[medium] | Size attribute match | |||
| 4 | click[buy now] | Reward triggers | |||
| 5 | click[buy now] | Post-completion duplicate, harmless |
W2. Long success with redundant attribute clicks.
Task: “Find me home office furniture sets, color: navy red, shape: round, size: 3’7” x 5’2”, price $70”. Outcome: success in 13 steps, raw environment reward 10. Role distribution: (Table 15). After all attributes are selected by step 4, the agent re-clicks the same three attributes (size, shape, color) four more times before finally clicking buy now at step 9, then clicks buy now two more times after the purchase is recorded.
Vanilla GRPO: applies positive credit to all 13 segments. The five redundant attribute re-clicks at steps 5–8 and 10 receive the same reinforcement as the genuine attribute selection at steps 2–4 and the buy now at step 9. Training on many such trajectories teaches the policy a wrong lesson: that re-clicking already-selected attributes is part of the successful template. TRIAGE: under the hand-audited roles, steps 5, 6, 7, 8, 10 are and receive lower segment advantages through the bounded correction . Net effect: instead of 13 segments sharing the outcome credit equally, the four segments (containing the actual purchase logic) receive relatively higher segment advantages, while the redundant re-clicks receive lower relative credit. This trajectory is the most concrete WebShop instance of success masking regression because the wrong-lesson risk is quantitatively measurable: each redundant attribute re-click under vanilla GRPO contributes the same positive log-likelihood gradient as a legitimate action.
| Step | Action | Justification | Hand | Judge | Agree |
|---|---|---|---|---|---|
| 0 | search[home office furniture sets navy red round …] | Spec-aligned initial search | |||
| 1 | click[b07fkgqkz1] | Initial product inspection | |||
| 2 | click[3 ft 7 in x 5 ft 2 in] | Size selected | |||
| 3 | click[round] | Shape selected | |||
| 4 | click[navy red] | Color selected; all attributes set | |||
| 5 | click[3 ft 7 in x 5 ft 2 in] | Redundant size re-click | |||
| 6 | click[round] | Redundant shape re-click | |||
| 7 | click[navy red] | Redundant color re-click | |||
| 8 | click[3 ft 7 in x 5 ft 2 in] | Second redundant size re-click | |||
| 9 | click[buy now] | Reward triggers | |||
| 10 | click[navy red] | Post-purchase attribute re-click | |||
| 11 | click[buy now] | Post-completion duplicate | |||
| 12 | click[buy now] | Post-completion duplicate |
W3. Failure from early commit to wrong product.
Task: “Find me non slip desks for living room, color: christmasgoo3302, size: 19.7x31.5in+19.7x63in, price $50”. Outcome: failure in 11 steps, raw environment reward 0. Role distribution: see Table 16. The initial search returns a Christmas kitchen mat (B09CQ45ZRB) as the top result. The agent clicks it at step 1, incorrectly committing to a non-desk product. After two attribute clicks on this product (steps 2–3), the agent issues two reformulated searches (steps 4–5) that again rank the same item at the top, and it clicks the same wrong product again at step 6. Steps 7–10 attempt attribute clicks and a purchase against the wrong product. The bottleneck error is step 1; the second-chance failure is step 6.
Vanilla GRPO: applies negative credit uniformly to all 11 segments because the trajectory’s outcome advantage is negative. This includes the two legitimate recovery searches at steps 4 and 5, which the agent should be encouraged to take after recognizing the wrong commitment. Uniform negative reinforcement teaches the policy to avoid searching again after a mistake, the exact opposite of the desired behavior. TRIAGE: under the hand-audited roles, steps 0, 4, and 5 are E (legitimate exploration: the initial good-faith search and two recovery attempts). Under the rule in Section 4, an E segment in a failed trajectory receives the bounded process reward rather than only the negative outcome credit. Steps 1 and 6 (both clicks on the wrong product) are R and receive strong negative credit. This illustrates outcome-mixed exploration: the recovery searches at steps 4–5 are useful exploration appearing inside a failure trajectory, and outcome-only credit assigns them the same negative sign as the wrong-product clicks. Net effect: the policy learns “do not click the wrong product twice” (the lesson of steps 1 and 6) without also learning “do not re-search after a mistake” (the spurious lesson vanilla GRPO would teach).
| Step | Action | Justification | Hand | Judge | Agree |
|---|---|---|---|---|---|
| 0 | search[non-slip desk Christmasgoo3302 …] | Good-faith spec-aligned search | |||
| 1 | click[b09cq45zrb] | Wrong product type (Christmas kitchen mat, not a desk) | |||
| 2 | click[19.7x31.5in+19.7x63in] | Attribute click on wrong product | |||
| 3 | click[19.7x31.5in+19.7x63in] | Redundant repeat | |||
| 4 | search[non slip desk christmasgoo3302 …] | Recovery attempt: re-search | |||
| 5 | search[non slip desk color: christmasgoo3302 …] | Recovery attempt: refined search | |||
| 6 | click[b09cq45zrb] | Re-clicks same wrong product | |||
| 7 | click[christmasgoo3302] | Attribute click on wrong product | |||
| 8 | click[christmasgoo3302] | Redundant | |||
| 9 | click[19.7x31.5in+19.7x63in] | Attribute click on wrong product | |||
| 10 | click[buy now] | Purchases wrong product, reward stays 0 |
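The contrast between uniform outcome credit and role-typed credit in W2 and W3 can be sketched numerically. The snippet below is only an illustration under stated assumptions: the additive form `outcome_adv + lam * RHO[role]`, the constants in `RHO`, and the mixing coefficient `LAM` are hypothetical placeholders and are not the values or the exact rule of Section 4. The per-step role strings follow the descriptions above (R at steps 5–8 and 10 and D at steps 2–4 and 9 in W2; E at steps 0, 4, 5 and R at steps 1 and 6 in W3); the remaining steps are filled in by assumption.

```python
from typing import List

# Hypothetical bounded process rewards per role (NOT the paper's constants).
RHO = {"D": 0.5, "E": 0.25, "N": -0.125, "R": -0.5}
LAM = 0.5  # hypothetical mixing coefficient


def uniform_credit(outcome_adv: float, roles: List[str]) -> List[float]:
    """Outcome-only credit: every segment shares the trajectory advantage."""
    return [outcome_adv] * len(roles)


def role_typed_credit(outcome_adv: float, roles: List[str],
                      lam: float = LAM) -> List[float]:
    """Assumed additive form: outcome advantage plus a bounded role term."""
    return [outcome_adv + lam * RHO[r] for r in roles]


# W2 (success, 13 steps). Steps 0-1 as E and 11-12 as N are assumptions.
W2_ROLES = list("EEDDDRRRRDRNN")
# W3 (failure, 11 steps). Steps 2, 3, 7-10 as R are assumptions.
W3_ROLES = list("ERRREERRRRR")

w2_uniform = uniform_credit(+1.0, W2_ROLES)
w2_typed = role_typed_credit(+1.0, W2_ROLES)
w3_typed = role_typed_credit(-1.0, W3_ROLES)

for name, adv in [("W2 uniform", w2_uniform),
                  ("W2 typed", w2_typed),
                  ("W3 typed", w3_typed)]:
    print(name, adv)
```

With these placeholder constants, the step-9 purchase in W2 gets 1.25 while each redundant re-click gets 0.75, and in W3 the recovery searches get -0.875 while the wrong-product clicks get -1.25. The sign of every segment still follows the verifier outcome; only the relative magnitude changes with the role.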
G.3 Aggregate Observations
Table 17 summarizes the role distribution in the six audited trajectories.
| Environment | D | E | N | R |
|---|---|---|---|---|
| ALFWorld | 23% | 26% | 3% | 48% |
| WebShop | 23% | 23% | 10% | 43% |
The main takeaway is that regression is common in these logged rollouts, especially as redundant repetition rather than irreversible state corruption. Several successful trajectories contain substantial R mass, so vanilla GRPO would still broadcast positive credit to repeated inspections or repeated attribute clicks. This makes R-in-success the most important diagnostic cell for TRIAGE and motivates calibrating the role-conditioned mixing coefficient on a small per-environment annotated sample.
Appendix H Judge Model Audit: Side-by-Side Hand vs Qwen3-8B-Thinking Labels
Role-judge context window.
For a given segment, the training-time role judge sees a bounded local window around that segment. In our implementation, the window contains the task goal, up to five previous action–observation pairs, the current action, the immediate resulting observation, and up to five future action–observation pairs when they exist. Boundary cases use the available prefix or suffix.
The short future window helps identify whether an exploratory segment enabled later progress or whether an apparently harmless step was redundant. We do not feed the entire trajectory to every segment-level judge call because long inputs make repeated high-quality judging expensive and empirically make the classifier less focused on the local causal role.
Controlling the input length keeps the role classifier usable at segment scale and reduces the chance that it relies on distant recovery patterns instead of the current action. The judge still does not receive the final verifier outcome or an unbounded future trajectory, so the role label diagnoses local causal behavior rather than copying the trajectory-level reward that GRPO already supplies.
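The window construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `JudgeWindow` and `build_window` and the representation of a step as an `(action, observation)` tuple are assumptions made for the example. Note that neither the final verifier outcome nor the full trajectory is passed through.

```python
from dataclasses import dataclass
from typing import List, Tuple

Step = Tuple[str, str]  # (action, observation); representation is assumed


@dataclass
class JudgeWindow:
    task: str
    previous: List[Step]  # up to k pairs before the current segment
    current: Step         # current action and its immediate observation
    future: List[Step]    # up to k pairs after the current segment


def build_window(task: str, steps: List[Step], t: int, k: int = 5) -> JudgeWindow:
    """Bounded local window around segment t; k=5 follows the text above."""
    if not 0 <= t < len(steps):
        raise IndexError(f"segment index {t} out of range")
    return JudgeWindow(
        task=task,
        previous=steps[max(0, t - k):t],  # boundary: use the available prefix
        current=steps[t],
        future=steps[t + 1:t + 1 + k],    # boundary: use the available suffix
    )


# Usage on a 13-step trajectory with placeholder actions and observations.
trajectory = [(f"action_{i}", f"obs_{i}") for i in range(13)]
window = build_window("placeholder task", trajectory, t=6)
print(len(window.previous), window.current, len(window.future))
```

Slicing handles the boundary cases without special branches: at `t=0` the previous list is empty, and at the last step the future list is empty.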
Setup.
We audit a Qwen3-8B judge with thinking mode enabled on 18 logged trajectories (9 success, 9 failure) across three environments: 3 ALFWorld (captured from the trained GRPO policy), 3 WebShop (trained policy), and 12 Search-QA (base-model rollouts to obtain failure-rich data). To keep the ground truth independent of the rubric design, two annotators who did not participate in defining the role taxonomy of Section 3 each labeled all 135 segments independently. The two annotators reached 88.1% raw label agreement (119 of 135 segments); segments on which they disagreed were adjudicated by a senior annotator, and the adjudicated labels are used as ground truth. For each audited segment, the judge was given the same bounded window used during training: the task, up to five previous action–observation pairs, the current action and immediate observation, and up to five future action–observation pairs when available. The judge was not given the final verifier outcome or the unbounded full trajectory. It was asked to output one role for the current segment using the Qwen3 chat-template enable_thinking=True flag. All inference used temperature . Together with the merged ALFWorld and WebShop tables in Appendix G, this appendix reports every trajectory with both hand and judge labels per step. Aggregate judge metrics are reported in Table 4.
Judge prompt.
The audit used the following role-classification prompt. We require a short evidence string for every segment, which forces the judge to ground each label in the local action–observation context rather than emitting only a free-floating role tag; in practice this makes label audits easier and improves judge consistency.
You are an expert evaluator of multi-turn agent trajectories.
You will see a local window around one target segment: the task, up to
five previous action-observation pairs, the CURRENT action and observation,
and up to five future action-observation pairs. Classify only the CURRENT
action into ONE of four roles:
D (DECISIVE) The action completes a required sub-goal or makes a
verifier-checkable state change directly required by
the task (e.g. takes the target object, performs a
required transformation like cool/heat/clean, places
the target in the destination, executes the final
purchase, selects a task-mandated attribute).
E (EXPLORATION) The action gathers information or visits a new state
for the first time without completing a sub-goal.
First-time inspection of a container, first navigation
to a candidate location, an initial search query,
a refined search after recognizing a wrong commitment.
N (NO-PROGRESS) The action neither changes the task state nor reveals
new information. Empty-handed traversal, harmless
duplicate after task completion, generic navigation
through an irrelevant location with no investigation.
R (REGRESSION) Clear setback: the action either corrupts state,
picks the wrong object, commits to a non-matching
product, performs the wrong transformation, OR is a
redundant repeat of an already-completed action that
yields no new information ("examine X" when X was just
examined; re-click of an already-selected attribute;
re-purchase after success).
CALIBRATION RULES
- Judge LOCAL causal role using only the supplied window. Do not infer
credit from distant recovery or distant failure outside the window.
- For the current step, provide brief evidence grounded in the local
action/observation, e.g. "first inspection reveals new object",
"repeat with no new information", or "correct target acquired".
- First-time examine/inspect = E. Second-time examine of the same
target without state change = R.
- "Nothing happens." in observation means the action was invalid;
if action repeats, label R.
- A buy/place/take/heat/cool of the correct target = D.
- Re-click of already-selected attribute = R, even if the local observation
reports success.
OUTPUT FORMAT
After your reasoning, output ONLY a JSON object on a single line at
the very end:
{"labels": ["D"|"E"|"N"|"R", ...], "evidence": ["short reason per step", ...]}
Both lists must have length equal to the number of steps shown.
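Because the prompt asks for a single-line JSON object at the very end of the response, after free-form reasoning, a downstream consumer has to extract and validate that last line. The following is a minimal sketch of such a parser, assuming the response is available as a plain string; the function name `parse_judge_output` and its error-handling policy are illustrative choices, not part of the audited pipeline.

```python
import json
from typing import Dict, List

VALID_ROLES = {"D", "E", "N", "R"}


def parse_judge_output(text: str, n_steps: int) -> Dict[str, List[str]]:
    """Parse and validate the JSON object on the last non-empty line."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty judge output")
    payload = json.loads(lines[-1])  # JSONDecodeError is a ValueError subclass
    if not isinstance(payload, dict):
        raise ValueError("final line is not a JSON object")
    labels = payload.get("labels")
    evidence = payload.get("evidence")
    if not isinstance(labels, list) or not isinstance(evidence, list):
        raise ValueError("'labels' and 'evidence' must both be lists")
    if len(labels) != n_steps or len(evidence) != n_steps:
        raise ValueError("list lengths must equal the number of steps shown")
    bad = [x for x in labels if not isinstance(x, str) or x not in VALID_ROLES]
    if bad:
        raise ValueError(f"unknown role labels: {bad}")
    return {"labels": labels, "evidence": evidence}


# Usage with a made-up response (reasoning text followed by the JSON line).
raw = (
    "The first examine reveals a new object; the second repeats it.\n"
    '{"labels": ["E", "R"], "evidence": ["first inspection", "repeat"]}'
)
print(parse_judge_output(raw, n_steps=2))
```

Rejecting malformed outputs outright, rather than repairing them, keeps any resulting label audit unambiguous; a retry policy would be a separate design decision.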
ALFWorld and WebShop trajectories.
The six ALFWorld and WebShop trajectories audited here (A1–A3, W1–W3) are the same rollouts analyzed in Appendix G. To avoid duplicating their per-step action listings, their hand labels, Qwen3-8B-thinking judge labels, and per-step agreement are reported together with the role-distribution analysis in Tables 11–16 (judge agreement per trajectory is stated in each caption). The Search-QA trajectories below are audited only here.
Search-QA trajectory summary.
Table 18 summarizes all 12 Search-QA audit trajectories. The table keeps the outcome, question, number of search turns, final answer, hand-label sequence, judge-label sequence, and agreement count; Table 19 then gives the only Search-QA disagreement case step by step.
| ID | Outcome | Question / final answer | Search turns | Hand labels | Judge labels | Agreement |
|---|---|---|---|---|---|---|
| SQ-F1 | fail | first Nobel Prize in Physics / Wilhelm Röntgen | 2 | E,E,R | E,E,R | 3/3 |
| SQ-F2 | fail | next Deadpool movie release / Deadpool 3 | 3 | E,E,E,R | E,E,E,R | 4/4 |
| SQ-F3 | fail | short-wave broadcast mode / AM | 3 | E,E,E,R | E,E,E,R | 4/4 |
| SQ-F4 | fail | southwest wind across Nigeria / February and June | 3 | E,E,E,R | E,E,E,R | 4/4 |
| SQ-F5 | fail | first declaration of human rights / John Peters Humphrey | 3 | E,E,R,R | E,E,E,R | 3/4 |
| SQ-F6 | fail | next Scandal episode / April 19, 2018 | 3 | E,E,R,R | E,E,R,R | 4/4 |
| SQ-F7 | fail | Philadelphia last Super Bowl win / 2018 | 2 | E,E,R | E,E,R | 3/3 |
| SQ-F8 | fail | first lady nominated to Rajya Sabha / Rajvanshi Devi | 3 | E,R,R,R | E,R,R,R | 4/4 |
| SQ-S1 | success | Swan Lake, Sleeping Beauty, Nutcracker composer / Pyotr Ilyich Tchaikovsky | 2 | E,E,D | E,E,D | 3/3 |
| SQ-S2 | success | Dragon Ball Z episode count / 291 | 2 | E,E,D | E,E,D | 3/3 |
| SQ-S3 | success | garden city of New Earswick designer / Raymond Unwin | 3 | E,E,E,D | E,E,E,D | 4/4 |
| SQ-S4 | success | filming location of The Curse of Oak Island / Oak Island | 2 | E,E,D | E,E,D | 3/3 |
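The per-trajectory agreement counts in Table 18 can be recomputed directly from the two label sequences. The sketch below transcribes the hand and judge sequences from the table; the helper name `agreement` is illustrative, and the computation is plain positional matching rather than any chance-corrected statistic.

```python
from typing import Dict, Tuple

# (hand labels, judge labels) transcribed from Table 18.
SEARCH_QA: Dict[str, Tuple[str, str]] = {
    "SQ-F1": ("E,E,R", "E,E,R"),
    "SQ-F2": ("E,E,E,R", "E,E,E,R"),
    "SQ-F3": ("E,E,E,R", "E,E,E,R"),
    "SQ-F4": ("E,E,E,R", "E,E,E,R"),
    "SQ-F5": ("E,E,R,R", "E,E,E,R"),
    "SQ-F6": ("E,E,R,R", "E,E,R,R"),
    "SQ-F7": ("E,E,R", "E,E,R"),
    "SQ-F8": ("E,R,R,R", "E,R,R,R"),
    "SQ-S1": ("E,E,D", "E,E,D"),
    "SQ-S2": ("E,E,D", "E,E,D"),
    "SQ-S3": ("E,E,E,D", "E,E,E,D"),
    "SQ-S4": ("E,E,D", "E,E,D"),
}


def agreement(audit: Dict[str, Tuple[str, str]]):
    """Return per-trajectory (matches, segments) and the pooled totals."""
    per_traj = {}
    matched = total = 0
    for tid, (hand, judge) in audit.items():
        h, j = hand.split(","), judge.split(",")
        if len(h) != len(j):
            raise ValueError(f"{tid}: label sequences differ in length")
        hits = sum(a == b for a, b in zip(h, j))
        per_traj[tid] = (hits, len(h))
        matched += hits
        total += len(h)
    return per_traj, matched, total


per_traj, matched, total = agreement(SEARCH_QA)
print(per_traj["SQ-F5"])             # (3, 4)
print(f"{matched}/{total} segments")  # 42/43 segments
```

Pooled over the Search-QA subset, the judge matches the adjudicated labels on 42 of 43 segments, with the single mismatch in SQ-F5.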
Representative Search-QA disagreement.
SQ-F5 is the only Search-QA trajectory in this audit where Qwen3-8B-thinking disagrees with the adjudicated hand labels. The disagreement is instructive: the third search shifts from the human-rights question to a declaration-of-independence query, so annotators mark it as regression, while the judge still treats it as exploration.
| Step | Action | Hand | Qwen3-8B-think | Agree |
|---|---|---|---|---|
| 0 | <search>who wrote the first declaration of human rights</sea | E | E | ✓ |
| 1 | <search>who wrote the first declaration of human rights decl | E | E | ✓ |
| 2 | <search>who wrote the declaration of independence</search> | R | E | ✗ |
| 3 | <answer>John Peters Humphrey</answer> | R | R | ✓ |
Audit blind spot.
None of the nine failure trajectories in this set contains any hand-labeled D segment. This is a structural property of the calibration set rather than a sampling artifact: WebShop W3 commits to the wrong product at step 1 and accumulates only non-decisive segments thereafter, and the eight Search-QA failures all terminate with a wrong answer (R) after a sequence of search queries (E or R). The cell “D in failed rollouts” is therefore not measurable on this calibration set. Verifying that the judge correctly identifies decisive intermediate progress within failed trajectories (for example, an ALFWorld agent that correctly heats the target object but then places it in the wrong receptacle, or a Search-QA agent that correctly identifies the bridge entity but issues a malformed final answer) is the principal extension required of a larger follow-up audit.
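A simple coverage check makes this kind of blind spot visible before a calibration set is used. The sketch below tabulates hand-labeled roles by outcome for the twelve Search-QA trajectories only (sequences transcribed from Table 18) and lists the empty (outcome, role) cells; the names `cell_counts` and `empty_cells` are illustrative, and the WebShop and ALFWorld trajectories are left out of this example.

```python
from collections import Counter
from typing import List, Tuple

ROLES = ("D", "E", "N", "R")
OUTCOMES = ("fail", "success")

# (outcome, hand-label sequence) transcribed from Table 18.
HAND_LABELS: List[Tuple[str, str]] = [
    ("fail", "E,E,R"), ("fail", "E,E,E,R"), ("fail", "E,E,E,R"),
    ("fail", "E,E,E,R"), ("fail", "E,E,R,R"), ("fail", "E,E,R,R"),
    ("fail", "E,E,R"), ("fail", "E,R,R,R"),
    ("success", "E,E,D"), ("success", "E,E,D"),
    ("success", "E,E,E,D"), ("success", "E,E,D"),
]


def cell_counts(rows: List[Tuple[str, str]]) -> Counter:
    """Count segments in each (outcome, role) cell."""
    counts: Counter = Counter()
    for outcome, seq in rows:
        for role in seq.split(","):
            counts[(outcome, role)] += 1
    return counts


def empty_cells(counts: Counter) -> List[Tuple[str, str]]:
    """Cells with no segments; judge accuracy cannot be measured there."""
    return [(o, r) for o in OUTCOMES for r in ROLES if counts[(o, r)] == 0]


counts = cell_counts(HAND_LABELS)
print(dict(counts))
print(empty_cells(counts))
```

On this subset the (fail, D) cell is empty, which matches the blind spot described above; the check also shows that N is absent from both outcomes and R from the successes in Search-QA.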
Appendix I Verbatim Search-QA Query Reformulations
For completeness, this section restores the exact per-step text behind the compressed label sequences of Table 18. For each of the twelve audited Search-QA trajectories we list every environment-facing segment in order: the search-turn queries followed by the final answer turn, reproduced verbatim from the logged rollout (a few strings were truncated at capture time and are marked with “…”). Each segment is preceded by its adjudicated hand role in square brackets. The single judge disagreement in this set is SQ-F5 segment 2, where the judge predicts E instead of the adjudicated R; it is marked [R; judge: E]. Segments are separated by semicolons, and the answer turn is prefixed by “answer:”.
Failure trajectories.
- SQ-F1 (answer “Wilhelm Röntgen”): [E] first nobel prize in physics; [E] Wilhelm Röntgen first Nobel Prize in Physics; [R] answer: Wilhelm Röntgen.
- SQ-F2 (answer “Deadpool 3”): [E] next Deadpool movie release; [E] next Deadpool movie release date; [E] Deadpool 3 movie release date; [R] answer: Deadpool 3.
- SQ-F3 (answer “AM”): [E] short wave broadcast service; [E] short wave broadcast service am; [E] short wave broadcast service mode; [R] answer: AM.
- SQ-F4 (answer “February and June”): [E] south west wind blows across Nigeria; [E] south west wind blows across Nigeria between; [E] south west wind blows across Nigeria between Februar…; [R] answer: February and June.
- SQ-F5 (answer “John Peters Humphrey”): [E] who wrote the first declaration of human rights; [E] who wrote the first declaration of human rights decl…; [R; judge: E] who wrote the declaration of independence; [R] answer: John Peters Humphrey.
- SQ-F6 (answer “April 19, 2018”): [E] “Scandal” TV show; [E] Scandal next episode; [R] Scandal next episode (verbatim repeat of the previous query); [R] answer: April 19, 2018.
- SQ-F7 (answer “2018”): [E] Philadelphia last Super Bowl win; [E] Philadelphia Eagles last Super Bowl win; [R] answer: 2018.
- SQ-F8 (answer “Rajvanshi Devi”): [E] first lady nominated member of the rajya sabha; [R] first lady nominated member of the rajya sabha (repeat); [R] first lady nominated member of the rajya sabha (repeat); [R] answer: Rajvanshi Devi.
Success trajectories.
- SQ-S1 (answer “Pyotr Ilyich Tchaikovsky”): [E] “Swan Lake the Sleeping Beauty and the Nutcracker fa…; [E] Swan Lake the Sleeping Beauty and the Nutcracker fam…; [D] answer: Pyotr Ilyich Tchaikovsky.
- SQ-S2 (answer “291”): [E] Dragon Ball Z episodes; [E] Dragon Ball Z episodes 291; [D] answer: 291.
- SQ-S3 (answer “Raymond Unwin”): [E] garden city of new earswick; [E] who designed the model village of new earswick; [E] Raymond Unwin designed the garden city of new earswi…; [D] answer: Raymond Unwin.
- SQ-S4 (answer “Oak Island”): [E] “The Curse of Oak Island filmed”; [E] where is the tv show the curse of oak island filmed; [D] answer: Oak Island.