MindGames Arena Generalization Track:
In2AI Solution with Delayed Per-Step Reward Attribution
Abstract
Training language model agents for multi-agent strategic interaction presents a core difficulty: the quality of any action may depend on future events that never materialize, on moves that violate game rules, or on decisions made by other players. Standard reinforcement learning assumes that rewards can be assigned at each step, but this assumption fails in settings where outcomes are entangled across time and agents. We introduce delayed per-step reward attribution with eligibility gating, an episode lifecycle and postprocessing pipeline that computes rewards only at episode end, propagates them back to originating steps according to task-specific semantics, and excludes steps that lack valid dependent information from training. Together with asynchronous rollout generation via vLLM’s continuous batching, curriculum-based opponent sampling, and multi-level stratified batch construction, this approach enables stable, sample-efficient RL training in multi-agent environments. We evaluate on the MindGames Arena benchmark at NeurIPS 2025, where a single 8-billion-parameter open-source model trained with our method matched or surpassed substantially larger proprietary systems, including GPT-5, in head-to-head play and took first place in both the Open (unrestricted) and Efficient (at most 8B parameters) divisions.
Keywords: Reinforcement Learning · Multi-Agent Systems · Language Models · Reward Attribution · Eligibility Gating
1 Introduction
Large language models excel at single-turn tasks (et al., 2025), yet settings that require modeling other agents’ beliefs, coordinating under uncertainty, and planning over extended interactions remain difficult (Guertler et al., 2025; Park et al., 2023). Recent work applies reinforcement learning to improve LLM agents (Ouyang et al., 2022; Shinn et al., 2023), but most methods assume single-agent, single-turn environments where reward signals are immediate and well-defined.
Multi-agent strategic interaction fundamentally violates this assumption. When language model agents interact over extended episodes, whether negotiating, deceiving, cooperating, or competing, the quality of an action depends on events beyond the agent’s control. A negotiation opening is good or bad depending on how the counterparty responds; a clue is clever or foolish depending on whether the teammate interprets it correctly; a deceptive allocation is brilliant or wasteful depending on whether the opponent takes the bait. Traditional RL frameworks that assign immediate rewards cannot capture these dependencies.
We introduce delayed per-step reward attribution with eligibility gating: rather than forcing immediate reward assignment, we (1) determine at episode end who won and compute episode-level rewards, (2) gate steps that lack sufficient training signal, and (3) attribute rewards backward according to task-specific semantics. Paired with asynchronous rollout generation, curriculum-based opponent sampling, and multi-level stratified batch construction, this approach yields stable reinforcement learning in multi-agent settings. We evaluate on the MindGames Arena benchmark (MindGames Organizers, 2025), powered by the TextArena framework (Guertler et al., 2025), where our 8B-parameter model placed first in both the Open (unrestricted) and Efficient (8B) divisions, matching or beating teams that used GPT-5 (OpenAI, 2025a) in head-to-head play.
2 Related Work
Reinforcement learning for language models.
Reinforcement learning from human feedback (RLHF) is now the dominant method for aligning LLMs with human preferences (Ouyang et al., 2022), typically using proximal policy optimization (Schulman et al., 2017) or REINFORCE-style methods (Ahmadian et al., 2024). These methods operate in a single-turn, single-agent setting where reward is immediate and well-defined. Recent work extends RL to multi-turn settings: Shinn et al. (2023) introduce verbal self-reflection as a reinforcement signal, and Snell et al. (2024) study how scaling test-time compute can improve LLM reasoning, sometimes more effectively than scaling model parameters. DeepSeek-R1 (et al., 2025) demonstrates that pure RL can produce emergent chain-of-thought reasoning without supervised warm-up, but focuses on single-agent mathematical reasoning rather than multi-agent interaction.
Multi-agent reinforcement learning.
Training strategically interacting agents is a well-studied problem in multi-agent RL (Zhang et al., 2021). Centralized training with decentralized execution (CTDE) methods such as MAPPO (Yu et al., 2022) and QMIX (Rashid et al., 2020) address credit assignment in cooperative settings, but assume access to shared state information unavailable in natural language games. The credit assignment problem (determining which agent’s action caused a collective outcome) is closely related to our work, though we operate at the level of per-step reward attribution within a single agent’s trajectory rather than across agents.
LLM agents for games and strategic reasoning.
TextArena (Guertler et al., 2025) provides a framework for evaluating LLMs in text-based multi-agent games, and the MindGames Arena (MindGames Organizers, 2025) builds on it to benchmark strategic reasoning across cooperative, competitive, and mixed-motive settings. SPIN-Bench (Yao et al., 2025) evaluates strategic planning, interaction, and negotiation capabilities. Prior work on game-playing LLMs has largely focused on prompting strategies (Gandhi et al., 2023) or few-shot evaluation (Akata et al., 2023) rather than training agents through RL in game environments. Axelrod’s foundational work on the evolution of cooperation (Axelrod and Hamilton, 1981) and Blotto games (Roberson, 2006) provide the game-theoretic basis for two of our evaluation environments.
Curriculum learning and opponent diversity.
Self-play and population-based training (Jaderberg et al., 2019) work well for training game-playing agents but typically operate in fixed-action-space games. Our curriculum approach, which gradually introduces stronger opponents while retaining weaker ones, draws on ideas from prioritized experience replay (Schaul et al., 2016) and automated curriculum learning (Portelas et al., 2020), adapted to the LLM setting where opponent diversity shapes the distribution of natural language strategies encountered during training.
3 Game Environments
MindGames Arena (MindGames Organizers, 2025) evaluates whether language model agents can reason strategically, coordinate with partners, and adapt to opponents while communicating in natural language. The Generalization Track includes two divisions:
- Efficient: Open-source models with at most 8 billion parameters.
- Open: Any models without constraints on size or cost.
Agents play many matches against varied opponents, with results aggregated using TrueSkill (Herbrich et al., 2006) to produce reliable rankings.
The benchmark includes three games that span cooperative, competitive, and mixed-motive dynamics:
Codenames.
A cooperative 2v2 word-association game. A 5×5 grid contains 25 words: 9 Red team words, 8 Blue team words, 7 neutral words, and 1 assassin. Each team has a spymaster (who sees word assignments) and an operative (who only sees revealed words). The spymaster gives a one-word clue plus a number N indicating how many team words relate to that clue (format: [clue N]). The operative then guesses up to N+1 words sequentially (format: [word] or [pass]). Guessing the assassin causes instant loss; guessing opponent or neutral words ends the turn. The first team to reveal all their words wins; if the 80-turn limit is reached, the team with more revealed words wins.
Colonel Blotto.
A two-player competitive resource allocation game. Each round, both players simultaneously allocate exactly 20 units across 3 battlefields (format: [A5 B10 C5]). The player committing more units to a field wins that field; winning the majority of fields wins the round. The match continues for up to 9 rounds, with early victory possible upon securing a majority (5+). The player winning more rounds wins the match.
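The round rule above can be sketched in a few lines. This is an illustrative reading of the stated rules, not the TextArena implementation: the function names are hypothetical, and returning a tie when neither player takes two of the three fields is an assumption, since the text does not say how such rounds are scored.

```python
import re

FIELDS = ("A", "B", "C")
BUDGET = 20  # units per round, as stated in the rules above


def parse_allocation(text: str) -> dict[str, int]:
    """Parse an action such as '[A5 B10 C5]' into {'A': 5, 'B': 10, 'C': 5}."""
    match = re.fullmatch(r"\[A(\d+) B(\d+) C(\d+)\]", text.strip())
    if match is None:
        raise ValueError(f"malformed allocation: {text!r}")
    alloc = dict(zip(FIELDS, map(int, match.groups())))
    if sum(alloc.values()) != BUDGET:
        raise ValueError(f"allocation must sum to {BUDGET}: {text!r}")
    return alloc


def round_winner(alpha: str, beta: str) -> str:
    """Return 'alpha', 'beta', or 'tie' for one simultaneous round."""
    a, b = parse_allocation(alpha), parse_allocation(beta)
    alpha_fields = sum(a[f] > b[f] for f in FIELDS)
    beta_fields = sum(b[f] > a[f] for f in FIELDS)
    majority = len(FIELDS) // 2 + 1  # two of three battlefields
    if alpha_fields >= majority:
        return "alpha"
    if beta_fields >= majority:
        return "beta"
    return "tie"  # assumption: no majority means no round winner
```

For instance, `round_winner("[A3 B8 C9]", "[A0 B10 C10]")` compares the three fields one by one and awards the round to whichever side takes at least two of them.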
Three-Player Iterated Prisoner’s Dilemma.
A mixed-motive game extending the classic Iterated Prisoner’s Dilemma to three players over 5 rounds. Each round has two phases: (1) conversation, consisting of 3 free-chat turns where players negotiate openly, and (2) decision, where each player simultaneously submits actions toward both opponents (format: [1 cooperate] [2 defect]). The payoff matrix follows the standard structure: mutual cooperation yields 3 points each; mutual defection yields 1 point each; unilateral defection yields 5 points (defector) and 0 points (cooperator). The player(s) with the highest cumulative score after all rounds wins.
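To make the payoff structure concrete, the following sketch scores one decision phase. It assumes that a player's round score is the sum of the two pairwise payoffs against each opponent; the text gives the pairwise matrix but not the aggregation, so that part and the names `PAYOFF` and `score_round` are illustrative.

```python
from itertools import permutations

# Pairwise payoffs from the text: (my move, their move) -> my points.
PAYOFF = {
    ("cooperate", "cooperate"): 3,
    ("defect", "defect"): 1,
    ("defect", "cooperate"): 5,
    ("cooperate", "defect"): 0,
}


def score_round(actions: dict[int, dict[int, str]]) -> dict[int, int]:
    """actions[i][j] is player i's move toward player j.

    Returns the points each player earns this round, assuming pairwise
    payoffs are summed over both opponents.
    """
    scores = {player: 0 for player in actions}
    for i, j in permutations(actions, 2):
        scores[i] += PAYOFF[(actions[i][j], actions[j][i])]
    return scores


# Players 1 and 2 cooperate with each other and both defect against player 3,
# who cooperates with everyone.
example = {
    1: {2: "cooperate", 3: "defect"},
    2: {1: "cooperate", 3: "defect"},
    3: {1: "cooperate", 2: "cooperate"},
}
round_scores = score_round(example)
```

Under this reading, a two-player coalition against a fully cooperative third player earns 3 + 5 points each while the third player earns nothing, which is the kind of outcome the conversation phase is used to negotiate.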
4 Challenges in Agentic Workflows
Standard reinforcement learning assumes a convenient fiction: that each action can be evaluated independently, with rewards fully capturing the quality of that action at the moment it is taken. This assumption underlies everything from temporal difference updates in Q-learning (Watkins and Dayan, 1992) to per-step advantage estimates in policy gradient methods (Schulman et al., 2017). In single-agent, fully-observable environments with dense rewards, this fiction holds reasonably well.
Agentic workflows break this assumption. When language model agents interact over extended episodes, the quality of an action becomes entangled with events that have not yet occurred, with actions taken by other agents, and with information that may never be revealed.
Before presenting our solution, we describe eight specific challenges that motivated our design. These challenges extend well beyond MindGames (MindGames Organizers, 2025); they arise in any agentic workflow where action outcomes depend on future events, other agents, or information outside the actor’s control. We group them into three themes: temporal entanglement (Challenges 1, 2), where action quality depends on future events; structural asymmetry (Challenges 3, 4, 5), where position, opponent skill, or external failures create unequal learning signals; and training logistics (Challenges 6, 7, 8), where variable episode structure and heterogeneous inference demands complicate batch construction and throughput.
4.1 Challenge 1: Lose to Win
Consider the following Colonel Blotto exchange:
| Round | Alpha | Beta | Result | Post-Round Score (Alpha vs Beta) |
|---|---|---|---|---|
| 7 | [A3 B8 C9] | [A0 B10 C10] | Beta wins | 2 vs 4 |
| 8 | [A1 B11 C8] | [A0 B10 C10] | Alpha wins | 3 vs 4 |
| | Alpha observes: “Beta repeats the same allocation. I’ll exploit this pattern.” | | | |
| 9 | [A1 B11 C8] | [A2 B12 C6] | Beta wins | 3 vs 5 |
| Match result | | | Beta wins | |
Beta’s Round 8 allocation, [A0 B10 C10], identical to Round 7, lost that round. Yet this move was strategically brilliant: it established a predictable pattern that Alpha learned to exploit. When Beta broke the pattern in Round 9, Alpha’s counter-strategy backfired.
Problem: Locally suboptimal actions may be globally optimal. A reward function that evaluates each step on its immediate outcome would penalize setup actions that sacrifice short-term performance for long-term success, discouraging strategies such as deception, baiting, or delayed gratification.
Implication: Rewards must incorporate final outcome and propagate backward to preceding actions. Intermediate results alone are insufficient to evaluate action quality.
4.2 Challenge 2: Interdependent Action Rewards
Consider this Codenames episode where the team words include “ocean” and “fish”:
Spymaster gives clue: [water 2]
Operative reasons: “Options: ocean, shower, blue, fish. With 2 guesses, I’ll try ocean and fish…”
Operative guesses: [ocean] → Correct! Team word revealed.
Operative reasons: “Ocean was right. Remaining: shower, blue, fish. Since ocean is blue, maybe ‘blue’ is the connection…”
Operative guesses: [blue] → Assassin! Game over.
Despite being correct, the first guess [ocean] should not receive maximum reward: the turn ended in catastrophe, and all guesses and the clue share responsibility.
Problem: Actions within a logical sequence are interdependent. Early actions depend on future actions for their true value; later actions inherit context from past actions; initiating actions depend on all subsequent responses they trigger. Per-step rewards that evaluate actions independently reinforce locally correct actions even when they contribute to global failure.
Implication: All actions within a logical group must be evaluated together, with rewards computed after observing the complete sequence and attributed based on collective outcome.
4.3 Challenge 3: Positional Bias
Consider the Three-Player Iterated Prisoner’s Dilemma conversation phase, where players speak in fixed order (1 → 2 → 3) before submitting simultaneous decisions:
- Player 1 (early position): Speaks first, can propose coalitions (e.g., “Let’s both defect against Player 3”), and observes reactions before the decision phase.
- Player 2 (middle position): Can respond to prior proposals and add their own, receiving partial feedback.
- Player 3 (late position): Speaks last. Any proposals they make are never discussed before decisions are locked in.
Problem: Turn order confers asymmetric strategic affordances. Early actors set agendas and observe responses; late actors must interpret prior commitments without any chance to clarify. An agent trained mostly from one position learns position-specific strategies that do not transfer to other positions.
Implication: Training must expose the agent to all positions uniformly to learn position-agnostic strategies. Skewed sampling produces brittle policies that break under position reassignment.
4.4 Challenge 4: Opponent and Teammate Diversity
A model that has not yet learned game rules and action formats will produce invalid outputs. If this untrained model immediately faces frontier opponents like GPT-5 (OpenAI, 2025a), every episode ends in rapid defeat before the model can observe what correct play looks like. The training signal becomes dominated by format penalties rather than strategic learning.
Conversely, training exclusively against a narrow set of opponents or teammates produces a model that exploits their specific patterns but fails against diverse real-world agents. In cooperative games like Codenames, a spymaster trained only with one type of operative learns to give clues tailored to that partner’s interpretation style, failing when paired with different teammates at evaluation time.
Problem: The training distribution of opponents and teammates must be designed along two axes. First, skill progression: opponents that are too strong yield no positive signal for a novice agent, while opponents that are too weak fail to challenge an improving one. Second, diversity: training against a homogeneous set of agents produces brittle policies that overfit to specific play styles instead of learning strategies that generalize.
Implication: Agent sampling must provide both curriculum learning and diversity. Training should start with opponents that let the model learn basic rules and formats, then gradually introduce stronger and more varied opponents as skill improves. Earlier opponents must remain in the sampling pool throughout training, both to prevent catastrophic forgetting and to ensure the model continues to beat weaker opponents while learning to compete with stronger ones.
4.5 Challenge 5: Missing Training Signal
Consider two failure scenarios:
Colonel Blotto.
Player Alpha submits a valid allocation [A5 B10 C5], but opponent Beta submits [A100 B0 C0] (invalid: sum exceeds 20). The match terminates immediately. There was no battlefield comparison, no round outcome. The action was valid, but there is no signal about whether it was strategically sound.
Codenames.
The spymaster gives a valid clue [water 2]. The operative responds with “My answer is ocean” instead of the required format [ocean]. The turn terminates due to the parsing error. The clue has no guesses to evaluate, making it impossible to determine its quality.
Problem: Some valid actions lack observable outcomes due to external failures (opponent errors, parsing failures, early termination). Assigning arbitrary rewards to these steps introduces noise; assigning zero rewards biases against exploratory actions. Neither approach provides a meaningful learning signal.
Implication: Steps without sufficient signal must be identified and excluded from training. Only actions with observable outcomes should contribute to gradient updates.
4.6 Challenge 6: Variable Episode Length
Agentic episodes exhibit high variance in length. A game may terminate early through decisive victory, end abruptly due to an invalid action, extend through prolonged negotiation toward a draw, or reach a predefined step limit. In Codenames, hitting the assassin terminates immediately, while revealing all words through careful play may take substantially longer. Table 1 illustrates this variability across our training data.
Table 1: Steps per player per episode in the training data, by environment.

| Environment | Avg Steps Per Player | Std | Games | Min | Max |
|---|---|---|---|---|---|
| Codenames | 5.18 | 4.30 | 56 | 1 | 16 |
| Colonel Blotto | 6.48 | 1.61 | 426 | 1 | 9 |
| Three-Player IPD | 9.77 | 1.18 | 262 | 2 | 10 |
This variability compounds with eligibility filtering: not all steps in an episode qualify for training. An episode with 15 steps may yield only 8 eligible training samples after removing invalid actions and steps that lack signal. Rollout generation must collect episodes into an episodes bank, accumulating eligible steps until enough samples are available for a training batch.
Problem: Standard reinforcement learning pipelines assume fixed or predictable episode lengths for batch construction. Variable-length episodes with dynamic eligibility create uneven data flow: some rollouts produce many training samples, others produce few or none. Synchronous collection idles compute while waiting for long episodes; fixed batch sizes waste capacity or train on stale data.
Implication: Rollout generation must run asynchronously across parallel workers and respect the actual number of trainable steps each episode provides when assembling training batches.
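A minimal sketch of the episodes bank described above, assuming each step already carries an eligibility flag. The class and field names (`EpisodesBank`, `Step.eligible`, `drain`) are hypothetical; the point is only that readiness is measured in eligible steps, not in episodes.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    reward: float
    eligible: bool  # set by a hypothetical eligibility filter


@dataclass
class EpisodesBank:
    batch_size: int  # number of eligible steps needed for one training batch
    _episodes: list[list[Step]] = field(default_factory=list)

    def add(self, episode: list[Step]) -> None:
        """Store only the eligible steps; skip episodes that contribute none."""
        trainable = [step for step in episode if step.eligible]
        if trainable:
            self._episodes.append(trainable)

    def eligible_count(self) -> int:
        return sum(len(episode) for episode in self._episodes)

    def ready(self) -> bool:
        """True once enough eligible steps have accumulated."""
        return self.eligible_count() >= self.batch_size

    def drain(self) -> list[list[Step]]:
        """Hand all banked episodes to the batch builder and reset the bank."""
        episodes, self._episodes = self._episodes, []
        return episodes
```

Rollout workers would call `add` whenever an episode finishes, and the trainer would poll `ready` before building a batch, so a 15-step episode that yields 8 eligible steps counts as 8 toward the batch.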
4.7 Challenge 7: Multi-Dimensional Batch Balancing
As Table 1 illustrates, different games produce episodes of vastly different lengths. When training a single model across multiple games, unbalanced sampling causes the model to see disproportionately more steps from longer-episode games, biasing learned policies toward those environments.
Even within a single game, episode length varies substantially (note the standard deviations in Table 1). An episodes bank containing a mix of long and short episodes may have most of its steps concentrated in a few long trajectories. Sampling uniformly by step would draw disproportionately from these long episodes, reducing diversity: the model repeatedly sees correlated steps from the same trajectory rather than learning from varied situations across many episodes.
Reward distributions are also skewed. Most steps cluster around average rewards, while highly positive (excellent moves) and highly negative (critical errors) steps are rarer. Uniform sampling underrepresents these extremes, yet the model must see failures to avoid them and successes to reinforce them.
Problem: Uniform step sampling produces imbalanced coverage across three dimensions: games, episodes, and reward outcomes. Long-episode games dominate gradients; long individual episodes reduce batch diversity; skewed reward distributions cause the model to underlearn from rare but important successes and failures.
Implication: Batch construction must balance across games, episodes, and reward bins. Steps should be sampled for even game representation, broad episode coverage, and stratified reward sampling that covers bad, average, and good actions proportionally.
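One way to realize this three-level balancing is nested round-robin: alternate games, then reward bins within a game, then episodes within a bin. The sketch below is an assumption-laden illustration, not the batch builder used in this work: the bin thresholds, the equal weighting of bins, and the step dictionary keys (`game`, `episode`, `reward`) are all hypothetical.

```python
import random
from collections import defaultdict


def reward_bin(reward: float, low: float = -0.3, high: float = 0.3) -> str:
    """Three illustrative bins; the thresholds are assumptions."""
    if reward < low:
        return "bad"
    if reward > high:
        return "good"
    return "average"


def round_robin(queues):
    """Merge lists by taking one item from each non-empty list in turn."""
    queues = [list(q) for q in queues if q]
    merged = []
    while queues:
        for q in queues:
            merged.append(q.pop(0))
        queues = [q for q in queues if q]
    return merged


def stratified_batch(steps, batch_size, seed=0):
    """steps: list of dicts with 'game', 'episode', and 'reward' keys."""
    rng = random.Random(seed)
    # game -> reward bin -> episode -> list of steps
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for step in steps:
        tree[step["game"]][reward_bin(step["reward"])][step["episode"]].append(step)
    per_game = []
    for game in sorted(tree):
        per_bin = []
        for bin_name in sorted(tree[game]):
            episodes = list(tree[game][bin_name].values())
            for episode_steps in episodes:
                rng.shuffle(episode_steps)
            rng.shuffle(episodes)
            per_bin.append(round_robin(episodes))  # broad episode coverage
        per_game.append(round_robin(per_bin))      # bad, average, good in turn
    return round_robin(per_game)[:batch_size]      # games in turn
```

With this ordering, a game that contributes ten times as many steps as another still fills only its share of the batch, and within each game the first picks come from distinct episodes and distinct reward bins.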
4.8 Challenge 8: Heterogeneous Inference Demands
Different game situations demand vastly different amounts of reasoning. A spymaster crafting a clue that connects multiple words while avoiding the assassin requires extensive chain-of-thought deliberation, while an operative selecting a single word from a short list needs far less reasoning. Similarly, negotiating a coalition in the Prisoner’s Dilemma requires more deliberation than submitting a binary cooperate/defect decision. These differences appear both across games and across roles within the same game.
When multiple workers run episodes concurrently (as required by Challenge 4.6), these heterogeneous demands create synchronization bottlenecks. Standard inference pipelines process requests in synchronized batches: all requests in a batch must complete before any results are returned. A worker generating a short response cannot proceed while another worker in the same batch deliberates extensively.
Problem: Synchronous batch inference creates idle time proportional to the variance in generation length. If $n$ workers submit requests with generation times $t_1, \dots, t_n$, synchronous batching forces all workers to wait $\max_i t_i$ before any can proceed, wasting $\sum_{i=1}^{n} (\max_j t_j - t_i)$ worker-seconds per inference round. With high variance across games and roles, this overhead compounds multiplicatively across episodes, dominating training time and reducing throughput.
Implication: The inference engine must support asynchronous request handling with continuous batching: requests should be processed as they arrive, with results returned upon individual completion rather than held for batch synchronization. Each worker must be able to proceed independently so that episodes advance at their natural pace.
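The cost of synchronization can be illustrated with a small calculation. It assumes that, under synchronous batching, each worker idles for the gap between its own generation time and the slowest request in the batch; the function name and the example timings are hypothetical.

```python
def wasted_worker_seconds(generation_times: list[float]) -> float:
    """Idle time summed over workers when all wait for the slowest request."""
    slowest = max(generation_times)
    return sum(slowest - t for t in generation_times)


# Two short operative guesses batched with one long spymaster deliberation
# (timings in seconds, made up for illustration).
idle = wasted_worker_seconds([2.0, 3.0, 20.0])
```

In this made-up round the two fast workers idle for 18 and 17 seconds, so 35 of the 60 worker-seconds are spent waiting. With per-request completion, each worker would instead continue its episode as soon as its own response arrives.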
4.9 Summary: A Unified Problem
These eight challenges are not independent obstacles to be addressed in isolation; they form a coupled problem stemming from the mismatch between traditional RL assumptions and the realities of agentic interaction.
The temporal entanglement challenges (Lose to Win, Interdependent Rewards) show that reward computation must be delayed: we cannot assign meaningful rewards until we observe how actions interact with future events. The structural asymmetry challenges (Positional Bias, Opponent/Teammate Diversity, Missing Training Signal) show that not all training configurations are equal: some lack valid signal entirely, others encode position-specific strategies that do not generalize, and still others arise from homogeneous opponent distributions that produce brittle policies. The logistical challenges (Variable Episode Length, Multi-Dimensional Balancing, Heterogeneous Inference) show that episode structure and inference demands are unpredictable: training pipelines must tolerate high variance in episode length, step eligibility, reward distribution, and per-step generation time.
Attempting to solve any single challenge in isolation creates new problems. Delaying rewards without filtering steps assigns arbitrary values to undefined outcomes. Filtering steps without balanced sampling biases the model toward easy cases. Balanced sampling without proper reward attribution trains on noisy gradients. Asynchronous rollout generation without continuous-batching inference introduces synchronization bottlenecks. The challenges compound, and no piecemeal fix suffices; the system must address all three themes jointly.
5 Methodology
We address the challenges outlined in Section 4 through an episode lifecycle design that separates action validation during execution from reward computation after episode completion. The central idea is that reward assignment must be delayed until sufficient information is available, and steps that lack valid training signal must be filtered from the training batch rather than assigned arbitrary rewards.
Our approach comprises two components: (1) action validation that enforces format and rule compliance during episode execution, and (2) a post-episode processing pipeline that computes rewards, gates eligibility, and attributes credit based on complete trajectory information. We describe each in turn.
5.1 Action Validation During Execution
The TextArena framework (Guertler et al., 2025) does not uniformly terminate episodes on invalid actions: some games continue with default behaviors, others ignore malformed outputs, and error handling varies across environments. There is also no unified interface for determining whether an action was incorrect: each environment reports errors differently, if at all. This inconsistency complicates reward attribution because we cannot reliably determine which steps failed or why.
To address this, we introduce an Action Validator abstraction for each environment. During episode execution, every action passes through the validator, which performs three checks:
- Reasoning template compliance: Does the output follow the expected reasoning structure? (e.g., thinking inside designated tags before the final answer)
- Format compliance: Does the output match the expected action pattern? (e.g., [word N] for spymaster clues, [A5 B10 C5] for Blotto allocations)
- Game-rule validity: Does the action satisfy game constraints? (e.g., allocation sum in Blotto, clue not being a substring of board words in Codenames, guess being an existing word on the board)
Invalid actions terminate the episode immediately with a recorded error type. This uniform validation serves several downstream purposes: it provides the metadata needed to identify Missing Training Signal (Challenge 4.5) by detecting when valid actions lack outcomes due to external failures; it establishes clear episode boundaries for delayed reward attribution (Challenges 4.1 and 4.2); and it lets us reliably determine who won, lost, or caused premature termination. The Players Builder (Section 5.2.1) relies on this metadata to extract episode-level results, the Steps Filter uses it to gate steps that depend on failed actions, and the Reward Assigner applies appropriate penalties to invalid outputs.
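As an illustration of the three checks, here is a hedged sketch of a validator for Colonel Blotto. The `<think>` tag, the error-type strings, and the `Validation` type are assumptions made for the example; the text specifies only that reasoning appears inside designated tags and that the error type is recorded.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed reasoning template: a <think>...</think> block, then the answer.
THINK_RE = re.compile(r"\A\s*<think>.*?</think>\s*(?P<answer>.*)\Z", re.DOTALL)
ACTION_RE = re.compile(r"\[A(\d+) B(\d+) C(\d+)\]")


@dataclass
class Validation:
    ok: bool
    error_type: Optional[str] = None  # recorded when the episode is terminated


def validate_blotto(output: str, budget: int = 20) -> Validation:
    # 1. Reasoning template: thinking must precede the final answer.
    template = THINK_RE.match(output)
    if template is None:
        return Validation(False, "template_error")
    # 2. Format: the answer must contain exactly one allocation pattern.
    actions = ACTION_RE.findall(template.group("answer"))
    if len(actions) != 1:
        return Validation(False, "format_error")
    # 3. Game rule: the allocation must spend exactly the budget.
    if sum(map(int, actions[0])) != budget:
        return Validation(False, "rule_violation")
    return Validation(True)
```

Distinguishing the error type at validation time is what later allows the pipeline to tell a parsing failure from a rule violation, and to tell which player caused the termination.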
5.2 Post-Episode Processing Pipeline
After episode termination, each completed episode passes through a three-stage processing pipeline (Figure 1) that transforms raw trajectories into training-ready data with per-step reward attribution. The Players Builder extracts episode outcomes for delayed evaluation; the Steps Filter gates steps that lack sufficient signal (Challenge 4.5); and the Reward Assigner computes per-step rewards with backward propagation (Challenges 4.1 and 4.2).
5.2.1 Stage 1: Players Builder
The Players Builder extracts the episode outcomes that subsequent stages require. Using the action validation metadata, it reconstructs the full episode result: not just win/loss/draw, but also how the episode ended (natural conclusion, parsing failure, or game-rule violation).
For each player, we determine:
- Outcome type: Win, loss, draw, or termination due to error (parsing failure vs. invalid action)
- Responsibility: Did this player cause the termination, or was it caused by an opponent or teammate?
- Episode-level rewards: Episode rewards returned by TextArena (Guertler et al., 2025) are nearly boolean, identifying win or loss but providing little granularity. We compute meaningful rewards that better measure model performance. For example, in Colonel Blotto with 9 rounds and majority-win termination, we normalize by rounds won: 1.0 means winning 5 rounds, 0.0 means zero rounds won, with intermediate values reflecting partial success. Rewards are further adjusted based on termination cause: a player whose opponent made an invalid move receives a different reward than one who won through legitimate play.
This per-player outcome analysis matters most in multi-agent games where teammates share credit and opponents may cause early termination. The episode-level rewards serve logging and metrics; the actual training signal comes from per-step rewards computed by the Reward Assigner (Section 5.2.3). The structured metadata (outcome type, responsibility, episode reward) feeds into subsequent pipeline stages for step-level processing. Complete formulas appear in Appendix A.
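The per-player record and the Colonel Blotto normalization might look as follows. This is a sketch under assumptions: the type names are hypothetical, the interpolation between 0.0 and 1.0 is taken to be linear, and the adjustment for termination cause is omitted because its form is not given in this section.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    WIN = "win"
    LOSS = "loss"
    DRAW = "draw"
    PARSE_FAILURE = "parse_failure"
    INVALID_ACTION = "invalid_action"


@dataclass
class PlayerResult:
    outcome: Outcome
    caused_termination: bool  # responsibility for an early end
    episode_reward: float     # used for logging and metrics


def blotto_episode_reward(rounds_won: int, rounds_to_win: int = 5) -> float:
    """1.0 at five rounds won, 0.0 at zero, assumed linear in between."""
    return min(max(rounds_won, 0), rounds_to_win) / rounds_to_win


# A player who lost the match after winning three rounds.
result = PlayerResult(Outcome.LOSS, False, blotto_episode_reward(3))
```

Compared with a nearly boolean win/loss value, this normalized reward separates a narrow 3-to-5 loss from a 0-to-5 loss in the logged metrics.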
5.2.2 Stage 2: Steps Filter (Eligibility Gating)
The Steps Filter determines which steps contain valid training signal, targeting the Missing Training Signal problem (Challenge 4.5). A step is marked ineligible (gated) if:
1. It belongs to a non-trainable player (opponent).
2. Its reward depends on future steps that are invalid or missing.
3. It represents an incomplete interaction unit.
The core principle is: no observable outcome → no training signal → step is gated. Rather than assigning arbitrary rewards to steps with undefined outcomes, we simply exclude them from the training batch. Two concrete examples illustrate this:
Example 1: Opponent failure in Colonel Blotto.
Suppose our trainable player (Alpha) submits a valid allocation [A5 B10 C5], but the opponent (Beta) submits an invalid allocation [A100 B0 C0] (sum exceeds 20). The match terminates immediately due to the error from Beta. The valid action from Alpha has no outcome to learn from: there was no battlefield comparison, no round winner, no signal about whether the strategy was good or bad. The step is gated. However, the invalid action from Beta is trained on with a penalty reward, teaching error avoidance.
Example 2: Invalid guess format in Codenames.
The spymaster gives a valid clue [water 2]. The operative responds with “My answer is ocean” instead of the required format [ocean]. The turn terminates due to the format error. The clue has no guesses to evaluate: we cannot determine whether “water 2” was a good or bad clue because no valid guesses followed. The clue step is gated. The malformed response from the operative is trained on with a penalty to teach format compliance.
An important distinction: error steps remain eligible. If an action fails validation, we train on it with a penalty reward. The gating applies only to steps where no outcome exists to evaluate, for instance when a parsing failure prevents any valid response. Bad outcomes (e.g., hitting the assassin) are not gated; they produce observable results that inform reward computation. In such cases, the Reward Assigner (Section 5.2.3) handles blame attribution and may propagate penalties backward to earlier steps such as the originating clue.
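The gating rules and the error-step exception can be condensed into a single predicate. The sketch below assumes each step records whether its player is trainable, whether the action passed validation, and whether the information its reward depends on was observed; these field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Step:
    player: str
    trainable: bool    # False for opponent-controlled players
    valid: bool        # result of action validation
    has_outcome: bool  # True if the dependent outcome was observed


def is_eligible(step: Step) -> bool:
    """Return True if the step should enter the training batch."""
    if not step.trainable:
        return False  # rule 1: non-trainable player
    if not step.valid:
        return True   # error steps stay eligible and receive a penalty
    return step.has_outcome  # rules 2 and 3: gate valid steps with no outcome


# Codenames example from the text: a valid clue followed by a malformed guess.
clue = Step("spymaster", trainable=True, valid=True, has_outcome=False)
malformed_guess = Step("operative", trainable=True, valid=False, has_outcome=False)
```

Here `is_eligible(clue)` is false because no valid guess followed the clue, while `is_eligible(malformed_guess)` is true so that the format error can be penalized.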
5.2.3 Stage 3: Reward Assigner (Delayed Attribution)
TextArena (Guertler et al., 2025) does not provide meaningful per-step rewards; it returns only episode-level outcomes (e.g., one value for a win and another for a loss). To obtain fine-grained credit assignment and address the temporal entanglement challenges (Lose to Win, Challenge 4.1; Interdependent Action Rewards, Challenge 4.2), we define a Reward Assigner for each environment. The Reward Assigner computes per-step rewards from full trajectory information following three principles:
Backward propagation.
The reward for a step may depend on what happens after it. Actions within a turn form interdependent groups where later outcomes affect earlier rewards. Consider two Codenames scenarios:
Success case: Spymaster gives clue [animal 2]. Operative guesses [dog] (correct), then [cat] (correct), then [pass].
All guesses receive positive rewards; the clue receives credit based on guess accuracy.
Assassin case: Spymaster gives clue [danger 2]. Operative guesses [sword] (correct), then [assassin] (assassin hit).
The assassin guess receives a large penalty. This penalty poisons the entire group: the correct guess [sword] receives reduced reward (+0.15 instead of +1.0), and the clue is blamed for leading to the assassin.
This backward propagation ensures that all dependent actions share responsibility for collective outcomes. The spymaster cannot escape blame for a clue that led the operative to the assassin, even if individual guesses were locally reasonable.
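A sketch of backward propagation for one Codenames turn follows. Only the +1.0 and +0.15 values come from the text; the assassin penalty and clue blame are placeholders, the clue credit is assumed to be the fraction of correct guesses, and other guess results (neutral words, opponent words, passes) are left out for brevity.

```python
CORRECT = 1.0            # from the text
POISONED_CORRECT = 0.15  # from the text
ASSASSIN_PENALTY = -1.0  # placeholder: the actual value is not given here
CLUE_BLAME = -1.0        # placeholder: the actual value is not given here


def assign_turn_rewards(guess_results: list[str]) -> tuple[float, list[float]]:
    """guess_results holds 'correct' or 'assassin' in play order.

    Returns (clue_reward, per_guess_rewards), computed only after the whole
    turn has been observed.
    """
    if not guess_results:
        raise ValueError("a clue without guesses has no outcome and is gated")
    poisoned = "assassin" in guess_results
    guess_rewards = []
    for result in guess_results:
        if result == "assassin":
            guess_rewards.append(ASSASSIN_PENALTY)
        elif result == "correct":
            # An assassin hit later in the turn reduces earlier correct guesses.
            guess_rewards.append(POISONED_CORRECT if poisoned else CORRECT)
        else:
            raise ValueError(f"unsupported result in this sketch: {result!r}")
    if poisoned:
        clue_reward = CLUE_BLAME
    else:
        # Assumed form of "credit based on guess accuracy".
        clue_reward = guess_results.count("correct") / len(guess_results)
    return clue_reward, guess_rewards
```

Because the function needs the complete list of results, the reward for the first guess cannot be known until the turn ends, which is the dependency that motivates delaying attribution.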
Outcome modulation.
Per-step rewards are scaled by episode outcome, targeting the Lose to Win problem (Section 4.1): actions in winning games receive full credit regardless of intermediate round results, while the same actions in losing games receive reduced credit. The “losing” allocation that sets up a winning deception is credited appropriately.
Group-based attribution.
In Iterated Prisoner’s Dilemma, conversation turns and the subsequent decision form a “group.” The decision outcome determines the reward, which is then distributed to all group members. This addresses the Interdependent Rewards challenge (Section 4.2): all actions within a logical sequence are evaluated together based on collective outcome.
Complete reward formulas for each environment are provided in Appendix B.
5.3 Training System
The postprocessing pipeline requires high-throughput episode generation to collect enough training data. We built a training system comprising four subsystems (Figure 2).
5.3.1 vLLM Asynchronous Engine
The standard vLLM (Kwon et al., 2023) integration in TRL (von Werra et al., 2020) submits all samples in a single batch and blocks until the slowest completes. This triggers the heterogeneous inference bottleneck (Challenge 4.8): workers generating short responses sit idle while others deliberate.
Our integration sends each request individually through an API server, relying on vLLM’s continuous batching to aggregate requests internally while returning results as soon as each completes. No worker waits for another. To scale across multiple GPUs, we deploy data-parallel vLLM workers behind a coordinator that dispatches requests round-robin and returns responses to the correct caller. Weight synchronization after each training step updates all workers atomically without restarting the inference service.
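The dispatch pattern described above can be sketched with `asyncio`. This is an illustration only: `RoundRobinCoordinator`, `fake_worker`, and `run_demo` are hypothetical names, and the worker is a stand-in coroutine rather than a real call to an inference server. It shows each request being awaited independently while results are still returned to the caller that issued them.

```python
import asyncio
import itertools


class RoundRobinCoordinator:
    """Sends each request to the next worker in a fixed rotation."""

    def __init__(self, workers):
        self._workers = workers
        self._cycle = itertools.cycle(range(len(workers)))

    async def generate(self, prompt: str) -> tuple[int, str]:
        idx = next(self._cycle)
        # Each caller awaits only its own request; there is no batch-wide barrier.
        completion = await self._workers[idx](prompt)
        return idx, completion


async def fake_worker(prompt: str) -> str:
    # Stand-in for a request to an inference server (hypothetical).
    # Longer prompts take longer, mimicking heterogeneous generation times.
    await asyncio.sleep(0.001 * len(prompt))
    return prompt.upper()


async def run_demo(prompts):
    coordinator = RoundRobinCoordinator([fake_worker, fake_worker])
    tasks = [asyncio.create_task(coordinator.generate(p)) for p in prompts]
    # Tasks finish independently; gather returns results in caller order.
    return await asyncio.gather(*tasks)
```

In a real deployment each episode slot would consume its own result as soon as it arrives instead of gathering, which is what removes the idle time.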
5.3.2 Rollout Provider
The Rollout Provider maintains a pool of worker slots running episodes concurrently. Each slot encapsulates the full episode lifecycle: the action validator checks format and rule compliance during execution (Challenge 4.5), and the post-episode processing pipeline computes delayed rewards with backward attribution once the episode terminates (Challenges 4.1 and 4.2).
Three samplers configure each new episode. The Environment Sampler rotates through games round-robin so that no single game dominates gradients. The Position Sampler cycles uniformly through roles and positions within each game (Challenge 4.3), so the model learns position-agnostic strategies. The Opponent Sampler implements curriculum learning with maintained diversity (Challenge 4.4): weaker opponents early in training let the model learn rules and formats, while stronger and more varied opponents are unlocked at step thresholds; earlier opponents remain in the pool throughout training to preserve generalization across skill levels.
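The Environment and Position Samplers can be sketched as nested round-robin cycles. The class name and environment keys below are hypothetical, and the player-id lists are assumptions (four seats in Codenames, two in Colonel Blotto, three in the three-player Iterated Prisoner's Dilemma).

```python
import itertools

# Player ids per game; an illustrative assumption, not the exact configuration.
ROLES = {
    "codenames": [0, 1, 2, 3],
    "colonel_blotto": [0, 1],
    "iterated_prisoners_dilemma": [0, 1, 2],
}


class EpisodeConfigSampler:
    """Rotates through environments, and through positions within each one."""

    def __init__(self, roles):
        self._envs = itertools.cycle(roles)
        self._positions = {env: itertools.cycle(ids) for env, ids in roles.items()}

    def next_config(self):
        env = next(self._envs)
        # Each environment keeps its own position cursor.
        return env, next(self._positions[env])
```

Keeping a separate cursor per environment means every role is visited equally often even when games have different numbers of players.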
5.3.3 Rollout Builder & Dataloader
Completed episodes flow into a rolling episodes bank, decoupling episode generation from batch construction (Challenge 4.6). Rather than waiting for fixed-size rollouts, we accumulate episodes asynchronously and build training batches from whatever eligible steps are available.
To balance batches across multiple dimensions (Challenge 4.7), we over-collect episodes and subsample. Only steps passing the eligibility filter are considered. Steps are bucketed into reward quantiles, with coverage enforced across bins so that the model trains on successes, near-misses, and failures in proportion. Round-robin interleaving across environments prevents longer-episode games from dominating the batch.
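One way to realize this stratified subsampling is sketched below. It is a simplified illustration under assumptions: the `build_batch` name, the dictionary layout of a step, the use of three rank-based quantile bins, and computing the bins separately per environment are all choices made for the example.

```python
import random
from collections import defaultdict


def build_batch(steps, batch_size, num_bins=3, seed=0):
    """steps: dicts with 'env', 'reward', 'eligible'. Returns a balanced subsample."""
    rng = random.Random(seed)

    # Only steps that passed the eligibility filter are considered.
    by_env = defaultdict(list)
    for s in steps:
        if s["eligible"]:
            by_env[s["env"]].append(s)

    # Bucket each environment's steps into reward quantile bins by rank.
    bins = defaultdict(list)
    for env, env_steps in by_env.items():
        ranked = sorted(env_steps, key=lambda s: s["reward"])
        for rank, s in enumerate(ranked):
            bins[(env, rank * num_bins // len(ranked))].append(s)
    for bucket in bins.values():
        rng.shuffle(bucket)

    # Round-robin over (environment, bin) buckets until the batch is full.
    batch, keys = [], sorted(bins)
    while len(batch) < batch_size and any(bins[k] for k in keys):
        for k in keys:
            if bins[k] and len(batch) < batch_size:
                batch.append(bins[k].pop())
    return batch
```

Because buckets are visited in turn, an environment that produced many more steps cannot crowd out the others until the smaller buckets are exhausted.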
5.3.4 Reinforcement Learning Trainer
Our trainer is built on TRL (von Werra et al., 2020) and uses a clipped policy gradient objective with the RLOO (Reinforce Leave-One-Out) baseline (Ahmadian et al., 2024). In standard single-turn LLM reinforcement learning, leave-one-out advantages are computed by grouping multiple completions of the same prompt: the baseline for each completion is the mean reward of the other completions answering the identical question. In our multi-step, multi-environment setting, no two steps share the same prompt. Instead, we group all steps by environment type: all Codenames steps form one group, all Colonel Blotto steps another, and all Iterated Prisoner’s Dilemma steps a third. Each step’s advantage is then computed relative to the mean reward of other steps from the same game, yielding meaningful comparisons despite prompt diversity. This per-environment normalization also prevents reward-scale differences across games from causing cross-task interference. Callbacks synchronize updated weights to the vLLM (Kwon et al., 2023) inference workers after each training step, closing the loop without service restarts.
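The per-environment leave-one-out baseline can be written in a few lines. This is a sketch of the grouping idea only, not the trainer's code; the function name is hypothetical, and using a zero baseline for a group with a single step is an assumption made here.

```python
from collections import defaultdict


def rloo_advantages(rewards, envs):
    """Leave-one-out advantages, with steps grouped by environment type."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r, e in zip(rewards, envs):
        sums[e] += r
        counts[e] += 1

    advantages = []
    for r, e in zip(rewards, envs):
        n = counts[e]
        # Baseline: mean reward of the other steps from the same environment.
        baseline = (sums[e] - r) / (n - 1) if n > 1 else 0.0
        advantages.append(r - baseline)
    return advantages
```

For example, three Codenames steps with rewards 1.0, 0.0, and 0.5 receive advantages 0.75, -0.75, and 0.0, independent of the reward scale of any other game in the batch.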
5.4 Training Configuration
We initialized from Qwen3-8B (Team, 2025), a strong open-source base model that fits the 8B-parameter constraint of the Efficient track. The opponent curriculum (Challenge 4.4) proceeded in two phases.
Phase 1 (Steps 0–149).
Training began exclusively against gpt-oss-120b (OpenAI, 2025b), OpenAI’s open-source 120B-parameter model. We configured this opponent with varied system prompts that induced different behavioral profiles (aggressive, cooperative, unpredictable, and analytical) to provide diversity within a single model family. This phase let the model learn game rules, action formats, and basic strategies without facing opponents too strong for a novice agent.
Phase 2 (Steps 150+).
Once the model showed competence, we introduced frontier models accessed via OpenRouter (OpenRouter, Inc., 2025). Table 2 shows the expanded opponent pool with sampling weights. Higher weights indicate more frequent sampling; lower weights (0.25) reserve expensive frontier models for periodic challenge while keeping training efficient. Phase 1 opponents remained in the pool throughout training to preserve generalization across skill levels.
| Model | Weight |
|---|---|
| x-ai/grok-4-fast | 1.00 |
| google/gemini-2.5-flash | 1.00 |
| google/gemini-2.5-pro | 0.25 |
| openai/gpt-5-mini | 1.00 |
| openai/gpt-5 | 0.25 |
| qwen/qwen3-235b-a22b-thinking-2507 | 1.00 |
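The two-phase curriculum amounts to weighted sampling from a pool that grows at step 150. The sketch below uses the Table 2 weights for Phase 2; the Phase 1 weights (1.0 per profile), the `gpt-oss-120b:<profile>` labels, and the function names are assumptions, since the text does not specify them.

```python
import random

PHASE2_START = 150  # Phase 2 begins at step 150.

# Phase 1: one model with four prompt-induced profiles.
# A weight of 1.0 per profile is an assumption; no Phase 1 weights are given.
PHASE1_POOL = {
    f"gpt-oss-120b:{profile}": 1.0
    for profile in ("aggressive", "cooperative", "unpredictable", "analytical")
}

# Phase 2 additions, with the sampling weights from Table 2.
PHASE2_POOL = {
    "x-ai/grok-4-fast": 1.00,
    "google/gemini-2.5-flash": 1.00,
    "google/gemini-2.5-pro": 0.25,
    "openai/gpt-5-mini": 1.00,
    "openai/gpt-5": 0.25,
    "qwen/qwen3-235b-a22b-thinking-2507": 1.00,
}


def opponent_pool(step):
    """Phase 1 opponents stay in the pool; Phase 2 opponents unlock at the threshold."""
    pool = dict(PHASE1_POOL)
    if step >= PHASE2_START:
        pool.update(PHASE2_POOL)
    return pool


def sample_opponent(step, rng):
    pool = opponent_pool(step)
    names = list(pool)
    return rng.choices(names, weights=[pool[n] for n in names], k=1)[0]
```

With these weights, a model at 0.25 is drawn one quarter as often as a model at 1.00.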
Hyperparameters.
We used 64 parallel workers for rollout generation with a maximum completion length of 12,288 tokens (covering both chain-of-thought reasoning and the final action). The global batch size was 768 steps. The learning rate was with cosine decay and warmup; the minimum learning rate was 10% of the peak value. We disabled the Kullback-Leibler divergence penalty (Cover and Thomas, 1991) (no reference model), letting the policy diverge freely from initialization. Training ran a single Proximal Policy Optimization (PPO) (Schulman et al., 2017) epoch per batch, following Kool et al. (2019).
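The schedule shape (warmup, then cosine decay to a floor of 10% of the peak) can be sketched as follows. The peak learning rate, warmup length, and total step count are left as parameters because they are not reproduced here, and the linear warmup shape is an assumption.

```python
import math


def lr_at(step, total_steps, peak_lr, warmup_steps, min_ratio=0.10):
    """Warmup followed by cosine decay to min_ratio * peak_lr."""
    if step < warmup_steps:
        # Linear warmup (assumed shape).
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    # Interpolate between the floor and the peak.
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)
```

The rate reaches the peak when warmup ends and settles at one tenth of the peak on the final step.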
5.5 Generation Parameter Tuning
Post-training analysis showed that win rates could be improved substantially by tuning generation parameters (temperature, top-p, top-k, min-p) per environment. We ran a hyperparameter sweep of 300 games per environment for each parameter combination against a mixed pool of opponents. Table 3 shows the configurations evaluated and their win rates.
| Temp | Top-p | Top-k / Min-p | Codenames W | Codenames W∗ | Col. Blotto W | Col. Blotto W∗ | Iterated Prisoner’s Dilemma W | Iterated Prisoner’s Dilemma W∗ |
|---|---|---|---|---|---|---|---|---|
| 1.0‡ | 1.0‡ | n/a‡ | 47 | 52 | 53 | 68 | 52 | 63 |
| 1.0 | 0.95 | 80 / 0.05 | 38 | 39 | 44 | 58 | 51 | 61 |
| 0.9 | 0.92 | 60 / n/a | 39 | 43 | 47 | 62 | 48 | 61 |
| 0.7⋆ | 0.80⋆ | 30 / 0.05⋆ | 47 | 54 | 44 | 59 | 54 | 65 |
| 0.8 | 1.0 | / 0.15 | 29 | 34 | 52 | 67 | 55 | 64 |
| 0.6† | 0.85† | 30 / n/a† | 49 | 51 | 48 | 62 | 60 | 74 |
| 0.35 | 0.70 | 20 / 0.03 | 40 | 42 | 51 | 69 | 61 | 70 |
No single configuration dominated across all environments, confirming the need for environment-specific tuning. Based on these results:
- Codenames: We selected⋆ temp=0.70, top-p=0.80, top-k=30, min-p=0.05, which achieved the best win rate excluding draws (54%).
- Colonel Blotto: We selected‡ the least restrictive settings (temp=1.0, top-p=1.0, no truncation), which achieved the best win rate including draws (53%) and the second-best win rate excluding draws (68%).
- Three-Player Iterated Prisoner’s Dilemma: We selected† temp=0.60, top-p=0.85, top-k=30, which achieved the best win rate excluding draws (74%).
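The two win-rate columns differ only in their denominator. A small sketch, assuming W counts every game played and W∗ counts only decided games (wins plus losses); the function name is hypothetical.

```python
def win_rates(wins, draws, losses):
    """Return (W, W_star) in percent: win rate including and excluding draws."""
    games = wins + draws + losses
    decided = wins + losses
    w = 100.0 * wins / games if games else 0.0
    w_star = 100.0 * wins / decided if decided else 0.0
    return w, w_star
```

Under this reading, W∗ is never lower than W, and the gap between them grows with the share of drawn games.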
Table 4 summarizes the final per-environment generation parameters used for evaluation. Default vLLM values are temperature 1.0, top-p 1.0, top-k disabled (consider all tokens), and min-p 0.0.
| Environment | Temperature | Top-p | Top-k | Min-p |
|---|---|---|---|---|
| Codenames | 0.70 | 0.80 | 30 | 0.05 |
| Colonel Blotto | 1.0† | 1.0† | | 0.0† |
| Three-Player Iterated Prisoner’s Dilemma | 0.60 | 0.85 | 30 | 0.0† |
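In code, these settings reduce to a per-environment lookup applied at generation time. The sketch below is a configuration fragment only: the environment keys and the `params_for` helper are hypothetical, while the field names follow vLLM's sampling parameter names (`temperature`, `top_p`, `top_k`, `min_p`). For Colonel Blotto, `top_k` is omitted so that the engine default (no truncation) applies.

```python
GENERATION_PARAMS = {
    "codenames": {"temperature": 0.70, "top_p": 0.80, "top_k": 30, "min_p": 0.05},
    # No top-k truncation: the key is omitted so the engine default is used.
    "colonel_blotto": {"temperature": 1.0, "top_p": 1.0, "min_p": 0.0},
    "iterated_prisoners_dilemma": {
        "temperature": 0.60, "top_p": 0.85, "top_k": 30, "min_p": 0.0,
    },
}


def params_for(env):
    """Return a copy so callers cannot mutate the shared defaults."""
    return dict(GENERATION_PARAMS[env])
```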
6 Results
We report results from MindGames Arena (MindGames Organizers, 2025), where top-qualifying teams competed head-to-head across the three game environments described in Section 3. The same 8-billion-parameter model placed first in both tracks (Open and Efficient) during Stage 1 qualification and held that position through the final Stage 2 evaluation.
| Rank | Team | TrueSkill | Games | Win Rate |
|---|---|---|---|---|
| 1 | In2AI (ours) | 38.0±1.8 | 116 | 81.0% |
| 2 | RLGaming | 37.1±1.1 | 291 | 73.5% |
| 3 | Odyssean | 34.2±1.4 | 177 | 72.3% |
| 4 | PsychSkull | 31.3±1.4 | 121 | 62.8% |
| 5 | Corleone | 28.6±1.3 | 117 | 49.6% |
| Rank | Team | TrueSkill | Games | Win Rate |
|---|---|---|---|---|
| 1 | In2AI (ours) | 34.2±1.3 | 362 | 87.0% |
| 2 | STARS | 26.8±1.1 | 337 | 36.2% |
| 3 | RLGaming | 25.8±1.1 | 569 | 28.5% |
| 4 | Corleone | 24.4±1.4 | 367 | 44.1% |
| 5 | Odyssean | 16.6±1.4 | 275 | 10.9% |
Table 5 shows the Open Track results, where teams could use any model without restrictions on size or cost. Despite competing against submissions built on frontier models such as GPT-5 (OpenAI, 2025a), our 8B-parameter model achieved the highest TrueSkill (Herbrich et al., 2006) rating and win rate. Table 6 presents the Efficient Track results, restricted to models with at most 8 billion parameters; here, our model led by over 7 TrueSkill points.
The win rates across both tracks indicate that careful reward attribution and eligibility gating, paired with balanced training, allow a smaller model to compete with and outperform substantially larger systems across varied opponents and game types within the MindGames Arena (MindGames Organizers, 2025) evaluation framework.
7 Conclusion
We presented an approach for training agentic large language models centered on delayed per-step reward attribution with eligibility gating. Standard reinforcement learning assumes each step can receive an immediate, well-defined reward; this assumption fails in agentic settings where action quality depends on future events beyond any single agent’s control.
We identified eight challenges in multi-agent strategic interaction, organized into three themes: temporal entanglement (Sections 4.1–4.2), structural asymmetry (Sections 4.3–4.5), and training logistics (Sections 4.6–4.8). We addressed them through a post-episode processing pipeline that delays reward computation until sufficient information is available, attributes rewards backward according to task-specific semantics, and gates steps with undefined outcomes from training. Paired with high-throughput asynchronous infrastructure (continuous-batching inference, curriculum-based opponent sampling, and balanced batch construction), the system trains efficiently at scale.
A single 8-billion-parameter open-source model trained with this approach placed first in both the Open (unrestricted) and Efficient (8B) tracks of MindGames Arena (MindGames Organizers, 2025), outperforming teams that used substantially larger proprietary systems including GPT-5 (OpenAI, 2025a) in head-to-head competition.
Limitations and Future Work.
Our approach requires environment-specific implementations of the Action Validator, Players Builder, Steps Filter, and Reward Assigner; while the underlying ideas generalize, adapting to new settings requires domain expertise. Broader validation across agentic tasks beyond the three game environments studied here is an important next step.
Broader Impact.
In our experience, the engineering challenges outlined in Section 4 were a bigger bottleneck than model size; careful reward attribution mattered more than scaling up parameters. This suggests that, at least in the settings studied here, systems-level engineering can matter more than model scale in agentic RL. We expect these ideas to extend to other settings where action outcomes depend on future events, such as multi-turn dialogues, collaborative problem-solving, and agentic code generation.
Reproducibility.
Our trained model weights are publicly available at https://huggingface.co/AlekseyKorshuk/mindgames-in2ai-submission. The training code and evaluation scripts are available at https://github.com/AlekseyKorshuk/mindgames-in2ai.
Acknowledgements
We thank the MindGames Arena (MindGames Organizers, 2025) organizers for creating the evaluation framework and for hosting the competition at NeurIPS 2025. We are grateful to the developers of TextArena (Guertler et al., 2025) for the game environments and infrastructure. We also thank the vLLM (Kwon et al., 2023) and TRL (von Werra et al., 2020) teams, whose inference engine and RL framework form the backbone of our training system.
References
- Back to basics: revisiting REINFORCE style optimization for learning from human feedback in LLMs. arXiv:2402.14740.
- Playing repeated games with large language models. arXiv:2305.16867.
- The evolution of cooperation. Science 211 (4489), pp. 1390–1396.
- Elements of information theory. 1st edition, Wiley, New York, NY, USA.
- DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. arXiv:2501.12948.
- Strategic reasoning with language models. arXiv:2305.19165.
- TextArena. arXiv:2504.11442.
- TrueSkill™: a Bayesian skill rating system. In Advances in Neural Information Processing Systems, B. Schölkopf, J. Platt, and T. Hoffman (Eds.), Vol. 19.
- Human-level performance in 3D multiplayer games with population-based reinforcement learning. Science 364 (6443), pp. 859–865.
- Buy 4 REINFORCE samples, get a baseline for free!. arXiv:1904.06040.
- Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
- MindGames Arena: NeurIPS 2025 competition. https://www.mindgamesarena.com
- GPT-5 system card. https://openai.com/index/gpt-5-system-card
- gpt-oss-120b & gpt-oss-20b model card. arXiv:2508.10925.
- OpenRouter: the unified interface for LLMs. https://openrouter.ai
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, pp. 27730–27744.
- Generative agents: interactive simulacra of human behavior. arXiv:2304.03442.
- Automatic curriculum learning for deep RL: a short survey. arXiv:2003.04664.
- Monotonic value function factorisation for deep multi-agent reinforcement learning. Journal of Machine Learning Research 21 (178), pp. 1–51.
- The Colonel Blotto game. Economic Theory 29 (1), pp. 1–24.
- Prioritized experience replay. arXiv:1511.05952.
- Proximal policy optimization algorithms. arXiv:1707.06347.
- Reflexion: language agents with verbal reinforcement learning. arXiv:2303.11366.
- Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv:2408.03314.
- Qwen3 technical report. arXiv:2505.09388.
- TRL: transformer reinforcement learning. GitHub. https://github.com/huggingface/trl
- Q-learning. Machine Learning 8 (3), pp. 279–292.
- SPIN-Bench: how well do LLMs plan strategically and reason socially?. arXiv:2503.12349.
- The surprising effectiveness of PPO in cooperative multi-agent games. arXiv:2103.01955.
- Multi-agent reinforcement learning: a selective overview of theories and algorithms. Handbook of Reinforcement Learning and Control, pp. 321–384.
Appendix A Episode-Level Reward Computation (Players Builder)
As discussed in Section 5.2.1, the Players Builder computes meaningful episode-level rewards from the raw game state. This section provides the exact formulas used for each environment.
A.1 Colonel Blotto
In Colonel Blotto, the game consists of rounds (typically ), and a player wins by securing a majority of rounds. Let denote the number of rounds won by player , and let be the majority threshold (rounds needed to win the match). The episode reward is:
| (1) |
For , we have , so:
- Winning 5 rounds yields
- Winning 3 rounds yields
- Winning 0 rounds yields
This normalization makes rewards reflect progress toward victory rather than the binary win/loss outcome alone.
A.2 Codenames
In Codenames, each team has a target number of words to guess: for Red and for Blue. Let denote the number of correct team words guessed by team ’s operative, and let be that team’s goal. The episode reward is:
| (2) |
For a Red team player:
- Guessing all 9 words yields
- Guessing 5 words yields
- Guessing 0 words yields
This reward is computed per team: both the spymaster and operative on the same team receive the same episode-level reward, since they share responsibility for collective performance.
A.3 Three-Player Iterated Prisoner’s Dilemma
In the three-player IPD, each player interacts with two opponents over rounds (typically ). The payoff matrix follows the standard structure with values (reward for mutual cooperation), (temptation to defect), (sucker’s payoff), and (punishment for mutual defection). Let denote player ’s cumulative score across all rounds and interactions.
The maximum possible score per round is (one interaction with each opponent). The episode reward is normalized by the theoretical maximum:
| (3) |
With standard payoffs () and rounds:
- Maximum score is (defecting against two always-cooperating opponents)
- Mutual cooperation yields , so
- Mutual defection yields , so
A.4 Outcome Determination
Beyond reward computation, the Players Builder determines categorical outcomes (win/loss/draw/invalid) using the following priority:
1. Invalid action: If any player made an invalid action, that player receives the invalid_action outcome; others receive other_player_invalid_action.
2. Parsing failure: The same logic applies for parsing failures.
3. Game-specific termination: Assassin hits (Codenames), subset clues (Codenames), or other rule violations.
4. Score comparison: Higher score wins; equal scores result in a draw.
This structured outcome determination feeds the Steps Filter and Reward Assigner with the information they need for step eligibility and reward attribution decisions.
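The priority order can be expressed as a chain of early returns. This is a sketch under assumptions: only the labels invalid_action and other_player_invalid_action appear in the text; the parsing_failure and rule_violation labels, the function signature, and the handling of ties among more than two players are illustrative.

```python
def determine_outcomes(scores, invalid_player=None, parse_failed_player=None,
                       rule_violator=None):
    """scores: {player: score}. Returns {player: outcome label} by priority."""
    players = list(scores)

    def blame(offender, label):
        return {p: (label if p == offender else f"other_player_{label}")
                for p in players}

    # 1. Invalid action takes precedence over everything else.
    if invalid_player is not None:
        return blame(invalid_player, "invalid_action")
    # 2. Parsing failure follows the same logic (label is hypothetical).
    if parse_failed_player is not None:
        return blame(parse_failed_player, "parsing_failure")
    # 3. Game-specific termination such as an assassin hit (label is hypothetical).
    if rule_violator is not None:
        return blame(rule_violator, "rule_violation")
    # 4. Score comparison: unique top score wins, tied top scores draw.
    best = max(scores.values())
    leaders = [p for p in players if scores[p] == best]
    if len(leaders) == 1:
        return {p: ("win" if p == leaders[0] else "loss") for p in players}
    return {p: ("draw" if p in leaders else "loss") for p in players}
```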
Appendix B Per-Step Reward Attribution
This section details the per-step reward assignment and step filtering logic for each game environment. All rewards are clipped to with strict floors for invalid actions () and parsing failures ().
B.1 Codenames
Codenames presents unique challenges because spymaster clues must be evaluated based on operative guesses, creating inter-step dependencies.
B.1.1 Action Validation
- Spymasters (players 0, 2): Must produce [word N] where word is a single alphabetic token and N is a positive integer.
- Operatives (players 1, 3): Must produce [word] where word is a single token (or “pass”).
- Substring rule: Clues that are substrings of (or contain) any board word are marked as violations, receiving penalty .
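The three rules above map naturally onto two regular expressions and a substring check. The sketch assumes the final action has already been extracted from the model output, treats guess words as alphabetic, and compares case-insensitively; the function names and the returned labels are hypothetical.

```python
import re

# [word N]: one alphabetic token, then a positive integer.
CLUE_RE = re.compile(r"\[([A-Za-z]+) ([1-9][0-9]*)\]")
# [word]: one alphabetic token, which also covers [pass].
GUESS_RE = re.compile(r"\[([A-Za-z]+)\]")


def validate_clue(action, board_words):
    match = CLUE_RE.fullmatch(action.strip())
    if match is None:
        return "parse_error"
    clue = match.group(1).lower()
    for word in board_words:
        word = word.lower()
        # Substring rule: the clue may not contain, or be contained in, a board word.
        if clue in word or word in clue:
            return "substring_violation"
    return "ok"


def validate_guess(action):
    return "ok" if GUESS_RE.fullmatch(action.strip()) else "parse_error"
```

For instance, `validate_clue("[dogs 2]", ["dog"])` is flagged as a substring violation, while `validate_clue("[animal 2]", ["dog", "cat"])` passes.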
B.1.2 Clue Reward (Spymaster)
For a clue with declared count when team words remain unguessed, let . After the operative makes correct guesses (out of the attempts within scope):
| (4) |
where adds fixed penalties if the turn resulted in guessing an opponent word () or the assassin (). Special cases:
- : reward (invalid clue count)
- (overshooting): linear penalty in based on overshoot magnitude
- (no correct guesses):
B.1.3 Guess Reward (Operative)
For guess within the scope of a clue ():
| Condition | Reward |
|---|---|
| Correct (own team word) | (group average) |
| Neutral word | |
| Duplicate guess | |
| Off-board word | |
| Opponent word | |
| Assassin | |
| Early pass (within ) | |
For the th guess (bonus attempt), passing is rewarded (), while continuing risks penalties. Correct guesses receive the group average reward , computed over all guesses within scope, so that operatives share credit and blame for collective performance.
B.1.4 Outcome Multipliers
Positive step rewards are scaled by episode outcome: win , draw , loss .
B.1.5 Step Filtering
- Steps from non-trainable players are excluded.
- Clue steps with condition = correct but len(guesses) = 0 are excluded (no signal available).
- Invalid/parsing-failed steps remain eligible (to train format compliance).
B.2 Colonel Blotto
Colonel Blotto involves multiple rounds of resource allocation with immediate per-round feedback.
B.2.1 Round-Level Reward
Each round produces a winner (or tie). For player in round :
| (5) |
B.2.2 Match-Level Shaping
The match outcome (determined by who wins more rounds) modulates step rewards:
| (6) |
where and the match multiplier is:
| Match Outcome | |
|---|---|
| Win | |
| Draw | |
| Loss | |
Winning individual rounds therefore matters more when they contribute to match victory.
B.2.3 Step Filtering
- All trainable player steps are initially eligible.
- If the opponent made an invalid move (ending the match early), the last step of the trainable player is excluded if it occurred after the invalid action, since no learning signal is available for that incomplete round.
B.3 Three-Player Iterated Prisoner’s Dilemma
Iterated Prisoner’s Dilemma combines conversation rounds with binary decisions, requiring group-based reward attribution.
B.3.1 Step Grouping
Each Iterated Prisoner’s Dilemma round consists of conversation steps followed by 1 decision step, forming a group of size . Rewards are attributed to the entire group based on the decision outcome.
B.3.2 Round Gain Normalization
Let be player ’s score increment in round . The group reward is:
| (7) |
where are the standard Iterated Prisoner’s Dilemma payoff matrix values (Reward, Temptation, Sucker, Punishment). The denominator normalizes by the maximum possible per-interaction gain.
B.3.3 Reward Distribution
The group reward is assigned equally to all valid steps in the group (conversation + decision). If the decision step is invalid or parsing-failed:
- The decision step receives its standard penalty ( or ).
- All conversation steps in that group receive reward (no propagation of positive signal to actions that led to an invalid decision).
B.3.4 Outcome Multipliers
Positive step rewards are scaled by episode outcome: win , draw , loss .
B.3.5 Step Filtering
- All trainable player steps are initially eligible.
- Incomplete groups: If a round ends mid-conversation (before the decision step), all non-error conversation steps in that incomplete group are marked ineligible because they have no associated outcome.
- Invalid decision propagation: If the decision step is invalid, preceding conversation steps in that group are marked ineligible (no positive signal to propagate), but error steps remain eligible for training.
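Taken together, the distribution and filtering rules for one round can be sketched as a single pass over the group. The `Step` type, the function name, the `ERROR_PENALTY` magnitude, and the zero reward written to gated conversation steps are assumptions for illustration; the eligibility logic follows the rules above.

```python
from dataclasses import dataclass

ERROR_PENALTY = -1.0  # hypothetical; stands in for the invalid/parsing-failure floors


@dataclass
class Step:
    kind: str              # "conversation" or "decision"
    error: bool = False    # invalid action or parsing failure
    reward: float = 0.0
    eligible: bool = True


def attribute_group(group, group_reward):
    """Distribute one round's group reward and mark steps without an outcome."""
    decision = next((s for s in group if s.kind == "decision"), None)
    for s in group:
        if s.error:
            # Error steps keep their penalty and remain eligible for training.
            s.reward = ERROR_PENALTY
        elif decision is None or decision.error:
            # Incomplete group or invalid decision: no outcome to propagate.
            s.reward = 0.0
            s.eligible = False
        else:
            # Valid decision: every step in the group shares the same reward.
            s.reward = group_reward
    return group
```

Calling `attribute_group` on a round whose decision failed leaves only the failed decision step in the training set, with its penalty attached.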
B.4 Summary
The key principles across all environments are:
1. Strict error floors: Parsing failures () and invalid actions () always dominate shaped rewards, so format compliance is learned quickly.
2. Delayed attribution: Step rewards may depend on future events (guesses after clues, decisions after conversations), computed only after episode completion.
3. Signal-based filtering: Steps without valid dependent information are excluded rather than trained with arbitrary rewards.
4. Outcome shaping: Episode-level outcomes modulate step-level rewards, connecting local decisions to global success.