Skip to content

Aiden0526/EvoArena

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments

Benchmarking and improving LLM agents under persistent environment evolution.

Paper Project Page Code Dataset

Status. This repository is being released progressively. The main experiment folders are EvoMem-PersonaMem-Evo/, EvoMem-Terminal-Bench-Evo/, and EvoMem-SWE-Chain-Evo/.

What Is EvoArena?

Most agent benchmarks evaluate a fixed environment snapshot. Real deployments are messier: workflows change, codebases evolve, dependencies move, validation rules tighten, and user preferences shift over time. EvoArena evaluates whether LLM agents can remain reliable under this kind of persistent environment evolution.

EvoArena models each environment as a chain of progressively evolving releases. The high-level setting remains connected across releases, but the executable workflow, repository state, or user preference state changes. Agents must solve the current task while avoiding stale behavior learned from earlier versions.

EvoArena covers three complementary domains:

Benchmark Environment Evolution Base Agent What It Tests
Terminal-Bench-Evo Executable terminal workflows evolve through interface, path, toolchain, validation, and policy changes. Terminus 2 Whether agents adapt terminal strategies across workflow releases.
SWE-Chain-Evo Repository states evolve through chronological software milestones. OpenHands Whether agents solve new code requirements without regressing prior behavior.
PersonaMem-Evo User preferences evolve across long interaction histories. A-Mem Whether memory agents track preference updates, conflicts, and temporal trajectories.

Dataset Composition

EvoArena reports both step accuracy and chain accuracy. Step accuracy measures whether an agent solves an individual evolved task instance. Chain accuracy is stricter: a chain is correct only when the agent succeeds across the required sequence of related evolutionary steps.

Persistent Environment Evolution

Terminal-Bench-Evo

Terminal-Bench-Evo starts from Terminal-Bench tasks and turns each one into a discrete workflow-version chain. The same high-level terminal objective is preserved, but later releases may change deployment mechanisms, paths, branches, dependencies, permissions, or validation logic.

SWE-Chain-Evo

SWE-Chain-Evo evaluates agents on software repositories whose code states evolve through milestone updates. At each step, the agent receives the current accumulated repository state and a localized software requirement, then produces a patch evaluated by target and regression tests.

PersonaMem-Evo

PersonaMem-Evo evaluates long-horizon memory under evolving user preferences. Questions require agents to infer current preferences from long interaction histories, resolve conflicts, and reason about how earlier preferences were revised.

EvoMem

EvoMem is a lightweight, git-like memory paradigm for evolving environments. Instead of storing only the latest memory state, EvoMem records meaningful memory updates as patches. Each patch captures what changed, why it changed, and which evidence supported the update. At inference time, agents can retrieve the latest memory together with relevant historical patches when a task depends on overwritten, conflicting, or version-specific information.

EvoMem is designed as a wrapper around existing agents rather than a replacement for them. In this repository, EvoMem is integrated with:

  • A-Mem for PersonaMem-Evo.
  • Terminus 2 for Terminal-Bench-Evo.
  • OpenHands for SWE-Chain-Evo.

Repository Structure

EvoArena/
├── EvoMem-PersonaMem-Evo/       # A-Mem + EvoMem on PersonaMem-Evo
├── EvoMem-Terminal-Bench-Evo/   # Terminus 2 + EvoMem on Terminal-Bench-Evo
├── EvoMem-SWE-Chain-Evo/        # OpenHands + EvoMem on SWE-Chain-Evo
├── assets/                      # README figures
└── README.md

Running Experiments

Each main experiment folder is self-contained and will include its own detailed README.md, environment setup, scripts, and command examples. The top-level README only provides the entry points.

PersonaMem-Evo

cd EvoMem-PersonaMem-Evo
# See EvoMem-PersonaMem-Evo/README.md for setup, data preparation, and evaluation.

This folder contains the currently released EvoMem implementation for preference-memory evolution with A-Mem.

Terminal-Bench-Evo

cd EvoMem-Terminal-Bench-Evo
# See EvoMem-Terminal-Bench-Evo/README.md for Terminus 2 setup and evaluation.

This folder will contain the Terminal-Bench-Evo evaluation harness and EvoMem wrapper for chain-scoped terminal patch memory.

SWE-Chain-Evo

cd EvoMem-SWE-Chain-Evo
# See EvoMem-SWE-Chain-Evo/README.md for OpenHands setup and evaluation.

This folder will contain the SWE-Chain-Evo evaluation harness and EvoMem wrapper for software patch memory.

Data

The EvoArena dataset will be released on Hugging Face:

If you use a specific benchmark subset, please also check the corresponding experiment folder for any additional preprocessing or environment-packaging instructions.

Links

Citation

If you find EvoArena or EvoMem useful, please cite our paper:

@article{xu2026evoarena,
  title   = {EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments},
  author  = {Jundong Xu and Qingchuan Li and Jiaying Wu and Yihuai Lan and Shuyue Stella Li and Huichi Zhou and Bowen Jiang and Lei Wang and Jun Wang and Anh Tuan Luu and Caiming Xiong and Hae Won Park and Bryan Hooi and Zhiyuan Hu},
  journal = {arXiv preprint arXiv:2606.13681},
  year    = {2026}
}

License

The license will be added with the public release. Please check the final repository license before using the code or data in downstream projects.

About

Code and Data for Paper "EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments".

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors