This repository is a research artifact release for the agentic abstention benchmark. It contains the code and lightweight artifacts needed to reproduce the benchmark protocol across three agent environments:
- web/: WebShop-based web navigation tasks.
- qa/: Q&A tasks with Wikimedia multi-turn search.
- terminal/: TerminalBench immediate and delayed abstention tasks.
Raw datasets, search indexes, model outputs, debug traces, and cluster job artifacts are intentionally not tracked. Each environment includes download or materialization instructions for external assets.
web/ WebShop instruction rewriting, missing-target construction, and evaluation
qa/ AbstentionBench-style Q&A datasets with Wikimedia SEARCH episodes
terminal/ TerminalBench task construction, Harbor configs, and analysis tools
docs/ Benchmark protocol, metric definitions, and data-source notes
Clone the repository and enter the environment you want to reproduce:
git clone https://github.com/lhannnn/agentic-abstention.git
cd agentic-abstention

For WebShop:
cd web
python -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt

For Q&A:
cd qa
conda env create -f environment.yml
conda activate abstention-bench
pip install -e .

For TerminalBench:
cd terminal
python -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt

See the environment-specific README files for data downloads and run commands.
Agentic abstention evaluates whether an agent knows when to stop and abstain instead of continuing with unsupported actions. The concrete action interface is environment-specific: Q&A uses Wikimedia search episodes, Web uses browser navigation, and Terminal uses command-line interaction.
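The idea of stopping instead of continuing can be illustrated with a minimal, environment-agnostic sketch. Everything here is hypothetical and for illustration only: the `ABSTAIN` sentinel, `EpisodeResult`, `run_episode`, and `toy_policy` are not the repository's actual interface, which is defined per environment.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical sentinel action; the real environments define their own action formats.
ABSTAIN = "ABSTAIN"


@dataclass
class EpisodeResult:
    actions: List[str]            # every action the agent emitted, in order
    abstained_at: Optional[int]   # 1-based step of the abstention, or None if it never abstained


def run_episode(policy: Callable[[List[str]], str], max_steps: int) -> EpisodeResult:
    """Query the policy step by step and stop as soon as it abstains."""
    actions: List[str] = []
    for step in range(1, max_steps + 1):
        action = policy(actions)
        actions.append(action)
        if action == ABSTAIN:
            return EpisodeResult(actions, step)
    # The step budget ran out without an abstention.
    return EpisodeResult(actions, None)


def toy_policy(history: List[str]) -> str:
    """Toy agent: issue two searches, then abstain."""
    return ABSTAIN if len(history) >= 2 else f"search:{len(history)}"


result = run_episode(toy_policy, max_steps=5)
print(result.abstained_at, result.actions)
```

Recording the step at which the agent abstains, rather than only whether it did, is what would let an evaluator distinguish early from late abstention; how the benchmark actually scores this is specified in the docs, not in this sketch.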
The main metrics are Timely Recall, Overall Recall, SPL, and pass@k. See
docs/metrics.md for the metric definitions and docs/benchmark_protocol.md
for the environment-level protocol summary.
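For orientation, pass@k is often computed with the unbiased estimator popularized in code-generation evaluation: given n sampled attempts of which c succeed, it estimates the probability that at least one of k randomly drawn attempts succeeds. The sketch below assumes that convention; whether this repository uses the same estimator is an assumption, and docs/metrics.md remains the authoritative definition. The function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n attempts with c successes (assumed standard estimator).

    Computes 1 - C(n - c, k) / C(n, k), the probability that a random
    size-k subset of the n attempts contains at least one success.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 attempts, 3 successes.
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

With k = 1 the estimator reduces to the plain success rate c / n, which is a quick sanity check on any implementation.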
This repository tracks only small code, prompt, config, manifest, and test files. It does not include:
- raw WebShop product/search assets,
- raw or materialized TerminalBench task directories,
- HuggingFace dataset caches,
- Wikimedia dumps or retrieval indexes,
- API keys or .env files,
- model outputs, logs, plots, and debug traces.
See docs/data_sources.md and each environment's download/README.md.