Agentic Abstention

This repository is a research artifact release for the agentic abstention benchmark. It contains the code and lightweight artifacts needed to reproduce the benchmark protocol across three agent environments:

web/: WebShop-based web navigation tasks.
qa/: Q&A tasks with multi-turn Wikimedia search.
terminal/: TerminalBench immediate and delayed abstention tasks.

Raw datasets, search indexes, model outputs, debug traces, and cluster job artifacts are intentionally not tracked. Each environment includes download or materialization instructions for external assets.

Repository Layout

web/        WebShop instruction rewriting, missing-target construction, and evaluation
qa/         AbstentionBench-style Q&A datasets with Wikimedia SEARCH episodes
terminal/   TerminalBench task construction, Harbor configs, and analysis tools
docs/       Benchmark protocol, metric definitions, and data-source notes

Quick Start

Clone the repository, then follow the setup steps for the environment you want to reproduce. Each set of steps below starts from the repository root:

git clone https://github.com/lhannnn/agentic-abstention.git
cd agentic-abstention

For WebShop:

cd web
python -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt

For Q&A:

cd qa
conda env create -f environment.yml
conda activate abstention-bench
pip install -e .

For TerminalBench:

cd terminal
python -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt

See the environment-specific README files for data downloads and run commands.

Benchmark Protocol

Agentic abstention evaluates whether an agent knows when to stop and abstain instead of continuing with unsupported actions. The concrete action interface is environment-specific: Q&A uses Wikimedia search episodes, Web uses browser navigation, and Terminal uses command-line interaction.
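As a rough illustration of the stop-or-continue decision described above, the following sketch runs a single episode until the agent abstains or a step budget is exhausted. It is not the repository's harness: `ABSTAIN`, `EpisodeResult`, `run_episode`, the `policy`/`step` callables, and `max_steps` are all hypothetical names chosen for this example, and each environment defines its own action interface.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical sentinel; the real abstain action is environment-specific.
ABSTAIN = "ABSTAIN"


@dataclass
class EpisodeResult:
    abstained: bool
    abstain_step: Optional[int]  # 1-indexed step of the abstain action, if any
    actions: List[str]           # non-abstain actions taken, in order


def run_episode(policy: Callable[[List[str]], str],
                step: Callable[[str], str],
                max_steps: int = 10) -> EpisodeResult:
    """Run one episode until the agent abstains or the step budget runs out."""
    observations: List[str] = []
    actions: List[str] = []
    for t in range(1, max_steps + 1):
        action = policy(observations)
        if action == ABSTAIN:
            return EpisodeResult(True, t, actions)
        actions.append(action)
        observations.append(step(action))
    return EpisodeResult(False, None, actions)


# Toy policy (assumption): abstain once two searches in a row return nothing.
def toy_policy(observations: List[str]) -> str:
    if len(observations) >= 2 and all(o == "no results" for o in observations):
        return ABSTAIN
    return f"search[{len(observations)}]"


result = run_episode(toy_policy, lambda action: "no results")
print(result.abstained, result.abstain_step, result.actions)
```

Recording the step at which the agent abstains, rather than only whether it did, is what would let a harness distinguish an early stop from one that comes after many unsupported actions.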

The main metrics are Timely Recall, Overall Recall, SPL, and pass@k. See docs/metrics.md for the metric definitions and docs/benchmark_protocol.md for the environment-level protocol summary.
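For orientation, pass@k is often computed with the unbiased estimator popularized in code-generation evaluation: given n sampled attempts of which c succeed, it estimates the probability that at least one of k randomly drawn attempts succeeds. The sketch below assumes that form; whether this benchmark uses the same estimator, and what counts as a success for an abstention task, is specified in docs/metrics.md, which remains the authoritative definition. The function name `pass_at_k` is chosen for this example.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: total attempts sampled for a task
    c: number of those attempts that succeeded
    k: number of attempts the metric is allowed to draw
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 attempts, 3 successes.
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

With k = 1 the estimate reduces to the plain success rate c / n; averaging the per-task values over all tasks gives the benchmark-level number.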

Data Policy

This repository tracks only small code, prompt, config, manifest, and test files. It does not include:

raw WebShop product/search assets,
raw or materialized TerminalBench task directories,
HuggingFace dataset caches,
Wikimedia dumps or retrieval indexes,
API keys or .env files,
model outputs, logs, plots, and debug traces.

See docs/data_sources.md and each environment's download/README.md.
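A policy like this can be checked mechanically before committing. The sketch below flags paths that look like excluded assets. The file patterns and directory names are hypothetical placeholders for illustration, not the contents of this repository's .gitignore; adapt them to the actual layout.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import Iterable, List

# Hypothetical examples of excluded files and directories (assumptions).
EXCLUDED_FILE_PATTERNS = (".env", ".env.*", "*.log")
EXCLUDED_DIR_NAMES = ("outputs", "logs", "indexes", "hf_cache")


def find_policy_violations(paths: Iterable[str]) -> List[str]:
    """Return the paths that match an excluded file pattern or directory."""
    violations = []
    for raw in paths:
        path = PurePosixPath(raw)
        # Any parent directory with an excluded name disqualifies the file.
        in_excluded_dir = any(part in EXCLUDED_DIR_NAMES
                              for part in path.parts[:-1])
        is_excluded_file = any(fnmatchcase(path.name, pattern)
                               for pattern in EXCLUDED_FILE_PATTERNS)
        if in_excluded_dir or is_excluded_file:
            violations.append(raw)
    return violations


staged = [
    "web/README.md",
    "qa/.env",
    "terminal/outputs/run1.json",
    "docs/metrics.md",
    "web/debug.log",
]
violations = find_policy_violations(staged)
print(violations)
```

In practice the input list could come from `git diff --cached --name-only`, with a non-empty result causing a pre-commit hook to fail.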
