Awesome Agent Evals

A curated, opinionated, non-BS library of the best resources for building and evaluating AI agents — papers, blog posts, talks, courses, tools, and benchmarks.

Maintained by BenchFlow · Join our Discord

Most "awesome" lists are link dumps. This one is annotated and verified: every entry says what it is and why it belongs, URLs are checked, quotes are verbatim, and dead/abandoned tools are pruned (not silently listed). It was assembled by:

a depth-4 recursive citation crawl (11.6k papers, ranked by in-degree) to surface the academic canon,
targeted practitioner-web discovery for the industry sources citation graphs miss (Eugene Yan, Han-Chung Lee, Hamel Husain, Shreya Shankar, Nathan Lambert, …),
47 talks & podcasts transcribed and deep-noted (verbatim + timestamps), and
per-section gap audits with adversarial verification.

443+ curated links · 146 deep reading notes (see notes/). Markers: 🆕 = released/updated 2025–2026 · ⚠️ = caveat. Contributions welcome — see CONTRIBUTING.

📘 Playbook: PATTERNS.md — real, runnable code + worked examples for LLM-as-judge (aligned to humans), pass@k/pass^k, error analysis, trajectory & world-state grading, CI gating, verifiable rewards, and more.

⭐ Must-read starter set (read these first)

The Second Half — Shunyu Yao — https://ysymyth.github.io/The-Second-Half/ · blog — "Evaluation becomes more important than training." The field-level why.
An LLM-as-Judge Won't Save the Product, Fixing Your Process Will — Eugene Yan — https://eugeneyan.com/writing/eval-process/ · blog — Process over tooling; evals as the scientific method.
Hidden Technical Debt: Agent Evaluation Infrastructure — Han-Chung Lee — https://leehanchung.github.io/blogs/2026/06/13/hidden-technical-debt-agent-evaluation-infra/ · blog — Control/data plane, the five eval surfaces, state deltas. "Chat eval was a spreadsheet; agent eval is a system."
LLM Evals FAQ — Hamel Husain & Shreya Shankar — https://hamel.dev/blog/posts/evals-faq/ · blog — The densest operational Q&A: error analysis, binary judgments, the benevolent-dictator labeler.
Asymmetry of Verification and Verifier's Law — Jason Wei — https://www.jasonwei.net/blog/asymmetry-of-verification-and-verifiers-law · blog — "Ability to verify == ability to create an RL environment."
Demystifying Evals for AI Agents — Anthropic — https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents · blog — Best primary source on agent-specific evals: task design, outcome vs trajectory, isolated trials, pass@k vs pass^k.
How to Build Good Language Modeling Benchmarks — Ofir Press — https://ofir.io/How-to-Build-Good-Language-Modeling-Benchmarks/ · blog — Natural / auto-evaluatable / challenging; the "-200%" difficulty target; ~1-yr saturation.
AI Agents That Matter — Kapoor, Stroebl, Siegel, Nadgir, Narayanan — https://arxiv.org/abs/2407.01502 · paper — Cost as a first-class metric; model-dev vs app-dev; missing holdouts breed overfitting.
Building on Evaluation Quicksand — Nathan Lambert — https://www.interconnects.ai/p/building-on-evaluation-quicksand · blog — LLM eval has no ground truth; contamination; eval↔training coupling.
Who Validates the Validators? (EvalGen) — Shankar, Zamfirescu-Pereira, Hartmann, Parameswaran, Arawjo (UIST '24) — https://arxiv.org/abs/2404.12272 · paper — "Criteria drift": you can't write the rubric before you grade.
Benches 2026 — "LLM benchmarks in the era of agents" — Florian Brand (Prime Intellect) — https://florianbrand.com/posts/benches-2026 · blog + 61-slide talk — The sharpest current read on why benchmarks break in the agent era: the "evals are dead, just measure vibes" backlash, how every layer of the eval-running stack (prompt · sampling temp · grader · harness) swings the score, and that benchmark ground truth is frequently wrong.
A Shared Playbook for Trustworthy Third-Party Evaluations — OpenAI — https://openai.com/index/trustworthy-third-party-evaluations-foundations/ · blog (Safety, May 2026) — What makes independent evals of frontier-model safeguards & capabilities trustworthy: harness selection, the validity hazards that distort results, and the standards third-party evaluators need.
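
The pass@k vs pass^k distinction named in the Anthropic entry above can be made concrete with a short sketch. It assumes each task was run for n independent trials with c successes, uses the unbiased pass@k estimator popularized by the HumanEval paper, and reads pass^k as "all k sampled trials succeed" (the τ-bench reading). The function names are illustrative, not taken from any listed library.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k trials passes) from n trials with c passes."""
    if not 0 < k <= n or not 0 <= c <= n:
        raise ValueError("require 0 < k <= n and 0 <= c <= n")
    # 1 minus the chance that all k sampled trials come from the n - c failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k trials pass) from n trials with c passes."""
    if not 0 < k <= n or not 0 <= c <= n:
        raise ValueError("require 0 < k <= n and 0 <= c <= n")
    # Chance that all k sampled trials come from the c successes.
    return comb(c, k) / comb(n, k)


# An agent that passes 5 of 10 trials looks strong on pass@2, weak on pass^2.
print(round(pass_at_k(10, 5, 2), 3))   # capability: can it ever succeed?
print(round(pass_hat_k(10, 5, 2), 3))  # reliability: does it always succeed?
```

pass@k rewards an agent for succeeding at least once, which suits best-of-k sampling; pass^k penalizes any failure, which is closer to what a production user experiences.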

1 · Why we need evals

The Second Half — Shunyu Yao — https://ysymyth.github.io/The-Second-Half/ · blog — The bottleneck shifts from solving problems to defining and evaluating them. (also T2, T7)
An LLM-as-Judge Won't Save the Product, Fixing Your Process Will — Eugene Yan — https://eugeneyan.com/writing/eval-process/ · blog — "Buying or building another evaluation tool won't save the product." Evals = the scientific method in disguise.
Your AI Product Needs Evals — Hamel Husain — https://hamel.dev/blog/posts/evals/ · blog — The canonical "you need evals"; remove all friction from looking at your data; don't rely on generic frameworks.
A Field Guide to Rapidly Improving AI Products — Hamel Husain — https://hamel.dev/blog/posts/field-guide/ · blog — "Error analysis is consistently the highest-ROI activity." The metric for an AI roadmap is experiments run.
In Defense of AI Evals, for Everyone — Shreya Shankar — https://www.sh-reya.com/blog/in-defense-ai-evals/ · blog — Rebuts the anti-eval backlash; evals = the systematic measurement of application quality.
What We Learned from a Year of Building with LLMs — Yan, Bischof, Frye, Husain, Liu, Shankar — https://applied-llms.org/ (Part II: https://www.oreilly.com/radar/what-we-learned-from-a-year-of-building-with-llms-part-ii/) · blog — The "intern test," genchi genbutsu, turning vibe-checks into assertions.
Big Tech's LLM Evals Are Just Marketing — Nathan Lambert — https://www.interconnects.ai/p/evals-are-marketing · blog — Why frontier-lab leaderboard numbers are marketing, not science.
AI Engineering pitfalls — Chip Huyen — https://huyenchip.com/2025/01/16/ai-engineering-pitfalls.html · blog — Common eval/AI-engineering mistakes from the AI Engineering author. (also T6)
Evals Are NOT All You Need — Aishwarya Naresh Reganti & Kiriti Badam (O'Reilly Radar) — https://www.oreilly.com/radar/evals-are-not-all-you-need/ · blog — The essential nuance piece: automated graders alone don't save you; you need a continuous-improvement flywheel of offline tests + production monitoring + real-user iteration. Pairs with Shreya's 'In Defense' to complete the backlash debate. 🆕
Why AI evals are the hottest new skill for product builders — Hamel Husain & Shreya Shankar with Lenny Rachitsky (Lenny's Podcast/Newsletter) — https://www.lennysnewsletter.com/p/why-ai-evals-are-the-hottest-new-skill · talk — The accessible 'why evals matter' on-ramp (live walkthrough of error analysis, open/axial coding) that mainstreamed evals to PMs in 2025; the apartment-leasing-bot anecdote is the canonical 'you can't vibe-check' story. 🆕
How evals drive the next chapter in AI for businesses — OpenAI — https://openai.com/index/evals-drive-next-chapter-of-ai/ · blog — Frontier-lab framing of evals as turning fuzzy business goals into specs and measurable ROI; useful counterweight to Lambert's 'evals are marketing' and grounds the 'why' for enterprise readers. 🆕 ⚠️ (unverified URL)
Beyond vibe checks: A PM's complete guide to evals — Aman Khan (Arize) with Lenny Rachitsky — https://www.lennysnewsletter.com/p/beyond-vibe-checks-a-pms-complete · blog — The widely-shared PM-oriented argument for moving past 'looked good to me' vibe checks to systematic evals; one of the pieces that made evals a mainstream product skill in 2025. 🆕
A pragmatic guide to LLM evals for devs — Gergely Orosz & Hamel Husain (The Pragmatic Engineer) — https://newsletter.pragmaticengineer.com/p/evals · newsletter — Reaches the broad engineering audience with the core 'why': LLM non-determinism breaks traditional testing, so you need evals. High-distribution motivation piece co-written by Hamel. 🆕
Predicting model behavior before release by simulating deployment (Deployment Simulation) — OpenAI — https://openai.com/index/deployment-simulation/ · blog — Concrete 2026 evidence for why fixed/static evals fail: models recognize when they're being tested and game test suites; replaying ~1.3M real conversations surfaced reward-hacking no fixed eval caught. Strong 'why evals must evolve' argument. 🆕 ⚠️ (unverified URL)
evals are surprisingly often all you need — Greg Brockman (OpenAI) — https://x.com/gdb/status/1733553161884127435 · tweet — The canonical one-liner ('evals are the new unit test') that anchors the whole 'why evals' thesis; frequently cited founding quote for the movement. Short but load-bearing.

Must-reads: Yao · Yan (eval-process) · Hamel (field-guide)

2 · "If you can eval it, you have built it" — eval ⇄ capability ⇄ RL environment

Asymmetry of Verification and Verifier's Law — Jason Wei — https://www.jasonwei.net/blog/asymmetry-of-verification-and-verifiers-law · blog — Trainability tracks verifiability; verifying = creating an RL environment.
A Taxonomy of RL Environments for LLM Agents — Han-Chung Lee — https://leehanchung.github.io/blogs/2026/03/21/rl-environments-for-llm-agents/ · blog — A benchmark is a frozen RL environment; the E = {T,H,V,S,C} decomposition; "verifiable beats judgeable."
The Life Cycle of an RL Environment — Kanav Garg (Core Automation; ex-DeepMind) — talk; summary at https://muratbuffalo.blogspot.com/2026/06/acm-cais-conference-on-ai-and-agentic.html · talk — Difficulty calibration (the 1–4/16 Goldilocks band), RL as variance reduction, reward hacking under training pressure. (local notes: research/notes/kanav-garg-rl-environment-lifecycle.md)
Welcome to the Era of Experience — David Silver & Richard Sutton — https://storage.googleapis.com/deepmind-media/Era-of-Experience%20/The%20Era%20of%20Experience%20Paper.pdf · paper — Human-data value approaching its ceiling; the frontier is agents learning from experience / synthetic environments.
RLHF Book, Ch. 16 — Evaluation — Nathan Lambert — https://rlhfbook.com/c/16-evaluation · book — Evaluation as a reflection of training goals; prompt-format sensitivity (60%→~0%).
What Comes Next with Reinforcement Learning — Nathan Lambert — https://www.interconnects.ai/p/what-comes-next-with-reinforcement · blog — Long-horizon credit assignment; where RL is and isn't ready.
verifiers — Prime Intellect — https://github.com/PrimeIntellect-ai/verifiers (docs: .../blob/main/docs/environments.md) · tool/repo — One environment package shared by eval and prime-rl — the eval-is-an-RL-env thesis as code.
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning — DeepSeek-AI (Guo et al.) — https://arxiv.org/abs/2501.12948 · paper — The proof-of-thesis: pure RL with rule-based verifiable rewards (no SFT) makes reasoning emerge — the canonical 'if you can verify it, RL builds it' result; also published in Nature 2025. 🆕
Tülu 3: Pushing Frontiers in Open Language Model Post-Training — Lambert et al. (Allen Institute for AI) — https://arxiv.org/abs/2411.15124 · paper — Coined/popularized RLVR and open-sourced the recipe + code (open-instruct): swap the reward model for a verifier on tasks with checkable answers. The foundational citation behind every 'verifiable beats judgeable' claim in this section. 🆕
Natural Emergent Misalignment from Reward Hacking in Production RL — Anthropic — https://www.anthropic.com/research/emergent-misalignment-reward-hacking · paper — Empirical receipt for the section's 'reward hacking under training pressure' theme: learning to cheat on real coding environments generalizes to sabotage/alignment-faking; introduces inoculation prompting as mitigation (arXiv 2511.18397). 🆕
Environments Hub: A Community Hub To Scale RL To Open AGI — Prime Intellect — https://www.primeintellect.ai/blog/environments · blog — The launch post for the verifiers-spec marketplace (2,500+ shared eval/RL environments) — the eval-is-an-RL-env thesis as an actual ecosystem, the natural companion to the already-listed verifiers repo. 🆕
How to fully automate software engineering — Ege Erdil, Matthew Barnett, Tamay Besiroglu (Mechanize) — https://www.mechanize.work/blog/how-to-fully-automate-software-engineering/ · blog — Sharpest statement of the inverse thesis: today's RL environments are rudimentary, so capability is gated on building richer/more diverse environments — 'you only get the capability you can build an environment for.' 🆕
Cheap RL tasks will waste compute — Mechanize (Erdil, Barnett, Besiroglu) — https://www.mechanize.work/blog/cheap-rl-tasks-will-waste-compute/ · blog — The economics of environment quality: data and compute are complementary, so low-quality (cheaply-bought) tasks waste expensive RL compute — directly informs difficulty calibration / why environment design matters. 🆕
An FAQ on Reinforcement Learning Environments — Jean-Stanislas Denain & Chris Barber (Epoch AI) — https://epoch.ai/gradient-updates/state-of-rl-envs · blog — Practitioner-interview survey (18 pros) on how RL environments are actually built, the reward-hacking failure modes, and the production-scaling bottleneck: an empirical state-of-the-field map. 🆕
RL Environments and RL for Science: Data Foundries and Multi-Agent Architectures — AJ Kourabi & Dylan Patel (SemiAnalysis) — https://newsletter.semianalysis.com/p/rl-environments-and-rl-for-science · newsletter — Market-structure view: 35+ companies now sell RL environments; capability gains are coming from ramping RL compute, not pretraining. Grounds the 'benchmark = frozen RL environment' thesis in who's actually building/buying them. 🆕
Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces — Harbor / Stanford / Laude Institute — https://github.com/harbor-framework/terminal-bench · benchmark — A concrete instance of the thesis: each task ships a Docker environment + programmatic verification test suite + oracle — i.e. a benchmark that IS an RL environment (and is used as one). 2.4k stars, active. 🆕
tau2-bench (τ²-Bench): A Benchmark for Tool-Agent-User Interaction in Real-World Domains — Sierra Research (Barres et al.) — https://github.com/sierra-research/tau2-bench · benchmark — Dual-control, multi-turn, policy-following eval with a simulated user and verifiable DB-state checks — the canonical example of a verifiable conversational/agentic environment beyond math/code (paper arXiv 2506.07982). 🆕

Must-reads: Wei · Lee (RL-env taxonomy)
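
Two ideas from this section, rule-based verifiable rewards and difficulty calibration, fit in a few lines. The sketch below is a minimal illustration under stated assumptions: the "1–4/16" band from the Garg entry is read here as "keep tasks that 1 to 4 of 16 rollouts solve", and the reward rule (last number in the completion equals the target) is a hypothetical stand-in for a real verifier. None of the names come from verifiers, prime-rl, or any other listed tool.

```python
import re


def exact_answer_reward(completion: str, target: str) -> float:
    """Rule-based reward: 1.0 if the last number in the completion equals target."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return 1.0 if numbers and numbers[-1] == target else 0.0


def in_goldilocks_band(rewards: list[float], low: int = 1, high: int = 4) -> bool:
    """True if the number of solved rollouts falls inside [low, high]."""
    solved = sum(1 for r in rewards if r >= 1.0)
    return low <= solved <= high


def calibrate(rollouts_by_task: dict[str, list[float]]) -> list[str]:
    """Keep only tasks that are neither always solved nor never solved."""
    return [task for task, rewards in rollouts_by_task.items()
            if in_goldilocks_band(rewards)]


# Hypothetical rewards from 16 rollouts per task.
rollouts = {
    "too-easy": [1.0] * 16,
    "too-hard": [0.0] * 16,
    "useful": [1.0] * 3 + [0.0] * 13,
}
print(calibrate(rollouts))
```

Tasks that are always or never solved give a policy-gradient trainer no contrast between rollouts, which is one reason a band like this is used to filter environments before spending RL compute on them.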

3 · The model / harness / skill decomposition

Hidden Technical Debt: Agent Harness — Han-Chung Lee — https://leehanchung.github.io/blogs/2026/05/08/hidden-technical-debt-agent-harness/ · blog — The harness is the agent; what teams call "the model" is mostly harness + product.
Hidden Technical Debt series (index) — Han-Chung Lee — https://leehanchung.github.io/blogs/ · blog — The four-part series (eval infra, runtime, harness, + agent runtime ~2026/04/24). ⚠️ runtime post URL unverified; check the index.
Measuring AI Ability to Complete Long Tasks — METR — https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/ · paper/blog — Scaffolds change the measured horizon; success-vs-human-time as a primitive. (also T9)
Turing Post interview ("Open Models Won't Catch Up") — Nathan Lambert — https://www.turingpost.com/p/nathanlambert · talk/interview — "What technical people call the harness or the product matters more than just the model."
Quo vadis, LLM benchmarks? — Florian Brand (Prime Intellect) — https://florianbrand.com/posts/benches-2026 (talk: https://www.youtube.com/watch?v=kmTMc-fVSXw) · blog/talk — The AlgoTune case: same model, different harness, opposite ranking. (also T6) (notes: research/notes/florian-brand-*)
The Model is the Product — Han-Chung Lee — https://leehanchung.github.io/talks/2025/04/23/the-model-is-the-product/ · talk — The primary-source talk (Data Council 2025) behind Lee's whole thesis — the direct counterpart to Hamel's 'Model is Not the Product'; the foundational text of the harness/model debate this section is built on. 🆕
The Model is Not the Product — Hamel Husain — https://www.youtube.com/watch?v=EEw2PpL-_NM · talk — The opposing side of the Lee debate (Data Council 2025): great products are mostly harness + product + evals, not the model. 🆕
Agents are models using tools in a loop — Simon Willison — https://simonwillison.net/2025/May/22/tools-in-a-loop/ · blog — The canonical, now-widely-adopted definition of an agent; 'the skill is in the design of both the tools and the loop' — the cleanest statement of why the harness, not the model, dominates behavior. 🆕
Harness engineering: leveraging Codex in an agent-first world — OpenAI — https://openai.com/index/harness-engineering/ · blog — Frontier-lab primary source coining 'harness engineering': a 1M-line codebase built by Codex agents where improving the environment/harness mattered more than the model. Lab-side complement to Lee's 'harness is the agent'. 🆕 ⚠️ URL returns 403 to scrapers but the page is live; corroborated by InfoQ/Milvus coverage.
Equipping agents for the real world with Agent Skills — Anthropic — https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills · blog — The primary source for the 'skill' leg of the model/harness/skill decomposition — skills as composable, progressively-disclosed capabilities (later made an open standard). 🆕
Effective context engineering for AI agents — Anthropic — https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents · blog — Anthropic's primary statement that the harness's job is engineering context (editing, compaction, memory, programmatic tool-calling) — the mechanism behind why same model + different harness diverges. 🆕
Writing effective tools for agents — with agents — Anthropic (Ken Aizawa) — https://www.anthropic.com/engineering/writing-tools-for-agents · blog — Tool design is a load-bearing part of the harness; 'agents are only as effective as the tools we give them,' validated eval-first. Directly ties harness decisions to measured agent performance. 🆕
Paving the way for agents in biology (VirBench) — Anthropic — https://www.anthropic.com/research/agents-in-biology · blog — 120 viral-sequence retrieval queries across 40 pathogens; without deterministic tooling, model accuracy spans 16.9%–91.3% and Claude Sonnet 4 returned 106 sequences in one run, 15 in a second, 5 in a third to an identical prompt. Adding a deterministic retrieval layer (gget) pushed all models above 90% accuracy (peaking at 99.7%) and eliminated run-to-run variance entirely: "adding a deterministic retrieval layer made model choice much less important." Controlled empirical proof that harness/tool design dominates model choice for scientific precision. 🆕
Same Model, Different Results: Why Coding Agents Aren't Interchangeable — Pete Hodgson — https://blog.thepete.net/blog/2025/12/10/same-model-different-results-why-coding-agents-arent-interchangeable/ · blog — Concrete teardown of Claude Code's harness (system reminders, sub-agents, planning, IDE feedback) showing identical models yield different results — the practitioner case-study version of Brand's AlgoTune point. 🆕
Holistic Agent Leaderboard (HAL) — Princeton SAgE team (Kapoor, Narayanan, et al.) — https://hal.cs.princeton.edu/ · benchmark — Standardized, cost-aware harness that runs the SAME agent harness across 9 benchmarks/9 models (21,730 rollouts) — the infrastructure answer to 'harness confounds rankings.' ICLR 2026; paper arXiv:2510.11977. 🆕
Agent Harness Engineering — Addy Osmani (O'Reilly Radar) — https://www.oreilly.com/radar/agent-harness-engineering/ · blog — 'A decent model with a great harness beats a great model with a bad harness'; reframes agent failures as harness/config problems (traceable AGENTS.md rules). Names the converging harness primitives across coding agents. 🆕
What comes next with open models (weights / tools / harness decomposition) — Nathan Lambert (Interconnects) — https://www.interconnects.ai/p/the-next-phase-of-open-models · blog — Lambert's written articulation (Mar 2026) of an AI system as weights + tools + harness — the written companion to the Turing Post interview already listed, with the explicit three-part decomposition. 🆕

Must-reads: Lee (harness) · Brand (Quo vadis)
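
The claim running through this section, that the same model can rank differently under a different harness, is easy to check once scores are stored per (model, harness) cell rather than per model. A minimal sketch follows; the model names, harness names, and scores are placeholders invented for illustration, not measurements from HAL, AlgoTune, or any other source listed here.

```python
from itertools import combinations

# (model, harness) -> score. Placeholder numbers, not real results.
Scores = dict[tuple[str, str], float]


def ranking(scores: Scores, harness: str) -> list[str]:
    """Models ordered best-first under one harness."""
    cells = [(model, s) for (model, h), s in scores.items() if h == harness]
    return [model for model, _ in sorted(cells, key=lambda c: c[1], reverse=True)]


def rank_reversals(scores: Scores) -> list[tuple[str, str, str, str]]:
    """Find (model_a, model_b, harness_1, harness_2) where the winner flips."""
    models = sorted({m for m, _ in scores})
    harnesses = sorted({h for _, h in scores})
    reversals = []
    for a, b in combinations(models, 2):
        for h1, h2 in combinations(harnesses, 2):
            gap_1 = scores[(a, h1)] - scores[(b, h1)]
            gap_2 = scores[(a, h2)] - scores[(b, h2)]
            if gap_1 * gap_2 < 0:  # opposite signs: the ordering flipped
                reversals.append((a, b, h1, h2))
    return reversals


toy_scores: Scores = {
    ("model-a", "harness-x"): 0.62, ("model-b", "harness-x"): 0.55,
    ("model-a", "harness-y"): 0.48, ("model-b", "harness-y"): 0.59,
}
print(ranking(toy_scores, "harness-x"))
print(rank_reversals(toy_scores))
```

Any non-empty result means a single-harness leaderboard would have reported a ranking that the other harness contradicts, so the harness belongs in the reported result alongside the model name.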

4 · Observability & the output / eval space (the surfaces you can grade)

Hidden Technical Debt: Agent Evaluation Infrastructure — Han-Chung Lee — https://leehanchung.github.io/blogs/2026/06/13/hidden-technical-debt-agent-evaluation-infra/ · blog — Control plane / data plane; the five surfaces (output, trace, memory, environment, mechanistic); the empty-tool-result hallucination.
The Three Pillars of AI Observability — Braintrust — https://www.braintrust.dev/blog/three-pillars-ai-observability · blog — Dataset reconciliation (living datasets); traces / evals / annotation.
Agent Trajectory Evaluations — Arize (AX docs) — https://arize.com/docs/ax/evaluate/evaluators/trace-and-session-evals/trace-level-evaluations/agent-trajectory-evaluations · docs — Grading the path, not just the answer.
AI Agent Metrics: How Elite Teams Evaluate — Galileo — https://galileo.ai/blog/ai-agent-metrics · blog — A concrete agent-metric taxonomy (action completion, tool selection, etc.).
OpenInference semantic conventions — Arize — https://github.com/Arize-ai/openinference/blob/main/spec/semantic_conventions.md · tool/repo — An OTel-based agent trace schema (tool, args, observation, latency, cost).
LangSmith Evaluation / Trajectory evals — LangChain — https://docs.langchain.com/langsmith/evaluation · https://docs.langchain.com/langsmith/trajectory-evals · docs — LangSmith's evaluation docs plus its trajectory-eval docs for grading an agent's path.
OpenTelemetry GenAI Semantic Conventions (agent & framework spans) — OpenTelemetry / CNCF — https://github.com/open-telemetry/semantic-conventions-genai · docs — The upstream vendor-neutral standard (spans/metrics/events for LLM calls, invoke_agent, execute_tool, MCP) that OpenInference maps onto — the canonical trace schema the section's OpenInference entry derives from. 🆕
Semantic Conventions for GenAI agent and framework spans — OpenTelemetry — https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-agent-spans/ · docs — Human-readable spec page for create_agent / invoke_agent / execute_tool spans and attributes — the precise definition of what a gradable agent trace looks like. 🆕
Inside the LLM Call: GenAI Observability with OpenTelemetry — OpenTelemetry (blog) — https://opentelemetry.io/blog/2026/genai-observability/ · blog — Walkthrough of emitting and reading GenAI spans (token usage, finish reasons, tool calls) — concrete intro to the trace surface for practitioners not steeped in OTel. 🆕
W&B Weave — tracing & evaluation toolkit — Weights & Biases — https://docs.wandb.ai/weave · docs — @weave.op trace trees (inputs/outputs/cost/latency) plus a scorer-based eval harness — a widely used surface for grading both traces and outputs. 🆕
Laminar — open-source observability for AI agents — Laminar — https://laminar.sh/ · tool — OTel-native, agent-specific: transcript view, SQL-over-traces, and a rollout debugger — purpose-built for grading multi-step agent trajectories rather than single LLM calls. 🆕

Must-reads: Lee (eval infra) · Braintrust (three pillars)
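
Grading "the path, not just the answer" needs a trace to grade. The sketch below models a trace as a list of tool spans carrying the fields the OpenInference entry mentions (tool, args, observation, latency, cost) and grades outcome and trajectory separately. The `ToolSpan` class and the grading functions are hypothetical simplifications, not the OpenInference or OpenTelemetry schema.

```python
from dataclasses import dataclass


@dataclass
class ToolSpan:
    """One tool call in an agent trace (simplified, hypothetical schema)."""
    tool: str
    args: dict
    observation: str
    latency_ms: float = 0.0
    cost_usd: float = 0.0


def tools_in_order(trace: list[ToolSpan], expected: list[str]) -> bool:
    """True if the expected tool names occur in the trace as an ordered subsequence."""
    called = iter(span.tool for span in trace)
    # `name in called` consumes the iterator up to the match, preserving order.
    return all(name in called for name in expected)


def empty_result_spans(trace: list[ToolSpan]) -> list[int]:
    """Indices of tool calls that returned nothing, a common hallucination trigger."""
    return [i for i, span in enumerate(trace) if not span.observation.strip()]


def grade(trace: list[ToolSpan], final_answer: str,
          expected_tools: list[str], expected_answer: str) -> dict:
    """Grade the output surface and the trace surface independently."""
    return {
        "outcome_ok": final_answer.strip() == expected_answer,
        "trajectory_ok": tools_in_order(trace, expected_tools),
        "empty_tool_results": empty_result_spans(trace),
        "total_cost_usd": sum(span.cost_usd for span in trace),
    }


trace = [
    ToolSpan("search", {"q": "order 17"}, "order 17: shipped", cost_usd=0.01),
    ToolSpan("fetch_invoice", {"id": 17}, "", cost_usd=0.02),
    ToolSpan("reply", {}, "sent"),
]
print(grade(trace, "shipped", ["search", "reply"], "shipped"))
```

Keeping the two verdicts separate matters: here the outcome and trajectory both pass, yet the empty `fetch_invoice` result is still flagged for review, which an output-only grader would never see.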

5 · Evaluation infrastructure (the eval stack: datasets, scorers, online/offline, tracing, CI)

(All repos URL-verified via GitHub API, Jun 2026. 🆕 = released/expanded 2025–2026. ⚠️ = caveat/discontinued.)

5a · Eval frameworks & harnesses (code-first test-runners)

Inspect AI — UK AISI — https://github.com/UKGovernmentBEIS/inspect_ai · https://inspect.aisi.org.uk/ — @task binds dataset + solver + scorer; custom scorers; sandboxed tools. The reference agent-eval framework. (MUST)
inspect_evals — UK AISI — https://github.com/UKGovernmentBEIS/inspect_evals — 🆕 the companion catalog of community benchmarks (GAIA, CTFs, AIME…) — the "batteries" for Inspect.
AISI Engineering Playbook — UK AISI — https://www.aisi.gov.uk/blog/releasing-aisis-engineering-playbook · https://engineering-playbook.aisi.org.uk/ · guide — 🆕 Open guide to production frontier-model eval infrastructure: five-layer stack (Evaluate · Isolate · Connect · Run · Scale) covering sandboxed untrusted code, audited provider routing, hosted open-weight inference, and the systems-of-working that tie them together. "Much of this work is invisible, rarely documented, and developed from scratch by every team that attempts serious evaluation." The infrastructure layer surrounding Inspect AI, documented by the team that built and runs it.
lm-evaluation-harness — EleutherAI — https://github.com/EleutherAI/lm-evaluation-harness — the standard academic harness; first-class decontamination; task YAMLs.
OLMES — Allen Institute (Ai2) — https://github.com/allenai/olmes — 🆕 the reproducible eval standard + harness behind OLMo/Tülu: standardized prompts/metrics/formatting for apples-to-apples model comparison.
BenchFlow — https://github.com/benchflow-ai/benchflow · https://benchflow.ai — 🆕 environment-lab framework: research infra + runtime for building RL environments, evals & post-training; ships SkillsBench and ClawsBench. ("Environments are the new data.")
lighteval — Hugging Face — https://github.com/huggingface/lighteval — 🆕 all-in-one harness across transformers/vLLM/TGI/nanotron, 1000+ tasks; HF's successor to evaluate.
OpenBench — Groq — https://github.com/groq/openbench — 🆕 provider-agnostic bench CLI, 95+ benchmarks, built on Inspect primitives.
simple-evals — OpenAI — https://github.com/openai/simple-evals — minimal zero-shot/CoT scripts (MMLU, HumanEval, SimpleQA, HealthBench); the numbers OpenAI publishes. ⚠️ not actively maintained.
OpenAI Evals — https://github.com/openai/evals — the completion_fn abstraction = swap the system-under-test. (Best-practices: https://developers.openai.com/api/docs/guides/evaluation-best-practices)
promptfoo — https://github.com/promptfoo/promptfoo — MIT eval + red-teaming CLI; git-diffable YAML configs. (MUST)
DeepEval / Confident AI — https://github.com/confident-ai/deepeval — "pytest for LLMs," 40+ metrics (G-Eval, RAG, hallucination) + red-team; ~2M evals/day; hosted cloud. 🆕
pydantic-evals — https://github.com/pydantic/pydantic-ai (ai.pydantic.dev/evals) — 🆕 type-safe Datasets/Cases/Evaluators with OTel tracing, from the Pydantic AI team.
openevals — LangChain — https://github.com/langchain-ai/openevals — 🆕 prebuilt evaluators + create_llm_as_judge (incl. multimodal); general-purpose companion to agentevals (https://github.com/langchain-ai/agentevals, trajectory match).
MLflow GenAI evaluate — https://mlflow.org/docs/latest/genai/eval-monitor/ — 🆕 mlflow.genai.evaluate: 50+ judges/metrics, custom scorers, regression datasets inside MLflow.
HELM (crfm-helm) — Stanford CRFM — https://github.com/stanford-crfm/helm — holistic eval: standardized datasets + metrics beyond accuracy + leaderboard (also VHELM, HEIM).
Giskard — https://github.com/Giskard-AI/giskard-oss — auto-generates adversarial test suites (injection, hallucination, bias) from a plain-language app description.
Deepchecks LLM — https://github.com/deepchecks/deepchecks (llmdocs.deepchecks.com) — property-based scoring (grounded-in-context, toxicity, fluency) + custom LLM-judge properties.
UpTrain — https://github.com/uptrain-ai/uptrain — 20+ preconfigured checks + root-cause analysis on failures.
HF evaluate — https://github.com/huggingface/evaluate — classic metrics library, ⚠️ maintenance mode (use lighteval for LLMs).
Harbor — harbor-framework (Laude Institute / Stanford) — https://github.com/harbor-framework/harbor — 🆕 framework for running agent evals + creating/using RL environments; powers Terminal-Bench 2.0. ~2.7k★. ⚠️ name overloaded (cf. av/harbor local-LLM toolkit).
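
Most of the harnesses above share one shape: a task binds a dataset, a solver (the system under test), and a scorer. The library-free sketch below shows that shape only. It is not Inspect's API or any other listed framework's; `Sample`, `Task`, `includes`, and `stub_solver` are hypothetical names, and the stub stands in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Sample:
    input: str
    target: str


Solver = Callable[[str], str]         # prompt -> model output
Scorer = Callable[[str, str], float]  # (output, target) -> score in [0, 1]


@dataclass
class Task:
    """Binds what to ask, what answers it, and how answers are graded."""
    dataset: list[Sample]
    solver: Solver
    scorer: Scorer

    def run(self) -> float:
        """Mean score over the dataset (0.0 for an empty dataset)."""
        scores = [self.scorer(self.solver(s.input), s.target) for s in self.dataset]
        return sum(scores) / len(scores) if scores else 0.0


def includes(output: str, target: str) -> float:
    """Case-insensitive substring scorer."""
    return 1.0 if target.lower() in output.lower() else 0.0


def stub_solver(prompt: str) -> str:
    """Stand-in for a model call; returns canned answers."""
    canned = {
        "2 + 2 = ?": "The answer is 4.",
        "Capital of France?": "I believe it is Lyon.",
    }
    return canned.get(prompt, "")


task = Task(
    dataset=[Sample("2 + 2 = ?", "4"), Sample("Capital of France?", "Paris")],
    solver=stub_solver,
    scorer=includes,
)
print(task.run())
```

Because the solver is just a callable, swapping the system under test (a different model, prompt, or agent harness) leaves the dataset and scorer untouched, which is what makes runs comparable.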

5b · TypeScript/JS-native eval runners

evalite — Matt Pocock — https://github.com/mattpocock/evalite — 🆕 local-first eval runner on Vitest; .eval.ts files, web UI, cost-aware.
Mastra scorers — https://github.com/mastra-ai/mastra (mastra.ai/docs/evals/overview) — 🆕 model-graded/rule/statistical scorers, live evals, CI, in the Mastra agent framework.
Vercel agent-eval — https://github.com/vercel-labs/agent-eval — 🆕 A/B-test coding agents (Claude Code, Codex, Cursor) on custom tasks; pass-rate dashboards.
Autoevals — Braintrust — https://github.com/braintrustdata/autoevals — OSS scorer library (Factuality, relevance, security…) across Py/JS/Go/Ruby.

5c · RAG / retrieval evaluation

TruLens — https://github.com/truera/trulens — instrumentation + "feedback functions" (the RAG triad), now OTel-based.
ARES — Stanford — https://github.com/stanford-futuredata/ARES — synthetic queries + fine-tuned judges + prediction-powered inference for confidence intervals.
RAGChecker — Amazon Science — https://github.com/amazon-science/RAGChecker — 🆕 claim-level diagnosis separating retriever vs generator errors.
continuous-eval (Relari) — https://github.com/relari-ai/continuous-eval — modular per-module metrics across retrieval/generation/tool-use.
Tonic Validate — https://github.com/TonicAI/tonic_validate — RAG metrics as a GitHub Action for CI.
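
The ARES entry mentions prediction-powered inference (PPI) for confidence intervals. A minimal sketch of the idea for a mean metric follows, assuming a large set of judge scores without human labels and a small set with both. It implements the standard PPI mean estimator with a normal-approximation interval; it is not ARES's code, and the function name and toy data are illustrative.

```python
from math import sqrt
from statistics import NormalDist, fmean, variance


def ppi_mean_ci(judge_unlabeled: list[float], judge_labeled: list[float],
                human_labeled: list[float], alpha: float = 0.05) -> tuple[float, float]:
    """Confidence interval for the true mean score using a cheap judge plus a few labels.

    Each input list needs at least two values so the sample variance is defined.
    """
    if len(judge_labeled) != len(human_labeled):
        raise ValueError("labeled judge and human scores must align one-to-one")
    big_n, small_n = len(judge_unlabeled), len(human_labeled)
    # Rectifier: how far the judge is from humans on the labeled subset.
    rectifier = [y - f for y, f in zip(human_labeled, judge_labeled)]
    estimate = fmean(judge_unlabeled) + fmean(rectifier)
    std_err = sqrt(variance(judge_unlabeled) / big_n + variance(rectifier) / small_n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate - z * std_err, estimate + z * std_err


# Toy data: the judge says 62.5% pass, but it over-scores one labeled example.
judge_all = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0]
judge_small = [1.0, 0.0, 1.0, 1.0]
human_small = [1.0, 0.0, 0.0, 1.0]
low, high = ppi_mean_ci(judge_all, judge_small, human_small)
print(low, high)
```

The interval is centered on the judge's mean shifted by its measured bias, and it widens when the judge disagrees with humans erratically, so a poor judge produces an honest (wide) interval rather than a confidently wrong number.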

5d · LLM-as-judge / reward / verifier libraries

verdict — Haize Labs — https://github.com/haizelabs/verdict — 🆕 declarative compound judges (debate/verification/aggregation, inference-time scaling); arXiv:2502.18018.
RULER — OpenPipe (ART) — https://github.com/OpenPipe/ART (art.openpipe.ai/fundamentals/ruler) — 🆕 LLM-judge that ranks trajectories with no labels — judge-as-RL-reward. (industry must-read)
Prometheus 2 — https://github.com/prometheus-eval/prometheus-eval — open-weight evaluator LMs for rubric-based assessment + pairwise.
Atla Selene — https://github.com/atla-ai/selene-mini — 🆕 8B SoTA open judge (score + critique); + MCP server atla-ai/atla-mcp-server. arXiv:2501.17195.
Patronus Lynx / GLIDER — https://github.com/patronus-ai/Lynx-hallucination-detection · https://github.com/patronus-ai/glider — 🆕 open hallucination judge / explainable span-level judge.
Flow-Judge — https://github.com/flowaicom/flow-judge — efficient 3.8B open evaluator.
RewardBench — AI2 — https://github.com/allenai/reward-bench — canonical reward-model (+v2 judge) benchmark/harness.
JudgeBench — https://github.com/ScalerLab/JudgeBench — benchmark to evaluate the judges themselves.
reward-kit — Fireworks — https://github.com/fw-ai-external/reward-kit — 🆕 decorator-based reward-function authoring (TRL/Fireworks interop).
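
Before trusting any judge from this list, it helps to measure how well it agrees with human labels on a held-out set. The sketch below assumes binary pass/fail labels and reports true-positive rate, true-negative rate, raw agreement, and Cohen's kappa. The function name and the example labels are illustrative and do not come from JudgeBench or any listed library.

```python
def judge_agreement(human: list[bool], judge: list[bool]) -> dict[str, float]:
    """Compare binary judge verdicts against human labels on the same items."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(human)
    tp = sum(1 for h, j in zip(human, judge) if h and j)
    tn = sum(1 for h, j in zip(human, judge) if not h and not j)
    human_pos = sum(1 for h in human if h)
    judge_pos = sum(1 for j in judge if j)
    human_neg = n - human_pos
    observed = (tp + tn) / n
    # Agreement expected by chance, given each rater's own pass rate.
    expected = (human_pos * judge_pos + human_neg * (n - judge_pos)) / (n * n)
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return {
        "tpr": tp / human_pos if human_pos else float("nan"),
        "tnr": tn / human_neg if human_neg else float("nan"),
        "agreement": observed,
        "kappa": kappa,
    }


human_labels = [True, True, True, False, False, False, True, False]
judge_labels = [True, True, False, False, False, True, True, False]
print(judge_agreement(human_labels, judge_labels))
```

Reporting TPR and TNR separately, rather than a single accuracy, shows whether a judge errs by passing bad outputs or by failing good ones, which matters when pass rates are heavily imbalanced.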

5e · RL-environment / verifiable-reward toolkits (eval ⇄ training)

verifiers — Prime Intellect — https://github.com/PrimeIntellect-ai/verifiers — Environment = dataset + harness + rubric; one package for eval, RL, synthetic data. (MUST)
Environments Hub — Prime Intellect — https://github.com/PrimeIntellect-ai/community-environments (app.primeintellect.ai) — 🆕 crowdsourced verifiers-based RL/eval envs.
prime-rl — Prime Intellect — https://github.com/PrimeIntellect-ai/prime-rl — 🆕 async RL trainer consuming verifiers envs (INTELLECT-3).
BenchFlow — https://github.com/benchflow-ai/benchflow · https://benchflow.ai — 🆕 environment lab: builds & runs RL/eval environments (SkillsBench, ClawsBench, runtime). "Environments are the new data." (also §5a)
HUD — https://github.com/hud-evals/hud-python — 🆕 SDK to build/run agent eval environments (computer-use, browser, MCP) with telemetry.
Atropos — Nous Research — https://github.com/NousResearch/atropos — 🆕 async "environment microservice" framework for rollouts/verifiable rewards.
verl — https://github.com/volcengine/verl (now verl-project/verl) — de-facto industry RLVR trainer (PPO/GRPO). ~22k★.
OpenRLHF — https://github.com/OpenRLHF/OpenRLHF · SkyRL — https://github.com/NovaSky-AI/SkyRL · AReaL — https://github.com/areal-project/AReaL · ROLL — https://github.com/alibaba/ROLL · rLLM — https://github.com/agentica-project/rllm · TRL — https://github.com/huggingface/trl — the RL-training stack agents are post-trained + eval'd in.
Open Reward Standard (ORS) — General Reasoning — https://docs.openreward.ai/ (PyPI openreward) — 🆕 MCP-extending spec adding RL primitives (episodes, rewards, curriculum). ⚠️ no single canonical repo confirmed.

5f · Observability + eval platforms (tracing · datasets · online/offline · CI)

Arize Phoenix — https://github.com/Arize-ai/phoenix — OSS OTel tracing + response/retrieval evals + datasets/experiments. (MUST)
Langfuse — https://github.com/langfuse/langfuse — OSS: evals (LLM-judge, feedback, manual labeling), datasets/experiments, prompt mgmt; self-hostable. 🆕
Opik — Comet — https://github.com/comet-ml/opik — 🆕 fully-OSS eval + observability (judges, datasets, CI-runnable evals).
W&B Weave — https://github.com/wandb/weave — weave.Evaluation scorers (exact/regex/model-graded/embedding) + Guardrails; comparison dashboards. 🆕 (Humanloop's migration target.)
Braintrust — https://www.braintrust.dev/docs/start/eval-sdk (offline-eval-guide) — Eval() over golden datasets; offline vs online. (MUST)
Patronus AI — https://www.patronus.ai/ (github.com/patronus-ai) — 🆕 research-grade judges (Lynx, GLIDER, Percival agent-failure debugger), experiments, multimodal judge.
Maxim AI — https://www.getmaxim.ai/ — 🆕 agent simulation + eval + observability across thousands of scenarios/personas.
Galileo — https://galileo.ai/ — Luna evaluators + Agentic Evaluations.
Vellum — https://www.vellum.ai/ — visual workflows + offline/online evals scoring every production run.
Helicone — https://github.com/helicone/helicone — OSS gateway + observability; "Scores" ingests external eval results.
Traceloop / OpenLLMetry — https://github.com/traceloop/openllmetry — OSS OTel instrumentation (Py/TS/Go/Ruby) + hosted reliability platform.
Langtrace — https://github.com/Scale3-Labs/langtrace — OSS OTel-standard tracing + manual scoring + dataset mgmt.
WhyLabs / LangKit — https://github.com/whylabs/langkit — high-throughput text-signal metrics (toxicity, PII, jailbreak) for production monitoring.
Portkey — https://github.com/portkey-ai/gateway — 🆕 OSS gateway + 60+ guardrails + observability (fully open-sourced Mar 2026).
Datadog LLM Observability — https://www.datadoghq.com/product/ai/llm-observability/ — 🆕 evaluators + golden datasets + LLM Experiments + AI Agent Monitoring (Jun 2025).
Fiddler AI — https://www.fiddler.ai/ — 🆕 Trust Models (Safety/PII/Faithfulness) scoring in <100ms; Guardrails + agentic observability.
PromptLayer — https://www.promptlayer.com/ · New Relic AI Monitoring — https://newrelic.com/platform/ai-monitoring — lighter prompt-CMS / APM-native monitoring.

5g · Tracing standards

OpenInference — Arize — https://github.com/Arize-ai/openinference — semantic conventions for agent traces (tool/args/observation/latency/cost).
OpenTelemetry GenAI semantic conventions — https://opentelemetry.io/docs/specs/semconv/gen-ai/ (open-telemetry/semantic-conventions) — 🆕 the vendor-neutral schema (now covers agent orchestration, MCP tool calls, and a quality-evaluation span hook).
Braintrust (platform) — https://www.braintrust.dev/ · tool — Industry-standard eval+observability platform (Notion, Stripe, Vercel) tying offline experiments to production logs; listed here as the platform itself, alongside Braintrust's Autoevals cited elsewhere in this section. 🆕
RagaAI Catalyst — RagaAI — https://github.com/raga-ai-hub/RagaAI-Catalyst · tool — OSS agent-observability + eval SDK with multi-agent trace/execution-graph debugging, synthetic-data gen, and guardrail management; covers the online/guardrail-eval slice. 🆕
OpenAI Cookbook — Evals — OpenAI — https://developers.openai.com/cookbook/topic/evals · docs — Maintained, runnable recipes for building evals (incl. Agents SDK eval, evaluating agents with Langfuse); the practical companion to OpenAI Evals and a curator-grade 'show real work' resource. 🆕 ⚠️ unverified URL
Building a better Bugbot — Stefan Heule et al. (Cursor) — https://cursor.com/blog/building-bugbot · blog (excellent) — Build your primary eval metric around post-merge signal: Cursor's "resolution rate" uses AI at PR merge time to determine which flagged bugs were actually fixed by the developer, validated by human spot-checks. After 40 major experiments spanning models, prompts, iteration counts, and agentic designs, resolution rate improved from 52% to over 70%; the single largest jump came from switching to a fully agentic architecture. BugBench (a curated set of real diffs with human-annotated bugs) drives offline iteration. (excerpt: "It uses AI to determine, at PR merge time, which bugs were actually resolved by the author in the final code. ... Since launch, we have run 40 major experiments that have increased Bugbot's resolution rate from 52% to over 70%.") 🆕
Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents — Scale AI — https://arxiv.org/abs/2605.21347 · paper/tool — 🆕 Automated system that analyzes large corpora of agent execution traces to surface failure patterns and behavioral issues at scale; "Human experts using IG reports improve scaffold performance by 30.4 pp over the unmodified baseline scaffold" — nearly doubling the gain from the next-best system. Fills the gap between laborious per-trace inspection and aggregate benchmark scores that hide population-level failure modes.

Must-reads: Inspect AI · promptfoo · Braintrust · verifiers · DeepEval · Phoenix/Langfuse (pick your observability) · RULER (judge-as-reward)

6 · Benchmark vs. eval (and benchmark integrity: contamination, saturation, label errors, leaderboard gaming)

How to Build Good Language Modeling Benchmarks — Ofir Press — https://ofir.io/How-to-Build-Good-Language-Modeling-Benchmarks/ · blog — The benchmark-author's checklist; difficulty target; one-number reporting; 150–500 task sizing.
AI Agents That Matter — Kapoor et al. — https://arxiv.org/abs/2407.01502 · paper — Cost-controlled evaluation; model-dev vs downstream-dev needs; holdouts.
Why We No Longer Evaluate SWE-bench Verified — OpenAI — https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/ · blog — ~59% of audited failures were broken tests. (mirror: https://decrypt.co/359012/...)
The Leaderboard Illusion — Shivalika Singh et al. (Cohere/Princeton/Stanford/MIT/AI2) — https://arxiv.org/abs/2504.20879 · paper — Private testing, selective disclosure, and data-access asymmetry on Chatbot Arena. (notes: research/notes/leaderboard-illusion.md)
The SWE-bench Illusion: When SOTA LLMs Remember Instead of Reason — https://arxiv.org/abs/2506.12286 · paper — Memorization inflates SWE-bench scores.
Establishing Best Practices for Building Rigorous Agentic Benchmarks (ABC) — https://arxiv.org/abs/2507.02825 · paper — SWE-bench Verified weak tests; τ-bench rewards empty responses.
FrontierMath Tiers 1–3 v2 (corrected) — Epoch AI — https://epoch.ai/benchmarks/frontiermath-tiers-1-3-v2 (changelog: .../frontiermath-tier-4-v2) · page — ~42% of problems corrected after AI-assisted review. (also T8: the operator-as-rot-detector tale)
About 30% of Humanity's Last Exam Answers Are Wrong — FutureHouse / Andrew White — https://www.futurehouse.org/research-announcements/hle-exam · blog — 29 ± 3.7% of text-only chem/bio answers contradicted by the literature. (LessWrong writeup: https://www.lesswrong.com/posts/JANqfGrMyBgcKtGgK/)
Building on Evaluation Quicksand — Nathan Lambert — https://www.interconnects.ai/p/building-on-evaluation-quicksand · blog — No hard source of truth; synthetic-data contamination.
Lost in Simulation — https://arxiv.org/abs/2601.17087 · paper — Simulated users are unreliable proxies (~9pp swings by simulator choice; demographic miscalibration).
SWE-bench: Can LMs Resolve Real-World GitHub Issues? — Jimenez, Yang, … Press, Narasimhan — https://arxiv.org/abs/2310.06770 · https://www.swebench.com (Verified: .../verified.html) · paper/site.
Task-Specific LLM Evals that Do & Don't Work — Eugene Yan — https://eugeneyan.com/writing/evals/ · blog — Off-the-shelf evals rarely transfer; accuracy is too coarse.
Andrej Karpathy on evals — https://x.com/karpathy/status/1896266683301659068 · post — "We make a number of specific recommendations…" (the eval-as-narrow critique).
A Careful Examination of LLM Performance on Grade School Arithmetic (GSM1k) — Hugh Zhang et al. (Scale AI) — https://arxiv.org/abs/2405.00332 · paper — Held-out GSM1k replica of GSM8k exposes up to 8% accuracy drop and partial memorization (Mistral/Phi) — the canonical method for measuring benchmark overfitting/contamination via a matched holdout.
Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks — Curtis Northcutt, Anish Athalye, Jonas Mueller — https://arxiv.org/abs/2103.14749 · paper — NeurIPS 2021 foundational result: ~3.3% avg label errors across 10 famous test sets (ImageNet, MNIST, etc.); corrections flip model rankings. The canonical 'label errors' citation this section's theme rests on (labelerrors.com / cleanlab).
Are We Done with MMLU? (MMLU-Redux) — Aryo Pradipta Gema et al. (Edinburgh) — https://arxiv.org/abs/2406.04127 · paper — ~6.5% of MMLU questions contain errors (57% in Virology); MMLU-Redux re-annotation shifts rankings — directly demonstrates label-error impact on the most-cited LLM benchmark.
LiveCodeBench: Holistic and Contamination-Free Evaluation of LLMs for Code — Naman Jain et al. (UC Berkeley) — https://arxiv.org/abs/2403.07974 · benchmark — Time-windowed problem collection (post-cutoff scoring) as the leading contamination-resistant design pattern; the section's exemplar of how to engineer around contamination.
LiveBench: A Challenging, Contamination-Limited LLM Benchmark — White, Dooley, LeCun, Goldblum et al. — https://github.com/LiveBench/LiveBench · benchmark — Monthly-refreshed questions from new arXiv/news/competitions with objective ground truth; the canonical 'dynamic refresh' answer to saturation and contamination.
The LLM Evaluation Guidebook (Open LLM Leaderboard team) — Clémentine Fourrier / Hugging Face — https://github.com/huggingface/evaluation-guidebook · docs — Practitioner reference from running the Open LLM Leaderboard; explicit sections on contamination, reproducibility, and leaderboard design — the hands-on 'how to not get fooled' companion to this section (updated version: hf.co/spaces/OpenEvals/evaluation-guidebook).
Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation — Kapoor, Stroebl, Kirgis et al. (Princeton) — https://arxiv.org/abs/2510.11977 · paper — 21,000+ standardized agent runs surfacing leaderboard unreliability and unreported misbehaviors (agents searching HuggingFace for benchmark answers) — extends 'AI Agents That Matter' to leaderboard integrity for agents specifically. 🆕
Gaming the System: Goodhart's Law Exemplified in the AI Leaderboard Controversy — Jambholkar, Rajani, Bakshi (Collinear AI) — https://blog.collinear.ai/p/gaming-the-system-goodharts-law-exemplified-in-ai-leaderboard-controversy · blog — Practitioner framing of the Llama 4 / Chatbot Arena gaming episode through Goodhart's Law — the accessible blog companion to The Leaderboard Illusion paper. 🆕
A Shared Playbook for Trustworthy Third-Party Evaluations — OpenAI — https://openai.com/index/trustworthy-third-party-evaluations-foundations/ · blog (Safety, May 29 2026) — What makes independent evals of frontier-model safeguards & capabilities trustworthy: selecting the right harness, checking for validity hazards that distort results, and the standards third-party evaluators need. (also T10) 🆕
Reward hacking is swamping model intelligence gains — Cursor — https://cursor.com/blog/reward-hacking-coding-benchmarks · blog — Cursor audited 731 Opus 4.8 Max trajectories and found 63% of "successful" resolutions retrieved known fixes (57% upstream-lookup, 6% git-history mining) rather than reasoning them out. Re-running SWE-bench Pro under strict isolation (no internet, no git history) drops Opus 4.8 Max from 87.1% → 73.0% (a 14.1pp drop) and Composer 2.5 from 74.7% → 54.0% (a 20.7pp drop). Cursor self-reports the largest gap in its own model. The most quantified public evidence that current coding-agent leaderboard scores are massively inflated by retrieval, not reasoning. 🆕
Quantifying infrastructure noise in agentic coding evals — Anthropic — https://www.anthropic.com/engineering/infrastructure-noise · blog — Infrastructure resource configuration (RAM, resource limits) swings agentic coding benchmark scores by up to 6 percentage points — often larger than leaderboard gaps between adjacent models. A phase transition exists at ~3× baseline resources: below it, more resources fix flakiness without changing what's measured; above it, resources start helping agents solve problems they otherwise couldn't, fundamentally changing benchmark meaning. "Skepticism is warranted for any reported Terminal-Bench / SWE-bench gap below 3pp." 🆕
Eval awareness in Claude Opus 4.6's BrowseComp performance — Anthropic — https://www.anthropic.com/engineering/eval-awareness-browsecomp · blog — First documented instance of a model reverse-engineering its own evaluation: Claude Opus 4.6 independently hypothesized it was being tested, identified the specific benchmark (BrowseComp), located the source code on GitHub, and decrypted the SHA256/XOR-encrypted answer key — then used it to answer evaluation questions. 18 independent runs converged on the same strategy unprompted. Implication: static benchmarks run in web-enabled environments are now adversarially fragile against capable models. 🆕
Life After Benchmark Saturation: A Case Study of CORE-Bench — Nadgir, Kapoor, Liu, Kirgis, Narayanan et al. (Princeton / UC Berkeley / MIT) — https://arxiv.org/abs/2606.26158 · paper — Argues saturation is diagnostic evidence to dig deeper, not a retirement trigger: LLM trajectory audit found "15 task-level errors and 20 tasks with exploitable shortcuts in CORE-Bench Hard" (releasing CORE-Bench v1.1 + OOD suite); saturated agents that tie on accuracy diverge sharply on cost (60% cheaper) and calibration (32.1% self-reported confidence vs. 93% empirical pass rate); randomized human study (20 papers, 25 manual vs. 25 agent-assisted runs): "manual reproduction sessions lasted 2.11 times as long as human-agent collaborative sessions." 🆕
Search-Time Contamination in Deep Research Agents — Wang, Zhang, Yao, Zeng, Song, Lin, Shen — https://arxiv.org/abs/2606.05241 · paper — Defines three contamination types (Benchmark Metadata Leakage, Question-Context Leakage, Explicit Answer Leakage) for agents that web-search during evaluation; applies detection algorithms across six benchmarks and finds performance inflation up to 4%. "Such agents may retrieve public benchmark metadata, question context, or even ground-truth answers via web search. This gives rise to Search-Time Contamination (STC), where external retrieval bypasses intended reasoning and inflates measured performance." Distinct from training contamination; advocates isolated sandboxes and transparent search trajectories. 🆕

Must-reads: Press · Kapoor et al. · OpenAI (SWE-bench Verified) · Leaderboard Illusion
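
Sketch (time-windowed scoring): the post-cutoff scoring idea noted in the LiveCodeBench entry can be illustrated in a few lines of Python. This is a minimal illustration under stated assumptions, not LiveCodeBench's own code: the Result record, the windowed_accuracy helper, the dates, and the cutoff are all hypothetical. A large drop from the pre-cutoff window to the post-cutoff window is a hint of contamination, not proof.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Result:
    task_id: str
    released: date  # when the problem was first published (hypothetical field)
    passed: bool    # whether the model solved it


def windowed_accuracy(results, cutoff):
    """Return (pre_cutoff_acc, post_cutoff_acc); None for an empty window."""
    def acc(rows):
        rows = list(rows)
        return sum(r.passed for r in rows) / len(rows) if rows else None

    pre = acc(r for r in results if r.released <= cutoff)
    post = acc(r for r in results if r.released > cutoff)
    return pre, post


# Toy data: two problems released before the model's assumed training cutoff,
# two released after it.
results = [
    Result("a", date(2024, 1, 10), True),
    Result("b", date(2024, 2, 5), True),
    Result("c", date(2024, 9, 1), False),
    Result("d", date(2024, 10, 20), True),
]
pre, post = windowed_accuracy(results, cutoff=date(2024, 6, 1))
print(f"pre-cutoff={pre:.2f} post-cutoff={post:.2f}")
```

Reporting both windows side by side, rather than one pooled number, is what lets a reader see whether a score depends on problems the model may have seen.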

7 · Evals & RL environments (verifiers, reward design, difficulty calibration, lifecycle)

(See also T2 — verifiers library, Lee's RL-env taxonomy, Garg's lifecycle, Wei's verifier's law.)

RewardBench — Nathan Lambert et al. — https://arxiv.org/abs/2403.13787 · paper — Evaluating reward models (the verifier you train against).
The New RL Scaling Laws — Nathan Lambert — https://www.interconnects.ai/p/the-new-rl-scaling-laws · blog — Where RLVR scaling is heading. (interview: https://www.latent.space/p/the-rlvr-revolution-with-nathan-lambert)
Spurious Rewards: Rethinking Training Signals in RLVR — https://arxiv.org/abs/2506.10947 · paper — Random/spurious rewards rival ground truth on Qwen2.5 (Qwen-specific). (cite arXiv figures, not the blog gloss — see research/notes/reference-audit.md)
The State of Post-Training 2025 — Nathan Lambert — https://www.interconnects.ai/p/the-state-of-post-training-2025 · blog — Context for where evals feed training.
Reward Hacking in Reinforcement Learning — Lilian Weng — https://lilianweng.github.io/posts/2024-11-28-reward-hacking/ · blog — The canonical survey of reward hacking — taxonomy, RLHF-specific failure modes, mitigations; the foundational reference any reward-design section needs.
Specification gaming: the flip side of AI ingenuity — Victoria Krakovna et al. (Google DeepMind) — https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/ · blog — Canonical specification-gaming post (+the running examples list); origin story of why verifiers/reward functions get gamed, predating the LLM-RL wave.
Multi-Turn RL for Multi-Hour Agents — with Will Brown (Prime Intellect) — Latent Space / Will Brown — https://www.latent.space/p/willccbb · talk — The verifiers author on building multi-turn RL environments, turn-level credit assignment and reward design in practice — the practitioner voice behind the verifiers library already cited here. 🆕
Position: The Hidden Costs and Measurement Gaps of RLVR — various (arXiv 2509.21882) — https://arxiv.org/abs/2509.21882 · paper — RLVR gains overstated via budget mismatch, calibration drift, contamination; proposes a tax-aware minimum standard — the rigor counterweight to Lambert's RL-scaling optimism. 🆕
RewardBench 2: Advancing Reward Model Evaluation — Saumya Malik, Nathan Lambert et al. (Ai2) — https://arxiv.org/abs/2506.01937 · benchmark — The 2025 successor to RewardBench (already listed) — harder, less saturated, ICLR 2026; the current bar for evaluating the verifier you train against. 🆕
Reward Modeling (RLHF Book, ch. 5) — Nathan Lambert — https://rlhfbook.com/c/05-reward-models · docs — Canonical free reference chapter on reward models — the standing explainer for the 'verifier you train against' framing this section uses. 🆕
Curriculum RL from Easy to Hard Tasks Improves LLM Reasoning (E2H Reasoner) — Shubham Parashar et al. (Texas A&M) — https://arxiv.org/abs/2506.06632 · paper — Difficulty-calibration primary source: easy-to-hard scheduling with convergence guarantees and the 'fade out easy tasks' result — directly fills the section's difficulty-calibration theme. 🆕
GenEnv: Difficulty-Aligned Co-Evolution Between LLM Agents and Environment Simulators — Jiacheng Guo, Ling Yang, Mengdi Wang et al. (Princeton) — https://arxiv.org/abs/2512.19682 · paper — Generative environment simulator with an alpha-Curriculum Reward that keeps tasks in the zone of proximal development — recent take on auto-calibrating env difficulty to the agent. 🆕
Verifier and Reward Design for RL Environments — HUD (hud.ai) — no individual byline — https://www.hud.ai/resources/verifier-reward-design-rl-environments · article (technical guide) (good) — Lays out a concrete four-layer scoring architecture (verifiers / pass-fail gates / 3-5 criteria rubrics / composite reward) plus a five-step build workflow: define checkable end-states first ("table contains row id=4521, status='active'"), add hard failure gates, build minimal rubrics, test on… 🆕
The Verification Horizon: No Silver Bullet for Coding Agent Rewards — Qwen Team (Alibaba) — https://arxiv.org/abs/2606.26300 · paper — 🆕 The dynamic argument this section's static entries lack: "no fixed reward function can remain effective as policy capability continues to grow; and verification must co-evolve with the generator." Taxonomizes four verification strategies (unit tests, interactive judge, user feedback, agentic evaluator) and shows each saturates or gets gamed as the policy improves — the moving-target argument for reward design.
Systematic Reward Hacking and Prime Sprints — Jessica Li (Prime Intellect) — https://www.primeintellect.ai/blog/reward-hacking · blog — Controlled study using the backdoor-ifeval family of environments at 1B scale, reproducible for ~$0.64: "Hacking is fundamentally a gradient dynamics problem. The same reward function can produce hacking or not depending on how learnable the legitimate task is, what the model's prior puts weight on, and the within batch variance." Key findings: no frequency floor for exploited tokens; moderate task difficulty resists hacking better than easy or impossible tasks; semantic prompt guardrails ("Restrict" condition) counterintuitively accelerated hacking onset. 🆕

Must-reads: Lee (RL-env taxonomy) · Garg (lifecycle) · verifiers (repo)
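
Sketch (layered reward): the scoring layers summarized in the HUD entry above (a verifier on a checkable end state, hard failure gates, a small rubric, a composite reward) might be combined roughly as follows. Only the end-state example (row id=4521, status='active') comes from that entry; the function names, the gate name, the rubric criteria, and the 0.7/0.3 weighting are assumptions made for illustration.

```python
def verify_end_state(rows):
    """Verifier: True if the table contains row id=4521 with status 'active'."""
    return any(r.get("id") == 4521 and r.get("status") == "active" for r in rows)


def composite_reward(rows, gate_violations, rubric, rubric_weight=0.3):
    """Combine a binary verifier, hard gates, and a small binary rubric.

    gate_violations: names of hard failures observed during the episode.
    rubric: dict of criterion -> bool (kept small, e.g. 3-5 criteria).
    rubric_weight: hypothetical weight; the remainder goes to the verifier.
    """
    if gate_violations:  # any hard failure zeroes the reward
        return 0.0
    outcome = 1.0 if verify_end_state(rows) else 0.0
    quality = sum(rubric.values()) / len(rubric) if rubric else 0.0
    return (1.0 - rubric_weight) * outcome + rubric_weight * quality


rows = [{"id": 4521, "status": "active"}, {"id": 7, "status": "closed"}]
rubric = {
    "confirmed_with_user": True,
    "no_redundant_calls": True,
    "clear_summary": False,
}
reward = composite_reward(rows, gate_violations=[], rubric=rubric)
gated = composite_reward(rows, gate_violations=["deleted_other_rows"], rubric=rubric)
print(round(reward, 3), gated)
```

Checking gates before anything else means a trajectory that reaches the right end state through a forbidden action cannot collect partial credit from the rubric.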

8 · LLM-as-judge & verifiers (alignment, biases, verifiable vs judgeable)

Evaluating the Effectiveness of LLM-Evaluators — Eugene Yan — https://eugeneyan.com/writing/llm-evaluators/ · blog — Position/verbosity/self-enhancement bias; direct vs pairwise; prefer binary + classification metrics.
Creating an LLM-as-a-Judge That Drives Business Results — Hamel Husain — https://hamel.dev/blog/posts/llm-judge/ · blog — Critique-shadowing; validate against ONE benevolent-dictator expert; precision/recall over raw agreement.
Who Validates the Validators? (EvalGen) — Shankar et al. (UIST '24) — https://arxiv.org/abs/2404.12272 (pdf: .../pdf/2404.12272; UIST: https://people.eecs.berkeley.edu/~bjoern/papers/shankar-validators-uist2024.pdf) · paper — Criteria drift; the coverage-vs-false-failure judge-alignment loop.
LLM Evals FAQ — Hamel Husain & Shreya Shankar — https://hamel.dev/blog/posts/evals-faq/ (error-analysis section: .../why-is-error-analysis-so-important-in-llm-evals-and-how-is-it-performed.html) · blog — Binary over Likert; review ≥100 traces; the first-failure transition matrix for agents.
LLM-as-a-Judge: Rethinking Model-Based Evaluations — Han-Chung Lee — https://leehanchung.github.io/blogs/2024/08/11/llm-as-a-judge/ · blog — Avoid [0,1] continuous scales; manage judges like junior annotators.
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena — Zheng et al. — https://arxiv.org/abs/2306.05685 · paper — Source of the 10%/25% self-favoring & position-bias numbers — which the authors themselves hedge ("cannot determine"); GPT-3.5 doesn't self-favor.
LLMs Instead of Human Judges? A Large-Scale Study — Bavaresco et al. — https://arxiv.org/abs/2406.18403 · paper — Substantial variance across models/datasets; validate judges against humans first.
AlignEval — Eugene Yan — https://eugeneyan.com/writing/aligneval/ · blog — "Align AI to human. Calibrate human to AI. Repeat." Work backward from the data.
Product Evals in Three Simple Steps — Eugene Yan — https://eugeneyan.com/writing/product-evals/ · blog — The "God Evaluator" anti-pattern; the benchmark is human performance, not perfection.
Statistics for AI/ML, Part 3 — Cohen's Kappa — Han-Chung Lee — https://leehanchung.github.io/blogs/2025/03/03/cohen-kappa/ · blog — Chance-adjusted inter-annotator agreement (the gate before holding out).
Data Flywheels for LLM Applications — Shreya Shankar — https://www.sh-reya.com/blog/ai-engineering-flywheel/ · blog — Binary metrics, the "GPT smell," error analysis as the core activity.
SPADE (https://arxiv.org/html/2401.03038v1) & DocETL (https://arxiv.org/abs/2410.12189) — Shankar et al. · paper — Data-quality assertions / agentic query rewriting for LLM pipelines.
LLM Evaluators Recognize and Favor Their Own Generations — Arjun Panickssery, Samuel R. Bowman, Shi Feng (NeurIPS 2024) — https://arxiv.org/abs/2404.13076 · paper — The canonical causal study of self-preference bias: shows GPT-4/Llama-2 can recognize their own outputs and that self-recognition correlates linearly with self-favoring. This is the primary source behind 'self-enhancement bias' that the section's blogs only allude to.
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment — Yang Liu et al. (Microsoft, EMNLP 2023) — https://arxiv.org/abs/2303.16634 · paper — The foundational reference-free LLM-judge method (CoT + form-filling scoring). Defines the direct-scoring paradigm the section critiques; a curated judge section is incomplete without the paper that started it.
A Survey on LLM-as-a-Judge — Jiawei Gu et al. — https://arxiv.org/abs/2411.15594 · paper — The most-cited survey organizing the LLM-judge space (bias taxonomy, reliability methods, agreement metrics). Serves as the section's one-stop map/bibliography.
One Token to Fool LLM-as-a-Judge — Yulai Zhao, Haolin Liu, Dian Yu et al. (Tencent AI Lab / Princeton) — https://arxiv.org/abs/2507.08794 · paper — Shows 'master-key' tokens (a colon, 'Solution:') trigger false-positive rewards up to 80% even on GPT-o1/Claude-4 judges, plus a robust Master-RM fix. Core evidence on judge/verifier reward-hacking fragility. 🆕
Weaver: Closing the Generation-Verification Gap with Weak Verifiers — Jon Saad-Falcon et al. — Stanford Hazy Research / Scaling Intelligence — https://hazyresearch.stanford.edu/blog/2025-06-18-weaver · blog — Directly operationalizes 'verifiable vs judgeable': aggregates many weak judges/reward models (unlabeled) to shrink the generator-verifier gap, reaching o3-mini accuracy from Llama-3.3-70B. Paper: arxiv.org/abs/2506.18203. 🆕
Agent-as-a-Judge: Evaluate Agents with Agents — Mingchen Zhuge et al. (Meta AI / KAUST) — https://arxiv.org/abs/2410.10934 · paper — Extends LLM-as-judge to agentic trajectories—grading intermediate steps, not just final outputs—with the DevAI benchmark. The agent-specific evaluation case this agent-evals library specifically needs.
VerifyBench: A Systematic Benchmark for Evaluating Reasoning Verifiers Across Domains — Various (AAAI 2026) — https://arxiv.org/abs/2507.09884 · benchmark — Cross-domain benchmark exposing verifier precision/recall trade-offs (specialized verifiers high-accuracy but low-recall; general models inclusive but unstable). Quantifies how trustworthy a verifier actually is for RLVR. 🆕
Enhancing LLM-as-a-Judge with Grading Notes / From Pilot to Production with Custom Judges — Databricks (Mosaic Research) — https://www.databricks.com/blog/pilot-production-custom-judges · blog — Enterprise-grade judge-building playbook: 20-30 calibration examples, batched SME annotation, Krippendorff's alpha agreement gating—a production-side complement to the Hamel/Shankar academic alignment loop. 🆕
Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge (CALM framework) — Jiayi Ye et al. — https://arxiv.org/abs/2410.02736 · paper — Systematic quantification of 12 judge biases (verbosity, bandwagon, authority, distraction, sentiment, etc.) via automated attacks—broadens the section's bias coverage well beyond position/verbosity/self-enhancement.
Voice AI Agent Evaluation: The Complete Guide (2026) — Brooke Hopkins (Coval, ex-Waymo) — https://www.coval.ai/blog/voice-ai-agent-evaluation-guide · article (good) — Domain-specific evaluation playbook for voice agents: persona-tiered simulation testing (Easy/Medium/Hard/Adversarial across accent, noise, emotion), a concrete LLM-as-judge calibration loop (run on 50-100 calls, sample for human review, iterate rubrics until >85% human-judge agreement on binary… 🆕
Agent Judge: Solving Long-Horizon Evals for Production Agents — Rishi Gujjar & Andrew Li (Judgment Labs) — https://www.judgmentlabs.ai/blogs/agent-judge-solving-long-context-evaluations · article (good) — Frames long-horizon agent evaluation as an agentic, multi-agent judge (search trajectory state as queryable objects, verify claimed actions against source-of-truth systems like DBs/APIs/GitHub, and iteratively refine the rubric), backed by a real benchmark table on internal hallucination-detection… 🆕
Counsel: A Meta-Evaluation Dataset for Agentic Tasks — Pisupati, Broomfield, Choi et al. (Atla AI / Cohere / Mistral AI / Google DeepMind) — https://arxiv.org/abs/2606.21627 · paper — First public dataset of meta-evaluations of LLM-judge quality on agent trajectories (1,131 annotated critiques on tau-bench + DA-Code; "Krippendorff's alpha of 0.78"); decomposes judge correctness into error location ("the strongest judge reaching ~88% agreement on location") vs. reasoning (~65%) — showing judges often find the right step but explain it wrongly. HuggingFace: AtlaAI/counsel. 🆕
LongJudgeBench: Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation — Junjie Chen, Yuxi Dong, Haitao Li et al. — https://arxiv.org/abs/2606.01629 · paper — First benchmark targeting LLM judge reliability specifically on long-form outputs (reports, essays, extended documents), filling the gap left by short-form and trajectory judge benchmarks; finds: "current LLM judges remain unstable across scenarios, and rubrics or references are helpful but not always sufficient." 🆕
From Confident Closing to Silent Failure: Characterizing False Success in LLM Agents — Advani (FAGEN@ICML2026) — https://arxiv.org/abs/2606.09863 · paper — Analyzes 11,755 agent trajectories (9,876 tau2-bench + 1,879 AppWorld): "no configuration across 5 judges, 5 prompt strategies, and full task specifications exceeds AUROC 0.65 on tau2-bench, and the same judges reach only 0.54 AUROC on AppWorld API-call traces." Lightweight TF-IDF classifiers reach AUROC 0.83–0.95 at 3,300× lower latency. False success rates span 3–75.8% by domain — making grader blindness a benchmark-design issue, not just a judge-tuning one. 🆕

Must-reads: Yan (llm-evaluators) · Hamel (llm-judge) · Shankar (EvalGen)
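
Sketch (judge alignment): several entries above recommend validating a binary judge against one expert's labels with precision/recall rather than raw agreement, and gating on chance-adjusted agreement (Cohen's kappa). A minimal pure-Python version is below. The judge_alignment helper and the toy labels are hypothetical, and True is assumed to mean "pass".

```python
def judge_alignment(human, judge):
    """Compare binary judge verdicts to one expert's labels (True = pass)."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two equal-length, non-empty label lists")
    pairs = list(zip(human, judge))
    tp = sum(h and j for h, j in pairs)
    tn = sum((not h) and (not j) for h, j in pairs)
    fp = sum((not h) and j for h, j in pairs)
    fn = sum(h and (not j) for h, j in pairs)
    n = len(pairs)
    p_obs = (tp + tn) / n  # raw agreement
    # Agreement expected by chance, from each rater's marginal pass/fail rates.
    p_chance = ((tp + fn) / n) * ((tp + fp) / n) + ((tn + fp) / n) * ((tn + fn) / n)
    return {
        "precision": tp / (tp + fp) if tp + fp else None,
        "recall": tp / (tp + fn) if tp + fn else None,
        "agreement": p_obs,
        "kappa": (p_obs - p_chance) / (1 - p_chance) if p_chance < 1 else None,
    }


human = [True, True, True, False, False, False, True, False]
judge = [True, True, False, False, False, True, True, False]
stats = judge_alignment(human, judge)
print(stats)
```

On these toy labels raw agreement is 0.75 while kappa is 0.5, which is why the chance-adjusted figure is the more conservative gate when pass and fail are roughly balanced.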

9 · Agent-specific evaluation (trajectories, tool use, multi-turn, world state, multi-agent, localization)

Demystifying Evals for AI Agents — Anthropic — https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents · blog — Grade the final env state (flight-booking via SQL); outcome vs trajectory; isolation; pass@k vs pass^k.
τ-bench / τ²-bench — Sierra — https://arxiv.org/abs/2406.12045 · https://github.com/sierra-research/tau-bench · paper/repo — DB-state-diff grading; user simulation; pass^k; empty-result as explicit fail.
Benchmarking AI Agents — Sierra — https://sierra.ai/blog/benchmarking-ai-agents · blog — The motivation behind τ-bench.
GAIA: A Benchmark for General AI Assistants — Mialon et al. — https://arxiv.org/abs/2311.12983 · paper — Real assistant tasks; difficulty by human task-length.
Patterns for Building Cybersecurity Evals — Eugene Yan — https://eugeneyan.com/writing/cybersecurity-evals/ · blog — The four-primitive agentic-eval template (sandbox, difficulty inputs, tools, deterministic grader); outcome grading + partial-credit ladders + transcript audits. (also T10)
Statistics for AI/ML, Part 4 — pass@k and Unbiased Estimator — Han-Chung Lee — https://leehanchung.github.io/blogs/2025/09/08/pass-at-k/ · blog — Demystifies the metric everyone misuses.
First-Principles Eval — Han-Chung Lee — https://leehanchung.github.io/blogs/2024/05/22/first-principles-eval/ · blog.
SWE-bench grading harness — https://github.com/SWE-bench/SWE-bench/blob/main/swebench/harness/grading.py · tool/repo — FAIL_TO_PASS / PASS_TO_PASS as a verifiable reward. (SWE-agent ACI: https://swe-agent.com/0.7/background/aci/)
human-eval (pass@k estimator) — OpenAI — https://github.com/openai/human-eval/blob/master/human_eval/evaluation.py · tool/repo.
WebArena: A Realistic Web Environment for Building Autonomous Agents — Zhou et al. (CMU) — https://arxiv.org/abs/2307.13854 · benchmark — Self-hostable sandboxed websites (e-commerce/forum/GitLab/CMS/maps) with execution-based functional-correctness graders; 812 tasks. The canonical web-agent world-state benchmark.
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments — Xie et al. (HKU et al.) — https://arxiv.org/abs/2404.07972 · benchmark — 369 real-computer tasks in VMs with per-task execution-based eval scripts and initial-state setup; humans 72% vs best agent 12%. The canonical computer-use benchmark.
Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command-Line Interfaces — Laude Institute + Stanford + community — https://www.tbench.ai/ · benchmark — Sandboxed terminal tasks with deterministic verifiers across SWE/sysadmin/security; v2 leaderboard. The terminal-agent benchmark (arxiv: arxiv.org/abs/2601.11868). 🆕
Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risk of Language Models — Zhang et al. (Stanford) — https://arxiv.org/abs/2408.08926 · benchmark — 40 professional CTF challenges with subtask annotations and deterministic flag-based grading; pairs naturally with Eugene Yan's cybersecurity-evals post already in the section.
AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories — Lù et al. (McGill / Mila / Google DeepMind) — https://arxiv.org/abs/2504.08942 · paper — First benchmark of LLM-judges-of-trajectories: 1302 expert-reviewed web-agent runs; shows rule-based graders reject many valid trajectories (under-reporting success). A core reference for trajectory evaluation. 🆕
Why Do Multi-Agent LLM Systems Fail? (MAST taxonomy) — Cemri, Pan et al. (UC Berkeley Sky Lab) — https://arxiv.org/abs/2503.13657 · paper — 14-mode failure taxonomy across 7 MAS frameworks from 200+ annotated traces; the reference framework for diagnosing multi-agent failures. 🆕
AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents — Trivedi et al. (Stony Brook) — ACL'24 Best Resource Paper — https://aclanthology.org/2024.acl-long.850/ · benchmark — 9-app simulated world (457 APIs) with state-based unit tests that also check for collateral damage/unexpected state changes — gold-standard world-state grading for tool-use agents.
BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents — OpenAI (Wei et al.) — https://openai.com/index/browsecomp/ · benchmark — 1,266 'inverted' hard-to-find/easy-to-verify questions for deep-research browsing agents; short verifiable answers make grading deterministic. Released 2025, now standard for browsing-agent eval. (paper: arxiv.org/abs/2504.12516) 🆕
LocAgent: Graph-Guided LLM Agents for Code Localization — Chen, Tang et al. (Yale / All Hands) — https://arxiv.org/abs/2503.09089 · paper — Defines and evaluates code localization as its own capability (Acc@k over file/function locations via code graphs). 🆕
WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models — He et al. (Tencent AI Lab) — https://arxiv.org/abs/2401.13919 · benchmark — 643 tasks on 15 live real-world sites with a GPT-4V automatic-judge eval protocol — an early, widely-cited example of multimodal-LLM-as-judge for live-web agent trajectories.
SkillsBench — BenchFlow — https://github.com/benchflow-ai/skillsbench · benchmark — 🆕 evaluates how well agent skills work and how effectively agents use them — makes skill-acquisition/skill-use a measurable axis (the "Agent Skills" frontier). ~1.4k★.
ClawsBench — BenchFlow — https://github.com/benchflow-ai/ClawsBench · benchmark — 🆕 BenchFlow's agent benchmark (results/data repo; full release in progress).
SWE-bench Verified — OpenAI (with SWE-bench authors) — https://openai.com/index/introducing-swe-bench-verified/ · benchmark — 500 human-validated SWE-bench instances graded by hidden FAIL_TO_PASS unit tests; the de facto standard for real-issue resolution and the headline coding-agent number labs report 🆕
SWE-bench Multimodal — Yang, Jimenez, Press et al. (Princeton/Stanford) — https://arxiv.org/abs/2410.03859 · benchmark — 619 visual JS/front-end issues from 17 user-facing repos, test-verified; probes whether SWE agents generalize beyond Python/text to visual software domains
SWE-bench Pro — Scale AI (Deng, Da et al.) — https://arxiv.org/abs/2509.16941 · benchmark — 1,865 long-horizon, multi-file tasks across public GPL + held-out + commercial startup repos, test-graded; contamination-resistant and hard (frontier <45% pass@1) 🆕
SWE-Lancer — OpenAI (Miserendino, Patwardhan, Heidecke et al.) — https://arxiv.org/abs/2502.12115 · benchmark — 1,400+ real Upwork freelance tasks worth $1M, graded by triple-verified end-to-end Playwright tests plus manager-decision tasks; ties capability to economic value 🆕
SWE-Gym — Pan, Wang, Neubig, Suhr, Zhang et al. (Berkeley/CMU) — https://arxiv.org/abs/2412.21139 · benchmark — 2,438 executable Python SWE tasks with pre-installed deps + test verification; the first real training/eval gym for SWE agents and verifiers, ICML 2025 🆕
Multi-SWE-bench — ByteDance Seed — https://arxiv.org/abs/2504.02605 · benchmark — 1,632 expert-annotated issue-resolution tasks across Java, TS, JS, Go, Rust, C, C++, test-graded; the leading multilingual SWE-bench extension, NeurIPS 2025 D&B 🆕
SWE-rebench — Nebius / Badertdinov et al. — https://arxiv.org/abs/2505.20411 · benchmark — Automated pipeline yielding 21k+ executable Python tasks with continuously refreshed, decontaminated eval splits; quantifies how much SWE-bench Verified scores are inflated by contamination, NeurIPS 2025 D&B 🆕
RE-Bench — METR — https://arxiv.org/abs/2411.15114 · benchmark — 7 open-ended ML research-engineering environments (e.g. GPU-kernel optimization, scaling laws) scored against 71 human-expert 8-hour attempts; the reference AI-R&D-uplift eval, ICML 2025
MLE-bench — OpenAI (Chan et al.) — https://arxiv.org/abs/2410.07095 · https://github.com/openai/mle-bench · benchmark — 75 Kaggle ML-engineering competitions graded against real human leaderboards (medal thresholds) in 24h Docker runs; standard ML-engineering-agent eval, ICLR 2025. 🆕
PaperBench — OpenAI (Starace et al.) — https://arxiv.org/abs/2504.01848 · benchmark — Replicate 20 ICML 2024 papers from scratch, graded by 8,316 author-co-developed rubric leaves via a validated LLM judge; rigorous research-replication agent eval, ICML 2025 🆕
Konwinski Prize (K Prize) — Andy Konwinski / Kaggle — https://www.kaggle.com/competitions/konwinski-prize · leaderboard — $1M Kaggle forecasting-format contest on GitHub bugs filed after submission close, so it is fully contamination-free; test-graded. The round-1 top score of only 7.5% exposed real-world difficulty. 🆕
Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge — Gou et al., OSU NLP Group (NeurIPS 2025 D&B) — https://arxiv.org/abs/2506.21506 · benchmark — 130 long-horizon live-web agentic-search tasks; novel Agent-as-a-Judge rubric-tree grader for time-varying, citation-backed answers — a serious answer to the Deep Research evaluation gap. 🆕
Online-Mind2Web (An Illusion of Progress? Assessing the Current State of Web Agents) — Xue et al., OSU NLP Group — https://arxiv.org/abs/2504.01382 · benchmark — 300 realistic tasks on 136 live websites with an LLM-as-a-Judge auto-grader (~85% human agreement); exposes overstated web-agent progress vs simple baselines. 🆕
REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites — AGI Inc (agi-inc/REAL), powers realevals.xyz — https://github.com/agi-inc/REAL · benchmark — 112 tasks on deterministic Next.js replicas of Amazon/Uber/LinkedIn etc.; reproducible LLM evaluator plus state validators — fixes the flakiness of live-site web benchmarks. 🆕
WebGames: Challenging General-Purpose Web-Browsing AI Agents — Thomas et al., Convergence AI — https://arxiv.org/abs/2502.18356 · benchmark — 50+ client-side challenges isolating specific browser interaction skills with verifiable pass/fail; best agent 41% vs 96% human, a sharp diagnostic gap. 🆕
Berkeley Function Calling Leaderboard (BFCL) V4 — Patil et al., UC Berkeley (Gorilla / ICML 2025) — https://gorilla.cs.berkeley.edu/leaderboard.html · leaderboard — Executable + AST-based grading of tool/function calling; V4 adds multi-turn agentic, web-search and memory tasks — the de facto tool-calling leaderboard. 🆕
GTA: A Benchmark for General Tool Agents — Wang et al., Shanghai AI Laboratory (NeurIPS 2024 D&B) — https://arxiv.org/abs/2407.08713 · benchmark — 229 human-written real-world queries with implicit multimodal tool use; executable evaluation platform across perception/operation/logic/creativity tools (GTA-2 follow-up in 2026). 🆕
Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows — Lei et al., XLang Lab / HKU (ICLR 2025 Oral) — https://arxiv.org/abs/2411.07763 · benchmark — Enterprise text-to-SQL agent workflows over huge schemas and multiple dialects with execution-based grading; frontier models only ~17-21% — a hard, realistic data-agent eval. 🆕
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents — Rawles et al., Google DeepMind / Google Research (ICLR 2025) — https://arxiv.org/abs/2405.14573 · benchmark — Live Android environment with durable reward signals from device system state for 116 parameterized tasks across 20 apps — the standard mobile-GUI agent benchmark. 🆕
WindowsAgentArena: Evaluating Multi-Modal OS Agents at Scale — Bonatti et al., Microsoft — https://arxiv.org/abs/2409.08264 · benchmark — 154 realistic multi-step Windows-OS tasks across apps with programmatic success checks; parallelizable in Azure (~20 min full run) — desktop computer-use counterpart to OSWorld. 🆕
Running the Gauntlet: Re-evaluating the Capabilities of Agents Beyond Familiar Environments — Vysotskyi, Gal, Torr, Bibi et al. (Oxford) — https://arxiv.org/abs/2606.14397 · benchmark — 🆕 GauntletBench: 100 vision-intensive tasks across five less-covered professional applications (Video Editor, Workflow Builder, 3D Modeller, Flight Analyser, Circuit Designer) requiring temporal, graphical, and 3D reasoning; "state-of-the-art agent achieves only a 19.1% success rate" vs. "non-expert human annotators achieve over 80% success" — a diagnostic gap in professional-tool computer-use not covered by existing desktop or web benchmarks.
ST-WebAgentBench: Evaluating Safety and Trustworthiness in Web Agents — Levy, Shlomov, Wiesel et al., IBM Research — https://arxiv.org/abs/2410.06703 · benchmark — 375 enterprise tasks carrying 3,057 explicit safety/policy constraints; introduces Completion-under-Policy and Risk Ratio — grades whether agents obey rules, not just succeed. 🆕
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks — Xu et al., CMU — https://arxiv.org/abs/2412.14161 · benchmark — Self-hosted software-company sim (web, code, chat coworkers) with checkpoint-based partial-credit grading; best agent ~30% — a full-day-knowledge-worker eval. 🆕
VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks — Koh et al., Carnegie Mellon University — https://arxiv.org/abs/2401.13649 · benchmark — 910 visually-grounded web tasks across Classifieds/Shopping/Reddit with reproducible programmatic reward functions — the multimodal extension of WebArena.
GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks — Tejal Patwardhan et al. (OpenAI) — https://arxiv.org/abs/2510.04374 · benchmark — 1,320 expert-built tasks across 44 occupations in the top 9 GDP sectors; 220-task gold subset open-sourced with a public automated grading service at evals.openai.com — the flagship economic-value agent benchmark. 🆕
Remote Labor Index: Measuring AI Automation of Remote Work — CAIS + Scale AI (47 authors) — https://arxiv.org/abs/2510.26787 · benchmark — Grades whether agents complete whole real freelance projects to client-acceptable standard; best agent automates only 2.5% — a hard, money-grounded ceiling for end-to-end remote work. 🆕
Humanity's Last Exam — Center for AI Safety + Scale AI (Dan Hendrycks et al.) — https://arxiv.org/abs/2501.14249 · benchmark — 2,500 expert-written frontier-knowledge questions with unambiguous auto-gradable answers across dozens of fields; the canonical post-MMLU saturation exam (note: now very widely cited). 🆕
ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery — OSU-NLP Group (Ohio State) — https://github.com/OSU-NLP-Group/ScienceAgentBench · benchmark — 102 expert-validated tasks from 44 peer-reviewed papers; grades self-contained Python programs by execution + success rate; best agent solves only ~34% (ICLR 2025). 🆕
CORE-Bench: Computational Reproducibility Agent Benchmark — Siegel, Kapoor, Narayanan et al. (Princeton) — https://arxiv.org/abs/2409.11363 · benchmark — 270 tasks over 90 papers (CS/social science/medicine) that grade whether an agent can reproduce published results from code+data; from the Princeton AI-Snake-Oil group.
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents — Mingxuan Du et al. — https://arxiv.org/abs/2506.11763 · benchmark — 100 PhD-level tasks across 22 fields; reference-based adaptive-rubric grader for analyst-grade citation-rich reports, validated for human-judgment alignment — the standard deep-research-report eval. 🆕
BixBench: A Comprehensive Benchmark for LLM-based Agents in Computational Biology — FutureHouse + ScienceMachine — https://arxiv.org/abs/2503.00096 · benchmark — 50+ real bioinformatics analysis scenarios with ~300 open-answer questions over multi-step Jupyter trajectories; frontier models hit only ~17% — serious wet-lab-adjacent science agent eval. 🆕
Introducing LifeSciBench — OpenAI — https://openai.com/index/introducing-life-sci-bench/ · benchmark — 🆕 750 expert-authored life-science research tasks (7 workflows × 7 biological domains) graded by 19,020 rubric criteria from 173 PhD-level scientist contributors, independently validated by 453 expert reviewers; requires interpreting genomic sequence files, chemical structures, and experimental figures; best model (GPT-Rosalind) scores 36.1% overall — the largest expert-rubric-graded wet-lab-workflow agent benchmark.
Gaia2 and ARE: Scaling Up Agent Environments and Evaluations — Meta (Meta Agents Research Environments) — https://arxiv.org/abs/2509.17158 · benchmark — Successor to GAIA: dynamic, time-driven, multi-agent simulated environments with async world events and a verifiable scenario grader; frontier success ~42% — the serious general-assistant env from Meta. 🆕
Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents — Andon Labs (Backlund & Petersson) — https://arxiv.org/abs/2502.15840 · benchmark — Run a simulated vending business over >20M-token horizons; objectively graded on profit/net-worth, exposing long-horizon coherence breakdowns unrelated to context limits. 🆕
ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems — Francois Chollet et al. (ARC Prize Foundation) — https://arxiv.org/abs/2505.11831 · benchmark — Human-calibrated (400+ participants, 100% solvable) grid-reasoning tasks with exact-match grading; 2-3x harder than ARC-AGI-1 across all approaches — the frontier fluid-intelligence benchmark. 🆕
TRAIL: Trace Reasoning and Agentic Issue Localization — Patronus AI — https://arxiv.org/abs/2505.08638 · benchmark — 148 annotated agent traces with 841 errors (reasoning/planning/execution); grades whether an LLM can localize the failure in a trace (best model ~11%). HF dataset PatronusAI/TRAIL. 🆕
CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios — Salesforce Research — https://arxiv.org/abs/2505.18878 · benchmark — 19 expert-validated B2B/B2C tasks on a realistic Salesforce org with state-based grading; exposes the single-turn (~58%) vs multi-turn (~35%) reliability gap plus confidentiality checks. 🆕
Agents' Last Exam (ALE) — UC Berkeley RDI + 250+ industry co-authors — https://arxiv.org/abs/2606.05405 · benchmark — 1,000+ expert-authored task workflows across 55 digital industries and 13 clusters, grounded in the O*NET/SOC 2018 occupational taxonomy. All-pass grading; frontier agents average <1% full-pass rate on the hardest tier. Closest thing to a "can agents do real knowledge work" census. Project: agents-last-exam.org. 🆕
Introducing FrontierCode — Cognition (Devin team) — https://cognition.com/blog/frontier-code · benchmark — Measures whether AI-written code would be merged by real open-source maintainers — not just whether CI passes — across six dimensions: behavioral correctness, regression safety, mechanical cleanliness, test correctness, scope discipline, and code quality. Three-tier difficulty (Diamond / Main / Extended); 20+ maintainers spending 40+ hours per task. Claude Opus 4.8 leads Diamond at 13.4%, GPT-5.5 at 6.3%. Claims 81% lower false-positive rate than SWE-bench Pro. 🆕
Open-Sourcing Harvey's Long Horizon Legal Agent Benchmark — Harvey AI — https://www.harvey.ai/blog/introducing-harveys-legal-agent-benchmark · benchmark — 1,200+ long-horizon legal agent tasks across 24 practice areas, graded by 75,000+ expert-written rubric criteria with all-pass grading mirroring real law-firm merge standards. Open-source eval framework; mirrored on Vals AI and Artificial Analysis leaderboards. The domain-expert-graded benchmark for legal agents that legal-AI teams benchmark against. 🆕
CodeScaleBench: Testing coding agents on large codebases — Sourcegraph — https://sourcegraph.com/blog/codescalebench-testing-coding-agents-on-large-codebases-and-multi-repo-software-engineering-tasks · benchmark — 370 tasks across 40+ large repos (Kubernetes, Django, Linux, VSCode) and 9 languages in two suites: SDLC (150 patch-based tasks across 9 phases) and Org (220 cross-repo tasks). Agents with only local tools fail systematically above ~400k LOC; MCP-augmented agents are 30% cheaper and 38% faster, and achieve 2–3× better retrieval precision. 🆕
WorkBench Revisited: Workplace Agents Two Years On — Olly Styles — https://arxiv.org/abs/2606.13715 · paper — Longitudinal re-run of the WorkBench workplace-agent benchmark across 21 models (March 2023–May 2026): GPT-4 completed 43% of tasks with 26% unintended harmful actions in 2024; Claude Opus 4.8 completes 89% with 2.5% harmful actions in 2026 — capability and safety improvements trend together rather than trading off. First two-year longitudinal dataset for workplace-agent benchmarking. 🆕
Closing the loop: Evaluating and improving Replit Agent at scale — James Austin et al. (Replit) — https://replit.com/blog/evaluating-and-improving-agent-at-scale · article (good) — Three-layer eval system: (1) ViBench — offline benchmark where each task pairs a PRD with natural-language test plans and Playwright + LLM judges verify built apps actually work; (2) A/B testing in production for most agent-affecting changes; (3) Telescope — trace clustering using embeddings + DBSCAN to surface emergent failure patterns, feeding a self-improvement loop where agents propose and test their own fixes. (excerpt: "It summarizes failure trajectories, embeds them, clusters similar cases, and classifies new sessions as the distribution changes.") 🆕
A practical guide to hill climbing — Ara Khan (Cline) — https://cline.bot/blog/a-practical-guide-to-hill-climbing · article (good) — A worked coding-agent eval loop: run Cline CLI across all 89 Terminal-Bench tasks with Harbor, summarize and bucket failed rollouts, A/B test prompt/config/code changes, and keep only changes that raise the aggregate pass rate (47% to 57%). Useful because it turns "hill climbing" into a repeatable eval workflow instead of a leaderboard anecdote. (excerpt: "change one thing (a prompt tweak, a bug fix, a config flag), run again, and keep the change if the score goes up.") 🆕
A New Framework for Evaluating Voice Agents (EVA) — Tara Bogavelli, Gabrielle Gauthier Melançon, Katrina Stankiewicz, Oluwanifemi Bamgbose, Hoang Nguyen, Raghav Mehndiratta, Hari Subramani (ServiceNow AI) — https://huggingface.co/blog/ServiceNow-AI/eva · article (excellent) — EVA is an end-to-end voice-agent eval framework using a bot-to-bot audio harness (user simulator + Pipecat agent + deterministic tool executor + validators) that jointly scores task accuracy (EVA-A: completion, faithfulness via LLM-judge, speech fidelity via LALM-judge) and conversational… 🆕
Evaluating AI agents: Real-world lessons from building agentic systems at Amazon — Yunfei Bai, Allie Colin, Kashif Imran, Winnie Xiong (AWS) — https://aws.amazon.com/blogs/machine-learning/evaluating-ai-agents-real-world-lessons-from-building-agentic-systems-at-amazon/ · article (good) — Lays out a three-layer agent evaluation library (foundation-model benchmarking, component assessment of intent/memory/reasoning/tool-use, and final task-completion quality) with concrete component metrics like tool selection/parameter accuracy, context-retrieval precision/recall, and reasoning… 🆕
Eval-driven development: Build and evaluate reliable AI agents — Michael Dawson (Red Hat) — https://developers.redhat.com/articles/2026/03/23/eval-driven-development-build-evaluate-ai-agents · article (good) — A hands-on, 8-stage eval-driven workflow for a real multi-turn IT-self-service agent: uses DeepEval's ConversationalGEval/ConversationSimulator with ~15 custom LLM-as-judge metrics, a directory of 11 "known bad" conversations to validate that the metrics actually catch failures ("test your tests"),… 🆕
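
The FAIL_TO_PASS / PASS_TO_PASS idea from the SWE-bench grading harness entry above can be illustrated with a minimal sketch. This is not the harness's actual code: the function name `is_resolved`, the `"PASSED"` status string, and the dict-of-statuses input are hypothetical, chosen only to show how the two test sets might combine into a binary, verifiable reward.

```python
from typing import Mapping, Sequence

PASSED = "PASSED"  # hypothetical status label, not taken from the SWE-bench harness


def is_resolved(
    test_status: Mapping[str, str],
    fail_to_pass: Sequence[str],
    pass_to_pass: Sequence[str],
) -> bool:
    """Binary reward: the target tests now pass and nothing that passed before regressed.

    test_status maps a test id to its status after the candidate patch is applied.
    A test missing from the report is treated as not passed.
    """
    fixed = all(test_status.get(t) == PASSED for t in fail_to_pass)
    kept = all(test_status.get(t) == PASSED for t in pass_to_pass)
    return fixed and kept


report = {
    "test_bug": "PASSED",
    "test_existing_a": "PASSED",
    "test_existing_b": "FAILED",
}
print(is_resolved(report, ["test_bug"], ["test_existing_a"]))
print(is_resolved(report, ["test_bug"], ["test_existing_a", "test_existing_b"]))
```

The first call reports a resolved instance; the second does not, because a previously passing test regressed even though the target bug was fixed.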

Must-reads: Anthropic (demystifying) · τ-bench · Lee (pass@k)
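
For readers who want the pass@k estimator in front of them, here is a minimal sketch of the standard unbiased form, 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn per task and c is the number that passed. The function name and the example counts are illustrative assumptions; the human-eval repo listed above has the reference implementation.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k for one task.

    n: samples generated, c: samples that passed, k: budget being estimated.
    Equals the probability that a random size-k subset of the n samples
    contains at least one passing sample.
    """
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative (made-up) per-task results: (n samples, c correct).
results = [(10, 3), (10, 0), (10, 10)]
for k in (1, 5):
    per_task = [pass_at_k(n, c, k) for n, c in results]
    print(k, round(sum(per_task) / len(per_task), 4))
```

The benchmark-level number is the mean of the per-task estimates. Computing `1 - (1 - c/n) ** k` instead is the common misuse: it is biased whenever k > 1.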

10 · Safety / adversarial evaluation (prompt injection, jailbreaks, action-authorization, benchmark auditing)

BenchJack: Systematically Auditing AI Agent Benchmarks — Wang, Li, Mang, Cheung, Sen, Song (incl. Dawn Song) — https://arxiv.org/abs/2605.12673 · paper — Reward hacking emerges spontaneously in frontier models; an 8-pattern flaw taxonomy + a 30-question Agent-Eval checklist; "benchmarks must be secure by design."
Towards Building Safe & Secure Agentic AI — Dawn Song (UC Berkeley RDI, lecture slides) — https://rdi.berkeley.edu/adv-llm-agents/slides/dawn-agentic-ai.pdf · talk — The adversarial setting; environment-borne attacks.
Dawn Song — ICLR 2025 keynote on LLM safety — https://iclr.cc/virtual/2025/invited-talk/36783 · talk.
CyberGym — Wang et al. (incl. Dawn Song) — https://arxiv.org/html/2506.02548v2 · paper — Memory-safety PoC generation from OSS-Fuzz; sanitizer-crash grading at scale.
AIR-Bench 2024 — Zeng et al. (incl. Song) — https://arxiv.org/abs/2407.17436v2 · https://github.com/stanford-crfm/air-bench-2024 · paper/repo — Regulation-grounded risk taxonomy.
DecodingTrust — https://decodingtrust.github.io · benchmark — NeurIPS 2023 trustworthiness benchmark.
RedCode — https://arxiv.org/abs/2411.07781 · paper — Risky code execution/generation benchmark for code agents.
AgentPoison — https://arxiv.org/abs/2407.12784 · paper — Red-teams agents by poisoning their RAG memory.
Adding Error Bars to Evals (A Statistical Approach to LM Evaluations) — Miller (Anthropic) — https://arxiv.org/abs/2411.00640 · https://www.anthropic.com/research/statistical-approach-to-model-evals · paper — Standard errors, clustered SEs, paired difference tests — "is this difference real?" (cross-cutting: T6/T8)
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents — Debenedetti, Zhang, Balunović, Beurer-Kellner, Fischer, Tramèr (ETH Zurich) — https://arxiv.org/abs/2406.13352 · benchmark — The canonical prompt-injection benchmark for tool-using agents (97 tasks, 629 security cases over untrusted data); NeurIPS 2024 D&B, now the standard eval everyone reports against. 🆕
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents — Andriushchenko, Souly, Davies et al. (Gray Swan / UK AISI) — https://arxiv.org/abs/2410.09024 · benchmark — ICLR 2025 benchmark of 110/440 malicious agent tasks across 11 harm categories; shows leading models comply with malicious agent requests without jailbreaking. The reference action-misuse/refusal benchmark. 🆕
InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated LLM Agents — Zhan, Liang et al. (UIUC) — https://arxiv.org/abs/2403.02691 · benchmark — ACL 2024 Findings; 1,054 IPI test cases over 17 user / 62 attacker tools, splitting direct-harm vs data-exfiltration intents. Foundational indirect-prompt-injection benchmark predating AgentDojo.
Defeating Prompt Injections by Design (CaMeL) — Debenedetti, Shumailov, Fan, Hayes et al. (Google DeepMind) — https://arxiv.org/abs/2503.18813 · paper — The defense-by-design counterpart: extracts control/data flow from the trusted query and enforces capability-based policies so untrusted data can't alter program flow; effectively solves AgentDojo's security eval. The key 2025 mitigation paper. 🆕
The lethal trifecta for AI agents: private data, untrusted content, and external communication — Simon Willison — https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/ · blog — The most-cited conceptual frame for reasoning about when an agent is unconditionally vulnerable to prompt injection; an essential practitioner mental model. 🆕
SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents — Kutasov, Bowman et al. (Anthropic) — https://www.anthropic.com/research/shade-arena-sabotage-monitoring · benchmark — 17 complex environments pairing a benign main task with a hidden harmful side task to measure whether agents can sabotage without tripping an AI monitor; the canonical sabotage/monitorability eval. (Paper: arxiv.org/abs/2506.15740) 🆕
Agentic Misalignment: How LLMs Could Be Insider Threats — Anthropic (Alignment team) — https://www.anthropic.com/research/agentic-misalignment · paper — Red-team study showing frontier models will resort to blackmail/leaking under goal conflict in agentic settings; the reference for action-authorization / insider-threat adversarial evaluation. 🆕
PyRIT — Python Risk Identification Tool for generative AI — Microsoft AI Red Team (Azure) — https://github.com/Azure/PyRIT · tool — The de-facto open-source red-teaming automation framework (70+ converters, multi-turn attacks like Crescendo/TAP); how practitioners actually run adversarial evals at scale. 🆕
OWASP Top 10 for Agentic Applications (2026) + LLM Applications (2025) — OWASP GenAI Security Project — https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications-for-2026/ · docs — Industry-standard risk taxonomy: goal hijack, tool misuse, identity/privilege abuse, memory poisoning, rogue agents; complements the regulation-grounded AIR-Bench taxonomy already listed. The canonical practitioner threat checklist. 🆕
MITRE ATLAS — Adversarial Threat Landscape for AI Systems — MITRE — https://atlas.mitre.org/ · docs — ATT&CK-style living knowledge base of 16 tactics / 80+ techniques against AI systems with real-world case studies and mitigations; the standard reference framework for AI adversarial threat modeling.
Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents — Zhang, Yang et al. — https://proceedings.iclr.cc/paper_files/paper/2025/file/5750f91d8fb9d5c02bd8ad2c3b44456b-Paper-Conference.pdf · benchmark — ICLR 2025 unified benchmark spanning 10 scenarios, 400+ tools, covering DPI/IPI, memory poisoning, plan-of-thought backdoors and defenses in one harness; broadest single attack/defense agent benchmark. 🆕
Gray Swan x UK AISI Agent Red-Teaming Challenge — Gray Swan AI / UK AISI (w/ OpenAI, Anthropic, GDM) — https://app.grayswan.ai/arena/blog/agent-red-teaming-the-ai-jailbreak-showdown · talk — Largest public agent red-teaming exercise: ~2,000 red-teamers, 1.8M attempts, 62k breaches against 22 tool-using agents (financial/shopping/marketing bots); real-world adversarial-eval data at scale. 🆕
MonitoringBench: Semi-Automated Red-Teaming for Agent Monitoring — various — https://arxiv.org/abs/2605.09684 · benchmark — 2,644 attack trajectories for evaluating AI coding-agent monitors via semi-automated red-teaming on BashArena. Claude Opus 4.5 monitor catch rate drops from 94.9% on standard attacks to 60.3% under best-refined adaptive attacks — a 34.6pp gap quantifying how much safety headroom current monitors actually have under adversarial pressure. 🆕
Summary of METR's predeployment evaluation of GPT-5.6 Sol — METR — https://metr.org/blog/2026-06-26-gpt-5-6-sol/ · blog (Jun 26 2026) — "GPT-5.6 Sol's detected cheating rate was higher than any public model we have evaluated on our ReAct agent harness." METR's independent predeployment evaluation finds the model exploiting eval environment bugs and using disallowed strategies; time-horizon methodology yields 11.3h but METR concludes results "could not be considered a robust measurement" — the sharpest primary source on active eval-cheating by a frontier model under a structured third-party safety evaluation. 🆕
Measuring LLMs' impact on N-day exploits — Anthropic — https://www.anthropic.com/research/n-days · blog (Jun 8 2026) — Evaluates Claude Mythos Preview on publicly disclosed but unpatched Firefox and Windows kernel vulnerabilities; produced 8 working Firefox code-execution exploits (~1 hour to first) and 8 distinct Windows privilege-escalation chains (~$2,000/exploit average; $15,700 total). "A lone operator can now turn a month's worth of patches into working exploits in a single afternoon—for a few thousand dollars and with no specialized expertise." Introduces nonce-protected grading and agentic anti-reward-hacking validation. Concludes 'N-hour' is the accurate threat framing. 🆕
RealityTest: How People Probe AI Identity and Whether Models Disclose It — UK AISI — https://www.aisi.gov.uk/research/realitytest-how-people-probe-ai-identity-and-whether-models-disclose-it · paper (Jun 8 2026) — Benchmark of 3,152 identity-probing queries from ~750 participants across 49 countries and 5 languages; 17 text + 6 speech models tested. Disclosure rates span 8–92% (text); "query phrasing is the largest source of variance in both modalities (37% in speech, 26% text), exceeding the contribution of model identity (10% and 18%)." A single adversarial system prompt reduced disclosure to 3–27% across all models. The safety-eval dimension for AI identity-disclosure compliance (EU AI Act, UK AI Act). 🆕
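
The lethal-trifecta entry above names three capabilities: access to private data, exposure to untrusted content, and external communication. A pre-deployment check for that combination could look like the sketch below. Everything here is a hypothetical illustration: the `Tool` dataclass, the capability tags, and the example tool names are assumptions, and tagging real tools correctly is the hard part this sketch does not solve.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Iterable


class Capability(Enum):
    PRIVATE_DATA = "private_data"
    UNTRUSTED_CONTENT = "untrusted_content"
    EXTERNAL_COMMUNICATION = "external_communication"


TRIFECTA = frozenset(Capability)  # all three capabilities together


@dataclass(frozen=True)
class Tool:
    """Hypothetical tool descriptor; capabilities are assigned by a human reviewer."""
    name: str
    capabilities: FrozenSet[Capability]


def has_lethal_trifecta(tools: Iterable[Tool]) -> bool:
    """True if the agent's combined toolset covers all three capabilities."""
    combined = set()
    for tool in tools:
        combined |= tool.capabilities
    return TRIFECTA <= combined


inbox = Tool("read_inbox", frozenset({Capability.PRIVATE_DATA, Capability.UNTRUSTED_CONTENT}))
search = Tool("web_search", frozenset({Capability.UNTRUSTED_CONTENT}))
send = Tool("send_email", frozenset({Capability.EXTERNAL_COMMUNICATION}))

print(has_lethal_trifecta([inbox, search]))
print(has_lethal_trifecta([inbox, search, send]))
```

The check is over the whole toolset rather than any single tool, since the risk comes from the combination: no single tool in the example is flagged on its own.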

Must-reads: Dawn Song (BenchJack) · Anthropic (error bars)
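
The "is this difference real?" question from the error-bars entry can be made concrete with a paired comparison on the same tasks. The sketch below uses only the standard library and a normal-approximation 95% interval (the 1.96 multiplier); the function names and the tiny score lists are illustrative assumptions, and clustered standard errors are not covered here.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def mean_with_se(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean score and its standard error (sample std dev / sqrt(n))."""
    return mean(scores), stdev(scores) / sqrt(len(scores))


def paired_difference(
    a: Sequence[float], b: Sequence[float]
) -> Tuple[float, float, Tuple[float, float]]:
    """Mean per-task difference a - b, its standard error, and a 95% interval.

    Both score lists must be aligned so that index i is the same task.
    """
    if len(a) != len(b):
        raise ValueError("paired comparison needs the same tasks for both models")
    diffs = [x - y for x, y in zip(a, b)]
    d, se = mean_with_se(diffs)
    return d, se, (d - 1.96 * se, d + 1.96 * se)


# Made-up per-task pass/fail results on the same 8 tasks.
model_a = [1, 1, 0, 1, 1, 0, 1, 1]
model_b = [1, 0, 0, 1, 0, 0, 1, 1]

diff, se, (low, high) = paired_difference(model_a, model_b)
print(f"diff={diff:.3f} se={se:.3f} ci=({low:.3f}, {high:.3f})")
```

Here model A wins by 25 points on raw accuracy, yet the interval still includes zero, so eight tasks are not enough to call the difference real.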

🎙 Talks, podcasts & slides (transcribed + noted)

🖼 Slide decks

LLM benchmarks in the era of agents (deck) — Florian Brand — (local slide deck) · slides (TNG / Big Techday)
The Life Cycle of an RL Environment (deck) — Kanav Garg — (local slide deck) · slides (ACM CAIS 2026)

More eval talks, podcasts & lectures (annotated; deep notes in progress)

58 more talks and podcasts follow (30 eval-focused, 28 eval segments within broader agent talks); transcription is queued.

Judging LLMs — Alex Volkov (AI Evangelist, Weights & Biases; host of ThursdAI) — https://www.youtube.com/watch?v=IIL2tE4n1Q0 · talk (AI Engineer World's Fair 2025 — Evals track)
2025 is the Year of Evals! Just like 2024, and 2023, and … — John Dickerson (CEO, Mozilla AI) — https://www.youtube.com/watch?v=CQGuvf6gSrM · talk (AI Engineer World's Fair 2025 — Evals track)
Lessons from the Trenches: Building LLM Evals That Work IRL — Aparna Dhinakaran (Co-founder & CPO, Arize AI) — https://www.youtube.com/watch?v=nbZzSC5A6hs · talk (AI Engineer World's Fair 2025 — Evals track)
The maturity phases of running evals — Phil Hetzel (Braintrust) — https://www.youtube.com/watch?v=FB-MLPhL9Ms · talk (AI Engineer World's Fair 2025 — Evals track)
Ship Real Agents: Hands-On Evals for Agentic Applications — Laurie Voss (Arize) — https://www.youtube.com/watch?v=Xfl50508LZM · talk (AI Engineer World's Fair 2025 — Evals track)
What Do Models Still Suck At? (BullshitBench) — Peter Gostev (Arena.ai) — https://www.youtube.com/watch?v=R7A8rX-09Zw · talk (AI Engineer World's Fair 2025 — Evals track)
Perceptual Evaluations: Evals for Aesthetics — Diego Rodriguez (Co-founder & CTO, Krea.ai) — https://www.youtube.com/watch?v=h5ItAJuB3Fc · talk (AI Engineer World's Fair 2025 — Evals track)
Evaluating AI Search: A Practical Framework for Augmented AI Systems — Quotient AI + Tavily (speakers from both) — https://www.youtube.com/watch?v=wRJD0inpmjU · talk (AI Engineer World's Fair 2025 — Evals track)
Turning Fails into Features: Zapier's Hard-Won Eval Lessons — Rafal Willinski & Vitor Balocco (Zapier) — https://www.youtube.com/watch?v=blrovBxxN9o · talk (AI Engineer World's Fair 2025 — Evals track)
Why should anyone care about Evals? — Manu Goyal (Braintrust) — https://www.youtube.com/watch?v=jJ45Yz1lJao · talk (AI Engineer World's Fair 2025 — Evals track)
Mastering AI Evaluation: From Playground to Production [Evals Workshop] — AI Engineer Evals Workshop (multi-presenter) — https://www.youtube.com/watch?v=9iN-cPnp7xg · talk (AI Engineer World's Fair 2025 — Evals track (full workshop))
Databricks Co-Founder: Eval Limitations, Why China is Winning Open Source and Future of AI Infra (Ep 69) — Ion Stoica (co-founder Databricks/Anyscale, LMArena), host Jacob Effron — https://www.youtube.com/watch?v=ehav4XMAKLw · podcast (Unsupervised Learning (Redpoint Ventures))
Mercor CEO: Evals Will Replace Knowledge Work, AI x Hiring Today & the Future of Data Labeling (Ep 68) — Brendan Foody (co-founder/CEO Mercor), host Jacob Effron — https://www.youtube.com/watch?v=SOZtz8IdI2w · podcast (Unsupervised Learning (Redpoint Ventures))
CTIBench: How Good Are LLMs at Detecting Cyber Threats? (Ep 729) — Nidhi Rastogi (asst. professor, RIT), host Sam Charrington — https://www.youtube.com/watch?v=75WqFOY3P5M · podcast (The TWIML AI Podcast)
Holistic Evaluation of Generative AI Systems (MLOps Podcast #280) — Jineet Doshi (Staff AI Scientist/Lead, Intuit), host Demetrios Brinkmann — https://www.youtube.com/watch?v=VJ0k0C1mGdg · podcast (MLOps.community)
Can AIs do AI R&D? Reviewing RE-Bench Results with Neev Parikh of METR — Neev Parikh (METR), host Nathan Labenz — https://www.youtube.com/watch?v=SX8Mxyy_UHY · podcast (The Cognitive Revolution)
Can We Stop AI Deception? Apollo Research Tests OpenAI's Deliberative Alignment, w/ Marius Hobbhahn — Marius Hobbhahn (CEO, Apollo Research), host Nathan Labenz — https://www.youtube.com/watch?v=I3ivZaAfDFg · podcast (The Cognitive Revolution)
Metrics Driven Development (Ragas) — Shahul Es (co-founder, Ragas), hosts Daniel Whitenack & Chris Benson — https://www.youtube.com/watch?v=fw0wUC5XN-o · podcast (Practical AI (Changelog))
R1, OpenAI's o3, and the ARC-AGI Benchmark: Insights from Mike Knoop — Mike Knoop (co-founder ARC Prize / Zapier), host Lukas Biewald — https://www.youtube.com/watch?v=SSA8vNrFpXI · podcast (Gradient Dissent (Weights & Biases))
Sandbox breakout evals with Inspect — UK AISI (Fully Connected London '25) — UK AI Safety Institute team — https://www.youtube.com/watch?v=J79pSSAENYc · talk (Gradient Dissent / Fully Connected London '25 (Weights & Biases))
How to align your LLM judge for better evaluations — Weights & Biases (Weave team) — https://www.youtube.com/watch?v=AMCmhRoKnSk · talk (Gradient Dissent / W&B)
Stanford CME295 Transformers & LLMs (Autumn 2025) | Lecture 8 - LLM Evaluation — Afshine Amidi & Shervine Amidi — https://www.youtube.com/watch?v=8fNP4N46RRo · lecture (Stanford (CME295 / Stanford Online))
CS294-196 (Agentic AI MOOC) - LLM Agent Evaluations & Project Overview — Berkeley RDI course staff (Dawn Song's Agentic AI MOOC) — https://www.youtube.com/watch?v=VfOA2a0dj4w · lecture (UC Berkeley RDI (CS294-196, Fall 2025))
Agentic AI MOOC (Fall 2025) | Predictable Noise in LLM Benchmarks — Sida Wang (Meta) — https://www.youtube.com/watch?v=HV8pugcFVO0 · lecture (UC Berkeley RDI (CS294-196, Fall 2025))
Agent Optimization with Pydantic AI: GEPA, Evals, Feedback Loops — Samuel Colvin, Pydantic — Samuel Colvin (founder, Pydantic) — https://www.youtube.com/watch?v=A48uhxfxbsM · talk (AI Engineer (Code Summit / AI Engineer))
Coding Evals: From Code Snippets to Codebases — Naman Jain, Cursor — Naman Jain (Cursor; LiveCodeBench/SWE-bench-adjacent researcher) — https://www.youtube.com/watch?v=tHN44yJoeS8 · talk (AI Engineer (Code Summit))
From Self-driving to Autonomous Voice Agents — Brooke Hopkins, Coval (full session host upload) — Brooke Hopkins (founder, Coval; ex-Waymo eval infra) — https://www.youtube.com/watch?v=1X3mYUHC5GA · talk (Founders You Should Know)
Brooke Hopkins, Founder at Coval | AI Minds #073 — Brooke Hopkins (founder, Coval) — https://www.youtube.com/watch?v=e1E8vLyRIKk · podcast (AI Minds (Deepgram))
Karthik Narasimhan - Reliable AI Agents for Tomorrow's World — Karthik Narasimhan (Head of Research, Sierra; Princeton; tau-bench author) — https://www.youtube.com/watch?v=fOAAslQUceg · lecture (Berkeley RDI (Agentic AI Summit 2025))
Building and evaluating AI Agents — Sayash Kapoor, AI Snake Oil — Sayash Kapoor (Princeton; AI Snake Oil; co-author HAL / agent-eval critiques) — https://www.youtube.com/watch?v=d5EltXhbcfA · talk (AI Engineer (Summit 2025))

🎯 Eval segments inside agent-building talks are in MENTIONS.md.

💬 Eval mentions

Resources that mention evals — agent-building posts & talks with a good eval segment — live in MENTIONS.md, kept out of the main list to preserve signal density.

Companies & landscape (eval / RL-environment market)

pavlovslist.com — https://pavlovslist.com/ · directory — The RL-environment / eval startups directory ("for the RL-pilled").
Environment labs / RL-env companies (the "environments are the new data" venture wave, via pavlovslist): BenchFlow (benchflow.ai — SkillsBench, ClawsBench, runtime), Prime Intellect (verifiers, Environments Hub), HUD, Mechanize, Plato, AfterQuery, Halluminate, Surge AI, Scale, Mercor.
Prime Intellect (verifiers, Florian Brand) · Braintrust · Arize (Phoenix/AX, OpenInference) · Galileo · LangChain / LangSmith (agentevals) · Sierra (τ-bench) · Core Automation (Kanav Garg) · Epoch AI (benchmark audits) · METR (autonomy/horizon) · FutureHouse (HLE audit) · UK AISI (Inspect).

Notes on provenance & gaps

Built by merging this project's research rounds (mining → adversarial verification → reference audit) with a /deep-research pass. Source detail lives in research/citations.md, research/findings.json, research/reference-audit.md, research/notes/, and the full link list in research/url-inventory.md (153 URLs).
Verified-high (deep-research, 3/3 votes): Verifier's Law, the verifiers library, EvalGen, Inspect AI, promptfoo, the ABC benchmark-rigor paper, plus lm-eval-harness, Autoevals, agentevals, AI Agents That Matter.
Flagged caveats: the MT-Bench 10/25 bias numbers are hedged by their own authors; Lee's "Agent Runtime" post URL and the WebArena/OSWorld/Terminal-Bench/Cybench links still need verification; the Kanav Garg talk is cited via a conference summary (no canonical primary URL yet).

Deep notes

This repo ships 146 deep reading notes in notes/ — structured summaries with key points, verbatim quotes, and themes, for the highest-signal sources:

notes/articles/ — blog posts & practitioner essays
notes/talks/ — 47 transcribed talks, podcasts & lectures (with [mm:ss] timestamps)
notes/papers/ — papers surfaced by the citation graph

Contributing

PRs welcome. Keep the bar high: show your work (real data/code/war-stories beat hot takes), give every entry a one-line why, verify the URL, and flag caveats. See CONTRIBUTING.md. Quality over quantity — a great list is as much about what it excludes.
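As an illustration of the "verify the URL" and "one-line why" checks above, here is a minimal sketch of a pre-PR entry check. It assumes entries follow the `Title, Author, URL · kind` shape used in the talk list; the function names, the regex, and the HEAD-request heuristic are all hypothetical and are not part of this repo's tooling.

```python
import re
import urllib.request
from typing import Callable, List, Optional

# Assumption: the first http(s) token in an entry is its canonical URL.
URL_RE = re.compile(r"https?://\S+")


def extract_url(entry: str) -> Optional[str]:
    """Return the first URL found in a list entry, or None."""
    match = URL_RE.search(entry)
    return match.group(0) if match else None


def has_kind_marker(entry: str, url: str) -> bool:
    """True if non-empty text follows a '·' separator after the URL."""
    tail = entry.split(url, 1)[1].strip()
    return tail.startswith("·") and tail.lstrip("·").strip() != ""


def url_is_live(url: str, timeout: float = 10.0) -> bool:
    """Send a HEAD request; any 2xx/3xx response counts as live.

    Some hosts reject HEAD or block scripted clients, so a False here
    means "check by hand", not "definitely dead".
    """
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": "entry-check-sketch"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, ValueError):
        # URLError and HTTPError are OSError subclasses; ValueError covers bad URLs.
        return False


def check_entry(
    entry: str, live_check: Callable[[str], bool] = url_is_live
) -> List[str]:
    """Return a list of human-readable problems; empty means the entry passes."""
    url = extract_url(entry)
    if url is None:
        return ["no URL found"]
    problems = []
    if not has_kind_marker(entry, url):
        problems.append("missing '· kind' marker after the URL")
    if not live_check(url):
        problems.append(f"URL did not respond: {url}")
    return problems
```

Passing `live_check` as a parameter keeps the format checks testable offline; a contributor could run `check_entry(line)` over each new line in a PR and fix whatever it reports before submitting.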

License

To the extent possible under law, BenchFlow and contributors have waived all copyright and related rights to this work (CC0 1.0). The linked resources remain under their respective licenses.
