benchflow-ai/awesome-evals

Awesome Agent Evals

A curated, opinionated, non-BS library of the best resources for building and evaluating AI agents — papers, blog posts, talks, courses, tools, and benchmarks.

Maintained by BenchFlow · Join our Discord

Most "awesome" lists are link dumps. This one is annotated and verified: every entry says what it is and why it belongs, URLs are checked, quotes are verbatim, and dead/abandoned tools are pruned (not silently listed). It was assembled by:

  • a depth-4 recursive citation crawl (11.6k papers, ranked by in-degree) to surface the academic canon,
  • targeted practitioner-web discovery for the industry sources citation graphs miss (Eugene Yan, Han-Chung Lee, Hamel Husain, Shreya Shankar, Nathan Lambert, …),
  • 47 talks & podcasts transcribed and deep-noted (verbatim + timestamps), and
  • per-section gap audits with adversarial verification.

443+ curated links · 146 deep reading notes (see notes/). Markers: 🆕 = released/updated 2025–2026 · ⚠️ = caveat. Contributions welcome — see CONTRIBUTING.

📘 Playbook: PATTERNS.md — real, runnable code + worked examples for LLM-as-judge (aligned to humans), pass@k/pass^k, error analysis, trajectory & world-state grading, CI gating, verifiable rewards, and more.


⭐ Must-read starter set (read these first)

  1. The Second Half — Shunyu Yao — https://ysymyth.github.io/The-Second-Half/ · blog — "Evaluation becomes more important than training." The field-level why.
  2. An LLM-as-Judge Won't Save the Product, Fixing Your Process Will — Eugene Yan — https://eugeneyan.com/writing/eval-process/ · blog — Process over tooling; evals as the scientific method.
  3. Hidden Technical Debt: Agent Evaluation Infrastructure — Han-Chung Lee — https://leehanchung.github.io/blogs/2026/06/13/hidden-technical-debt-agent-evaluation-infra/ · blog — Control/data plane, the five eval surfaces, state deltas. "Chat eval was a spreadsheet; agent eval is a system."
  4. LLM Evals FAQ — Hamel Husain & Shreya Shankar — https://hamel.dev/blog/posts/evals-faq/ · blog — The densest operational Q&A: error analysis, binary judgments, the benevolent-dictator labeler.
  5. Asymmetry of Verification and Verifier's Law — Jason Wei — https://www.jasonwei.net/blog/asymmetry-of-verification-and-verifiers-law · blog — "Ability to verify == ability to create an RL environment."
  6. Demystifying Evals for AI Agents — Anthropic — https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents · blog — Best primary on agent-specific evals: task design, outcome vs trajectory, isolated trials, pass@k vs pass^k.
  7. How to Build Good Language Modeling Benchmarks — Ofir Press — https://ofir.io/How-to-Build-Good-Language-Modeling-Benchmarks/ · blog — Natural / auto-evaluatable / challenging; the "-200%" difficulty target; ~1-yr saturation.
  8. AI Agents That Matter — Kapoor, Stroebl, Siegel, Nadgir, Narayanan — https://arxiv.org/abs/2407.01502 · paper — Cost as a first-class metric; model-dev vs app-dev; missing holdouts breed overfitting.
  9. Building on Evaluation Quicksand — Nathan Lambert — https://www.interconnects.ai/p/building-on-evaluation-quicksand · blog — LLM eval has no ground truth; contamination; eval↔training coupling.
  10. Who Validates the Validators? (EvalGen) — Shankar, Zamfirescu-Pereira, Hartmann, Parameswaran, Arawjo (UIST '24) — https://arxiv.org/abs/2404.12272 · paper — "Criteria drift": you can't write the rubric before you grade.
  11. Benches 2026 — "LLM benchmarks in the era of agents" — Florian Brand (Prime Intellect) — https://florianbrand.com/posts/benches-2026 · blog + 61-slide talk — The sharpest current read on why benchmarks break in the agent era: the "evals are dead, just measure vibes" backlash, how every layer of the eval-running stack (prompt · sampling temp · grader · harness) swings the score, and that benchmark ground truth is frequently wrong.
  12. A Shared Playbook for Trustworthy Third-Party Evaluations — OpenAI — https://openai.com/index/trustworthy-third-party-evaluations-foundations/ · blog (Safety, May 2026) — What makes independent evals of frontier-model safeguards & capabilities trustworthy: harness selection, the validity hazards that distort results, and the standards third-party evaluators need.
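Several entries above contrast pass@k with pass^k. As a minimal sketch of that distinction, assuming the commonly used combinatorial estimators over n recorded trials with c successes (the function names `pass_at_k` and `pass_hat_k` are illustrative, not taken from any listed library): pass@k estimates the chance that at least one of k sampled trials succeeds, while pass^k estimates the chance that all k succeed.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # Guard against inputs for which the estimators are undefined.
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k trials passes) from n trials, c passes."""
    _check(n, c, k)
    # comb(n - c, k) is 0 when fewer than k trials failed.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k trials pass) from n trials, c passes (pass^k)."""
    _check(n, c, k)
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical task: 10 isolated trials, 7 of them passed.
    print(round(pass_at_k(10, 7, 3), 3))   # 0.992
    print(round(pass_hat_k(10, 7, 3), 3))  # 0.292
```

The same agent looks near-perfect under pass@3 and unreliable under pass^3, which is why the two metrics answer different questions: "can it ever do this?" versus "can it do this every time?".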

1 · Why we need evals

Must-reads: Yao · Yan (eval-process) · Hamel (field-guide)

2 · "If you can eval it, you have built it" — eval ⇄ capability ⇄ RL environment

Must-reads: Wei · Lee (RL-env taxonomy)

3 · The model / harness / skill decomposition

Must-reads: Lee (harness) · Brand (Quo vadis)

4 · Observability & the output / eval space (the surfaces you can grade)

Must-reads: Lee (eval infra) · Braintrust (three pillars)

5 · Evaluation infrastructure (the eval stack: datasets, scorers, online/offline, tracing, CI)

(All repos URL-verified via GitHub API, Jun 2026. 🆕 = released/expanded 2025–2026. ⚠️ = caveat/discontinued.)

5a · Eval frameworks & harnesses (code-first test-runners)

5b · TypeScript/JS-native eval runners

5c · RAG / retrieval evaluation

5d · LLM-as-judge / reward / verifier libraries

5e · RL-environment / verifiable-reward toolkits (eval ⇄ training)

5f · Observability + eval platforms (tracing · datasets · online/offline · CI)

5g · Tracing standards

  • OpenInference — Arize — https://github.com/Arize-ai/openinference — semantic conventions for agent traces (tool/args/observation/latency/cost).

  • OpenTelemetry GenAI semantic conventions — https://opentelemetry.io/docs/specs/semconv/gen-ai/ (open-telemetry/semantic-conventions) — 🆕 the vendor-neutral schema (now covers agent orchestration, MCP tool calls, and a quality-evaluation span hook).

  • Braintrust — Braintrust — https://www.braintrust.dev/ · tool — Industry-standard eval+observability platform (Notion, Stripe, Vercel) tying offline experiments to production logs; the platform behind Braintrust's Autoevals. 🆕

  • RagaAI Catalyst — RagaAI — https://github.com/raga-ai-hub/RagaAI-Catalyst · tool — OSS agent-observability + eval SDK with multi-agent trace/execution-graph debugging, synthetic-data gen, and guardrail management; covers the online/guardrail-eval slice. 🆕

  • OpenAI Cookbook — Evals — OpenAI — https://developers.openai.com/cookbook/topic/evals · docs — Maintained, runnable recipes for building evals (incl. Agents SDK eval, evaluating agents with Langfuse); the practical companion to OpenAI Evals and a curator-grade "show real work" resource. 🆕 ⚠️ (unverified URL)

  • Building a better Bugbot — Stefan Heule et al. (Cursor) — https://cursor.com/blog/building-bugbot · excellent — Build your primary eval metric around post-merge signal: Cursor's "resolution rate" uses AI at PR merge time to determine which flagged bugs were actually fixed by the developer, validated by human spot-checks. After 40 major experiments spanning models, prompts, iteration counts, and agentic designs, resolution rate improved from 52% to over 70%; the single largest jump came from switching to a fully agentic architecture. BugBench (a curated set of real diffs with human-annotated bugs) drives offline iteration. (excerpt: "It uses AI to determine, at PR merge time, which bugs were actually resolved by the author in the final code. ... Since launch, we have run 40 major experiments that have increased Bugbot's resolution rate from 52% to over 70%.") 🆕

  • Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents — Scale AI — https://arxiv.org/abs/2605.21347 · paper/tool — 🆕 Automated system that analyzes large corpora of agent execution traces to surface failure patterns and behavioral issues at scale; "Human experts using IG reports improve scaffold performance by 30.4 pp over the unmodified baseline scaffold" — nearly doubling the gain from the next-best system. Fills the gap between laborious per-trace inspection and aggregate benchmark scores that hide population-level failure modes.
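The tracing entries above describe agent traces in terms of tool, args, observation, latency, and cost. As a rough sketch of why a shared shape for those fields helps, the snippet below rolls up a list of tool-call spans per tool. The `ToolSpan` class and its field names are hypothetical and chosen for illustration; they are not the attribute keys defined by OpenInference or the OpenTelemetry GenAI conventions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ToolSpan:
    """One tool call in an agent trace (hypothetical field names)."""
    tool: str
    args: dict
    observation: str
    latency_ms: float
    cost_usd: float


def summarize(spans: list[ToolSpan]) -> dict[str, dict[str, float]]:
    """Aggregate call count, total latency, and total cost per tool."""
    totals: dict[str, dict[str, float]] = defaultdict(
        lambda: {"calls": 0, "latency_ms": 0.0, "cost_usd": 0.0}
    )
    for span in spans:
        entry = totals[span.tool]
        entry["calls"] += 1
        entry["latency_ms"] += span.latency_ms
        entry["cost_usd"] += span.cost_usd
    return dict(totals)


if __name__ == "__main__":
    # Made-up trace for illustration only.
    trace = [
        ToolSpan("search", {"q": "refund policy"}, "3 hits", 120.0, 0.5),
        ToolSpan("search", {"q": "refund form"}, "1 hit", 80.0, 0.25),
        ToolSpan("db_write", {"id": 7}, "ok", 40.0, 0.0),
    ]
    for tool, stats in summarize(trace).items():
        print(tool, stats)
```

Once every span carries the same fields, corpus-level questions (which tool dominates cost, where latency concentrates) reduce to simple aggregations like this one rather than per-trace inspection.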

Must-reads: Inspect AI · promptfoo · Braintrust · verifiers · DeepEval · Phoenix/Langfuse (pick your observability) · RULER (judge-as-reward)

6 · Benchmark vs. eval (and benchmark integrity: contamination, saturation, label errors, leaderboard gaming)

Must-reads: Press · Kapoor et al. · OpenAI (SWE-bench Verified) · Leaderboard Illusion

7 · Evals & RL environments (verifiers, reward design, difficulty calibration, lifecycle)

(See also T2 — verifiers library, Lee's RL-env taxonomy, Garg's lifecycle, Wei's verifier's law.)

Must-reads: Lee (RL-env taxonomy) · Garg (lifecycle) · verifiers (repo)

8 · LLM-as-judge & verifiers (alignment, biases, verifiable vs judgeable)

Must-reads: Yan (llm-evaluators) · Hamel (llm-judge) · Shankar (EvalGen)

9 · Agent-specific evaluation (trajectories, tool use, multi-turn, world state, multi-agent, localization)

Must-reads: Anthropic (demystifying) · τ-bench · Lee (pass@k)

10 · Safety / adversarial evaluation (prompt injection, jailbreaks, action-authorization, benchmark auditing)

Must-reads: Dawn Song (BenchJack) · Anthropic (error bars)


🎙 Talks, podcasts & slides (transcribed + noted)

🎤 Conference & individual talks

🎙 Podcast episodes

🎓 University lectures

🖼 Slide decks

  • LLM benchmarks in the era of agents (deck) — Florian Brand — (local slide deck) · slides (TNG / Big Techday)
  • The Life Cycle of an RL Environment (deck) — Kanav Garg — (local slide deck) · slides (ACM CAIS 2026)

More eval talks, podcasts & lectures (annotated; deep notes in progress)

Discovered 58 more; transcription queued (YouTube rate-limit). 30 eval-focused + 28 eval-segments-in-agent-talks below.

🎯 Eval segments inside agent-building talks are in MENTIONS.md.

💬 Eval mentions

Resources that mention evals — agent-building posts & talks with a good eval segment — live in MENTIONS.md, kept out of the main list to preserve signal density.

Companies & landscape (eval / RL-environment market)

  • pavlovslist.com — https://pavlovslist.com/ · directory — The RL-environment / eval startups directory ("for the RL-pilled").
  • Environment labs / RL-env companies (the "environments are the new data" venture wave, via pavlovslist): BenchFlow (benchflow.ai — SkillsBench, ClawsBench, runtime), Prime Intellect (verifiers, Environments Hub), HUD, Mechanize, Plato, AfterQuery, Halluminate, Surge AI, Scale, Mercor.
  • Prime Intellect (verifiers, Florian Brand) · Braintrust · Arize (Phoenix/AX, OpenInference) · Galileo · LangChain / LangSmith (agentevals) · Sierra (τ-bench) · Core Automation (Kanav Garg) · Epoch AI (benchmark audits) · METR (autonomy/horizon) · FutureHouse (HLE audit) · UK AISI (Inspect).

Notes on provenance & gaps

  • Built by merging this project's research rounds (mining → adversarial verification → reference audit) with a /deep-research pass. Source detail lives in research/citations.md, research/findings.json, research/reference-audit.md, research/notes/, and the full link list in research/url-inventory.md (153 URLs).
  • Verified-high (deep-research, 3/3 votes): Verifier's Law, the verifiers library, EvalGen, Inspect AI, promptfoo, the ABC benchmark-rigor paper, plus lm-eval-harness, Autoevals, agentevals, AI Agents That Matter.
  • Flagged caveats: the MT-Bench 10/25 bias numbers are hedged by their own authors; Lee's "Agent Runtime" post URL and the WebArena/OSWorld/Terminal-Bench/Cybench links still need verification; the Kanav Garg talk is cited via a conference summary (no canonical primary URL yet).

Deep notes

This repo ships 146 deep reading notes in notes/ — structured summaries with key points, verbatim quotes, and themes, for the highest-signal sources:

Contributing

PRs welcome. Keep the bar high: show your work (real data/code/war-stories beat hot takes), give every entry a one-line why, verify the URL, and flag caveats. See CONTRIBUTING.md. Quality over quantity — a great list is as much about what it excludes.

License

CC0

To the extent possible under law, BenchFlow and contributors have waived all copyright and related rights to this work (CC0 1.0). The linked resources remain under their respective licenses.
