Getting Started
Start by choosing the harness that matches the runtime your app already uses. The harness runs your app and returns normalized output, transcript, tool calls, traces, usage, and errors that Vitest assertions, judges, and reports can read.
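To make the idea of a normalized result concrete, here is a minimal TypeScript sketch. The `SketchResult` and `SketchToolCall` types and the `toolNames` helper are hypothetical stand-ins invented for illustration; they are not the library's types, and the real result may use different field names.

```typescript
// Hypothetical shapes for illustration only; not the vitest-evals types.
interface SketchToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

interface SketchResult {
  output: unknown;
  toolCalls: SketchToolCall[];
  usage: { inputTokens: number; outputTokens: number };
  errors: string[];
}

// Return tool names in call order so a test can assert on them directly.
function toolNames(result: SketchResult): string[] {
  return result.toolCalls.map((call) => call.name);
}

// A made-up result, shaped like what a harness might hand back.
const example: SketchResult = {
  output: { status: "approved" },
  toolCalls: [
    { name: "lookupInvoice", arguments: { invoiceId: "inv_123" } },
    { name: "createRefund", arguments: { invoiceId: "inv_123" } },
  ],
  usage: { inputTokens: 120, outputTokens: 45 },
  errors: [],
};

console.log(toolNames(example));
```

Because every harness reduces its runtime to one shape like this, assertions and reports do not need to know which runtime produced the result.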
Choose a Harness
AI SDK: Use when your app calls generateText, streamText, or an AI SDK wrapper.
OpenAI Agents: Use when your app owns an Agent and runs it with a Runner.
Pi: Use when your app exposes a Pi agent, toolset, or runtime-compatible entrypoint.
Custom Harnesses: Use for workflows, service functions, CLIs, RAG pipelines, and custom agents.

Configure Vitest
Keep evals on their own command and Vitest config. The separate config keeps longer provider timeouts, eval-only includes, reporter setup, and replay defaults out of unit tests.
package.json:

    {
      "scripts": {
        "evals": "vitest run --config vitest.evals.config.ts",
        "evals:record": "VITEST_EVALS_REPLAY_MODE=record vitest run --config vitest.evals.config.ts"
      }
    }

vitest.evals.config.ts:

    import { defineConfig } from "vitest/config";

    export default defineConfig({
      test: {
        // Only pick up eval files, so unit tests stay on the default config.
        include: ["evals/**/*.eval.ts"],
        // Longer timeouts to allow for provider calls.
        testTimeout: 30_000,
        hookTimeout: 30_000,
        reporters: ["vitest-evals/reporter"],
        env: {
          // Replay defaults; the evals:record script overrides the mode.
          VITEST_EVALS_REPLAY_MODE: process.env.VITEST_EVALS_REPLAY_MODE ?? "auto",
          VITEST_EVALS_REPLAY_DIR: ".vitest-evals/recordings",
        },
      },
    });

Write the First Eval
Once the harness is configured, evals should look like ordinary Vitest tests:
call run(input), assert deterministic behavior directly, and add judges when
you need a score or rationale in reports.

    import { openai } from "@ai-sdk/openai";
    import { aiSdkJudgeHarness } from "@vitest-evals/harness-ai-sdk";
    import { expect } from "vitest";
    import {
      describeEval,
      FactualityJudge,
      ToolCallJudge,
      toolCalls,
    } from "vitest-evals";
    import { refundHarness } from "./refundHarness";

    // The judge runs on its own model, separate from the app under test.
    const judgeHarness = aiSdkJudgeHarness({
      model: openai("gpt-4.1-mini"),
      temperature: 0,
    });
    const factualityJudge = FactualityJudge({ judgeHarness });

    describeEval("refund agent", { harness: refundHarness }, (it) => {
      it("approves a refundable invoice", async ({ run }) => {
        const result = await run("Refund invoice inv_123");

        // Deterministic checks: assert directly on output and tool-call order.
        expect(result.output).toMatchObject({ status: "approved" });
        expect(toolCalls(result).map((call) => call.name)).toEqual([
          "lookupInvoice",
          "createRefund",
        ]);

        // Scored checks: judges add a score or rationale to reports.
        await expect(result).toSatisfyJudge(ToolCallJudge({ ordered: true }), {
          expectedTools: ["lookupInvoice", "createRefund"],
        });
        await expect(result).toSatisfyJudge(factualityJudge, {
          expected:
            "Invoice inv_123 should be approved and refunded for the full amount.",
          threshold: 0.6,
        });
      });
    });

Use Harnesses for adapter setup. Use
Session Helpers for transcript and tool
history helpers like toolCalls(...). Use Judges when a check
should produce a score, threshold, or rationale.
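As a rough illustration of how a scored check differs from a direct assertion, the sketch below models a judge as a function that returns a score and a rationale, then compares the score to a threshold. The `SketchVerdict` and `SketchJudge` types, the toy `keywordJudge`, and the `passes` helper are all hypothetical; the assumption that a check passes when the score is at or above the threshold is ours, not something the library documents here.

```typescript
// Hypothetical judge shapes for illustration; not the vitest-evals API.
interface SketchVerdict {
  score: number; // between 0 and 1
  rationale: string;
}

type SketchJudge = (output: string, expected: string) => SketchVerdict;

// Toy judge: scores the fraction of expected keywords found in the output.
// A real judge would typically call a model instead.
const keywordJudge: SketchJudge = (output, expected) => {
  const keywords = expected.toLowerCase().split(/\s+/).filter(Boolean);
  const text = output.toLowerCase();
  const hits = keywords.filter((word) => text.includes(word));
  const score = keywords.length === 0 ? 1 : hits.length / keywords.length;
  return {
    score,
    rationale: `matched ${hits.length} of ${keywords.length} keywords`,
  };
};

// Assumption: a check passes when the score meets or exceeds the threshold.
function passes(verdict: SketchVerdict, threshold: number): boolean {
  return verdict.score >= threshold;
}

const verdict = keywordJudge(
  "Invoice inv_123 was approved and refunded",
  "approved refunded",
);
console.log(verdict.rationale, passes(verdict, 0.6));
```

A direct `expect(...)` either holds or fails; a judge leaves behind a score and a rationale that a report can show next to the pass or fail result.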
Use Vitest’s case APIs when several scenarios share the same shape. Keep expected values in the row and pass judge criteria explicitly where they are used.

    it.for([
      {
        name: "approves refundable invoice",
        input: "Refund invoice inv_123",
        expectedStatus: "approved",
        expectedTools: ["lookupInvoice", "createRefund"],
        expectedFacts:
          "Invoice inv_123 should be approved and refunded for the full amount.",
      },
      {
        name: "denies non-refundable invoice",
        input: "Refund invoice inv_404",
        expectedStatus: "denied",
        expectedTools: ["lookupInvoice"],
        expectedFacts:
          "Invoice inv_404 should be denied because it is not refundable.",
      },
    ])(
      "$name",
      async ({ input, expectedFacts, expectedStatus, expectedTools }, { run }) => {
        const result = await run(input);

        expect(result.output).toMatchObject({ status: expectedStatus });
        expect(toolCalls(result).map((call) => call.name)).toEqual(
          expectedTools,
        );
        await expect(result).toSatisfyJudge(factualityJudge, {
          expected: expectedFacts,
          threshold: 0.6,
        });
      },
    );

Use StructuredOutputJudge, ToolCallJudge, and FactualityJudge for built-in scored checks. Use Trace Helpers when the behavior you care about is represented by model, tool, or run spans.