Getting Started

Start by choosing the harness that matches the runtime your app already uses. The harness runs your app and returns normalized output, transcript, tool calls, traces, usage, and errors that Vitest assertions, judges, and reports can read.
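To make "normalized" concrete, here is a minimal sketch of how a harness might fold raw runtime events into a single result object. Every name in it (`RawEvent`, `SketchRunResult`, `normalizeRun`) is hypothetical and invented for illustration. The actual vitest-evals result type may be shaped differently and also carries traces and usage, which are left out here for brevity.

```typescript
// Hypothetical event and result shapes; not the real vitest-evals types.
type RawEvent =
  | { kind: "message"; role: "user" | "assistant"; text: string }
  | { kind: "tool"; name: string; args: Record<string, unknown> }
  | { kind: "error"; message: string };

interface SketchRunResult {
  output: string;
  transcript: { role: "user" | "assistant"; content: string }[];
  toolCalls: { name: string; arguments: Record<string, unknown> }[];
  errors: string[];
}

// Fold a stream of runtime-specific events into one normalized result.
function normalizeRun(events: RawEvent[]): SketchRunResult {
  const result: SketchRunResult = {
    output: "",
    transcript: [],
    toolCalls: [],
    errors: [],
  };
  for (const event of events) {
    switch (event.kind) {
      case "message":
        result.transcript.push({ role: event.role, content: event.text });
        // Assumption: the last assistant message is treated as the output.
        if (event.role === "assistant") result.output = event.text;
        break;
      case "tool":
        result.toolCalls.push({ name: event.name, arguments: event.args });
        break;
      case "error":
        result.errors.push(event.message);
        break;
    }
  }
  return result;
}

const sketchResult = normalizeRun([
  { kind: "message", role: "user", text: "Refund invoice inv_123" },
  { kind: "tool", name: "lookupInvoice", args: { id: "inv_123" } },
  { kind: "tool", name: "createRefund", args: { id: "inv_123" } },
  { kind: "message", role: "assistant", text: "Refund approved" },
]);
```

Because every runtime ends up in the same shape, assertions and judges can read `output`, `toolCalls`, and `errors` without knowing which runtime produced them.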

Keep evals on their own command and Vitest config. The separate config keeps longer provider timeouts, eval-only includes, reporter setup, and replay defaults out of unit tests.

package.json
{
  "scripts": {
    "evals": "vitest run --config vitest.evals.config.ts",
    "evals:record": "VITEST_EVALS_REPLAY_MODE=record vitest run --config vitest.evals.config.ts"
  }
}
vitest.evals.config.ts
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    include: ["evals/**/*.eval.ts"],
    testTimeout: 30_000,
    hookTimeout: 30_000,
    reporters: ["vitest-evals/reporter"],
    env: {
      VITEST_EVALS_REPLAY_MODE:
        process.env.VITEST_EVALS_REPLAY_MODE ?? "auto",
      VITEST_EVALS_REPLAY_DIR: ".vitest-evals/recordings",
    },
  },
});

Once the harness is configured, evals should look like ordinary Vitest tests: call run(input), assert deterministic behavior directly, and add judges when you need a score or rationale in reports.

evals/refund.eval.ts
import { openai } from "@ai-sdk/openai";
import { aiSdkJudgeHarness } from "@vitest-evals/harness-ai-sdk";
import { expect } from "vitest";
import {
  describeEval,
  FactualityJudge,
  ToolCallJudge,
  toolCalls,
} from "vitest-evals";
import { refundHarness } from "./refundHarness";

const judgeHarness = aiSdkJudgeHarness({
  model: openai("gpt-4.1-mini"),
  temperature: 0,
});
const factualityJudge = FactualityJudge({ judgeHarness });

describeEval("refund agent", { harness: refundHarness }, (it) => {
  it("approves a refundable invoice", async ({ run }) => {
    const result = await run("Refund invoice inv_123");

    expect(result.output).toMatchObject({ status: "approved" });
    expect(toolCalls(result).map((call) => call.name)).toEqual([
      "lookupInvoice",
      "createRefund",
    ]);

    await expect(result).toSatisfyJudge(ToolCallJudge({ ordered: true }), {
      expectedTools: ["lookupInvoice", "createRefund"],
    });
    await expect(result).toSatisfyJudge(factualityJudge, {
      expected:
        "Invoice inv_123 should be approved and refunded for the full amount.",
      threshold: 0.6,
    });
  });
});

Use Harnesses for adapter setup. Use Session Helpers for transcript and tool history helpers like toolCalls(...). Use Judges when a check should produce a score, threshold, or rationale.
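As a rough illustration of what "a score, threshold, or rationale" means in practice, the sketch below scores an ordered tool-call comparison and checks it against a threshold. It is a self-contained approximation, not the implementation of `ToolCallJudge`. The names `SketchJudgeResult`, `orderedToolCallScore`, and `meetsThreshold` are hypothetical, and the scoring rule (matched prefix length divided by the longer of the two lists) is an assumption chosen for simplicity.

```typescript
// Hypothetical judge result shape; the real vitest-evals judges may differ.
interface SketchJudgeResult {
  score: number; // 0..1
  rationale: string;
}

// Score how much of the expected tool sequence was followed, in order.
function orderedToolCallScore(
  actual: string[],
  expected: string[],
): SketchJudgeResult {
  if (expected.length === 0) {
    // With nothing expected, only an empty call list is a perfect match.
    return {
      score: actual.length === 0 ? 1 : 0,
      rationale: `expected no tool calls, saw ${actual.length}`,
    };
  }
  // Count leading positions where actual and expected agree.
  let matched = 0;
  while (matched < expected.length && actual[matched] === expected[matched]) {
    matched++;
  }
  // Dividing by the longer list penalizes both missing and extra calls.
  const score = matched / Math.max(expected.length, actual.length);
  return {
    score,
    rationale: `${matched} of ${expected.length} expected tools were called in order`,
  };
}

// A judged check passes when its score reaches the threshold.
function meetsThreshold(result: SketchJudgeResult, threshold: number): boolean {
  return result.score >= threshold;
}

const fullMatch = orderedToolCallScore(
  ["lookupInvoice", "createRefund"],
  ["lookupInvoice", "createRefund"],
);
const partialMatch = orderedToolCallScore(
  ["lookupInvoice"],
  ["lookupInvoice", "createRefund"],
);
```

Unlike a plain `toEqual` assertion, a check of this kind can report partial credit and explain why, which is what makes it useful in a report.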

Use Vitest’s parameterized case APIs, such as it.for, when several scenarios share the same shape. Keep expected values in each row and pass judge criteria explicitly at the point where they are used.

evals/refund.eval.ts
// Place this inside the describeEval(...) callback from the previous example,
// so that `it` supplies the `run` fixture and `factualityJudge` is in scope.
it.for([
  {
    name: "approves refundable invoice",
    input: "Refund invoice inv_123",
    expectedStatus: "approved",
    expectedTools: ["lookupInvoice", "createRefund"],
    expectedFacts:
      "Invoice inv_123 should be approved and refunded for the full amount.",
  },
  {
    name: "denies non-refundable invoice",
    input: "Refund invoice inv_404",
    expectedStatus: "denied",
    expectedTools: ["lookupInvoice"],
    expectedFacts:
      "Invoice inv_404 should be denied because it is not refundable.",
  },
])(
  "$name",
  async ({ input, expectedFacts, expectedStatus, expectedTools }, { run }) => {
    const result = await run(input);

    expect(result.output).toMatchObject({ status: expectedStatus });
    expect(toolCalls(result).map((call) => call.name)).toEqual(
      expectedTools,
    );

    await expect(result).toSatisfyJudge(factualityJudge, {
      expected: expectedFacts,
      threshold: 0.6,
    });
  },
);

Use StructuredOutputJudge, ToolCallJudge, and FactualityJudge for built-in scored checks. Use Trace Helpers when the behavior you care about is represented by model, tool, or run spans.