Getting Started

Start by choosing the harness that matches the runtime your app already uses. The harness runs your app and returns normalized output, transcript, tool calls, traces, usage, and errors that Vitest assertions, judges, and reports can read.
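As a rough illustration of what "runs your app and returns normalized output" can mean, the sketch below wraps a stand-in app and collects its output, tool calls, and errors into one object. Every name here (NormalizedRun, App, runNormalized, refundApp) is a hypothetical placeholder for illustration, not the library's actual API; the real harnesses also return a transcript, traces, and usage, which this sketch omits.

```typescript
// Hypothetical types for illustration only; not the vitest-evals API.
type ToolCall = { name: string; args: unknown };

type NormalizedRun<T> = {
  output: T | undefined;
  toolCalls: ToolCall[];
  errors: string[];
};

// The app receives a recorder so the wrapper can observe each tool call.
type App<T> = (
  input: string,
  recordTool: (call: ToolCall) => void,
) => Promise<T>;

// Run the app once and collect everything an assertion might need.
async function runNormalized<T>(
  app: App<T>,
  input: string,
): Promise<NormalizedRun<T>> {
  const toolCalls: ToolCall[] = [];
  const errors: string[] = [];
  let output: T | undefined;
  try {
    output = await app(input, (call) => {
      toolCalls.push(call);
    });
  } catch (error) {
    // Failures become data instead of crashing the run.
    errors.push(error instanceof Error ? error.message : String(error));
  }
  return { output, toolCalls, errors };
}

// Stand-in app that pretends to look up an invoice and refund it.
const refundApp: App<{ status: string }> = async (input, recordTool) => {
  recordTool({ name: "lookupInvoice", args: { query: input } });
  recordTool({ name: "createRefund", args: { query: input } });
  return { status: "approved" };
};
```

Because the result is plain data, ordinary assertions on the output and on the tool-call order need no special machinery.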

Choose a Harness

AI SDK: Use when your app calls generateText, streamText, or an AI SDK wrapper.
OpenAI Agents: Use when your app owns an Agent and runs it with a Runner.
Pi: Use when your app exposes a Pi agent, toolset, or runtime-compatible entrypoint.
Custom Harnesses: Use for workflows, service functions, CLIs, RAG pipelines, and custom agents.

Configure Vitest

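Keep evals on their own command and Vitest config. The separate config keeps longer provider timeouts, eval-only includes, reporter setup, and replay defaults out of unit tests. Add the scripts to package.json, then put the config in vitest.evals.config.ts, the file those scripts point to.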

{
  "scripts": {
    "evals": "vitest run --config vitest.evals.config.ts",
    "evals:record": "VITEST_EVALS_REPLAY_MODE=record vitest run --config vitest.evals.config.ts"
  }
}

import { defineConfig } from "vitest/config";

export default defineConfig({
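  test: {
    // Only pick up eval files, so unit tests stay out of this run.
    include: ["evals/**/*.eval.ts"],
    // Provider calls are slower than unit tests, so allow longer timeouts.
    testTimeout: 30_000,
    hookTimeout: 30_000,
    reporters: ["vitest-evals/reporter"],
    env: {
      // A script can override the mode; evals:record sets it to "record".
      VITEST_EVALS_REPLAY_MODE:
        process.env.VITEST_EVALS_REPLAY_MODE ?? "auto",
      VITEST_EVALS_REPLAY_DIR: ".vitest-evals/recordings",
    },
  },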
});
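The replay-mode fallback in the config relies on the ?? operator, which is easy to misread. The sketch below isolates that expression in a small function so its behavior can be checked on its own. The function name resolveReplayMode and the Env type are illustrative assumptions; only the "auto" and "record" values come from the config and scripts above.

```typescript
// A plain record stands in for process.env so the sketch needs no Node types.
type Env = Record<string, string | undefined>;

// Same expression as the config: ?? falls back only on undefined or null.
function resolveReplayMode(env: Env): string {
  return env.VITEST_EVALS_REPLAY_MODE ?? "auto";
}

// Unset variable: the default applies.
const defaultMode = resolveReplayMode({});

// Set by the evals:record script: the explicit value wins.
const recordMode = resolveReplayMode({ VITEST_EVALS_REPLAY_MODE: "record" });

// An empty string is not undefined, so it is passed through unchanged.
const emptyMode = resolveReplayMode({ VITEST_EVALS_REPLAY_MODE: "" });
```

One consequence worth noting: exporting the variable as an empty string does not restore the default, because ?? treats an empty string as a real value.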

Write the First Eval

Once the harness is configured, evals should look like ordinary Vitest tests: call run(input), assert deterministic behavior directly, and add judges when you need a score or rationale in reports.

import { openai } from "@ai-sdk/openai";
import { aiSdkJudgeHarness } from "@vitest-evals/harness-ai-sdk";
import { expect } from "vitest";
import {
  describeEval,
  FactualityJudge,
  ToolCallJudge,
  toolCalls,
} from "vitest-evals";
import { refundHarness } from "./refundHarness";

const judgeHarness = aiSdkJudgeHarness({
  model: openai("gpt-4.1-mini"),
  temperature: 0,
});
const factualityJudge = FactualityJudge({ judgeHarness });

describeEval("refund agent", { harness: refundHarness }, (it) => {
  it("approves a refundable invoice", async ({ run }) => {
    const result = await run("Refund invoice inv_123");

    expect(result.output).toMatchObject({ status: "approved" });
    expect(toolCalls(result).map((call) => call.name)).toEqual([
      "lookupInvoice",
      "createRefund",
    ]);
    await expect(result).toSatisfyJudge(ToolCallJudge({ ordered: true }), {
      expectedTools: ["lookupInvoice", "createRefund"],
    });
    await expect(result).toSatisfyJudge(factualityJudge, {
      expected:
        "Invoice inv_123 should be approved and refunded for the full amount.",
      threshold: 0.6,
    });
  });
});

Use Harnesses for adapter setup. Use Session Helpers for transcript and tool history helpers like toolCalls(...). Use Judges when a check should produce a score, threshold, or rationale.
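To show what "a score, threshold, or rationale" adds over a plain assertion, here is a minimal sketch of a deterministic judge-like function. It is not how ToolCallJudge is implemented; the names JudgeResult, judgeOrderedTools, and meetsThreshold, and the prefix-matching scoring rule, are assumptions chosen for illustration.

```typescript
// Hypothetical result shape; the library's judge results may differ.
type JudgeResult = { score: number; rationale: string };

// Score the fraction of expected tools that appear, in order, from the start.
function judgeOrderedTools(actual: string[], expected: string[]): JudgeResult {
  if (expected.length === 0) {
    const clean = actual.length === 0;
    return {
      score: clean ? 1 : 0,
      rationale: clean ? "No tools expected or called." : "Unexpected tool calls.",
    };
  }
  let matched = 0;
  while (matched < expected.length && actual[matched] === expected[matched]) {
    matched++;
  }
  return {
    score: matched / expected.length,
    rationale: `${matched} of ${expected.length} expected tools matched in order.`,
  };
}

// A threshold turns a graded score into a pass or fail decision.
function meetsThreshold(result: JudgeResult, threshold: number): boolean {
  return result.score >= threshold;
}

const full = judgeOrderedTools(
  ["lookupInvoice", "createRefund"],
  ["lookupInvoice", "createRefund"],
);
const partial = judgeOrderedTools(
  ["lookupInvoice"],
  ["lookupInvoice", "createRefund"],
);
```

An exact toEqual assertion would simply fail on the partial case; a scored check reports how close the run came and why, which is what makes it useful in reports.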

Use Vitest’s parameterized test APIs, such as it.for, when several scenarios share the same shape. Keep expected values in the row and pass judge criteria explicitly where they are used.

// Runs inside the describeEval callback from the first eval, which provides `it`.
it.for([
  {
    name: "approves refundable invoice",
    input: "Refund invoice inv_123",
    expectedStatus: "approved",
    expectedTools: ["lookupInvoice", "createRefund"],
    expectedFacts:
      "Invoice inv_123 should be approved and refunded for the full amount.",
  },
  {
    name: "denies non-refundable invoice",
    input: "Refund invoice inv_404",
    expectedStatus: "denied",
    expectedTools: ["lookupInvoice"],
    expectedFacts:
      "Invoice inv_404 should be denied because it is not refundable.",
  },
])("$name", async ({ input, expectedFacts, expectedStatus, expectedTools }, { run }) => {
  const result = await run(input);

  expect(result.output).toMatchObject({ status: expectedStatus });
  expect(toolCalls(result).map((call) => call.name)).toEqual(
    expectedTools,
  );
  await expect(result).toSatisfyJudge(factualityJudge, {
    expected: expectedFacts,
    threshold: 0.6,
  });
});

Use StructuredOutputJudge, ToolCallJudge, and FactualityJudge for built-in scored checks. Use Trace Helpers when the behavior you care about is represented by model, tool, or run spans.
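For span-based checks, the idea is to filter the trace by span kind and assert on what remains. The sketch below assumes a hypothetical span shape (kind, name, durationMs) and hypothetical helpers spansOfKind and totalDurationMs; the real Trace Helpers may expose different names and fields.

```typescript
// Hypothetical span shape for illustration; not the library's trace types.
type SpanKind = "model" | "tool" | "run";
type Span = { kind: SpanKind; name: string; durationMs: number };

// Keep only the spans of one kind, preserving their order.
function spansOfKind(spans: Span[], kind: SpanKind): Span[] {
  return spans.filter((span) => span.kind === kind);
}

// Sum the durations of a set of spans.
function totalDurationMs(spans: Span[]): number {
  return spans.reduce((total, span) => total + span.durationMs, 0);
}

// Made-up trace data for a single refund run.
const exampleSpans: Span[] = [
  { kind: "run", name: "refund agent", durationMs: 900 },
  { kind: "model", name: "plan", durationMs: 400 },
  { kind: "tool", name: "lookupInvoice", durationMs: 120 },
  { kind: "tool", name: "createRefund", durationMs: 180 },
];

const toolSpans = spansOfKind(exampleSpans, "tool");
const toolTimeMs = totalDurationMs(toolSpans);
```

A check like this suits questions the final output cannot answer, such as whether a tool ran at all or how much of the run was spent in tools.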

Next

Harnesses: Compare the supported runtime adapters before choosing one.
Judges: Use built-in judges, write custom judges, and set thresholds.
Utilities: Find helper APIs for session, tool-call, and trace checks.
Tool Replay: Record deterministic tool calls without hiding model behavior.
Local Report UI: Inspect JSON eval artifacts, transcripts, tool calls, and traces.
GitHub Reporting: Publish eval summaries and checks from workflow JSON output.
