# vitest-evals **Repository Path**: mirrors_getsentry/vitest-evals ## Basic Information - **Project Name**: vitest-evals - **Description**: A vitest extension for running evals. - **Primary Language**: Unknown - **License**: Apache-2.0 - **Default Branch**: main - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2025-04-12 - **Last Updated**: 2026-03-22 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # vitest-evals End-to-end evaluation framework for AI agents, built on Vitest. ## Installation ```shell npm install -D vitest-evals ``` ## Quick Start ```javascript import { describeEval } from "vitest-evals"; describeEval("deploy agent", { data: async () => [ { input: "Deploy the latest release to production", expected: "deployed" }, { input: "Roll back the last deploy", expected: "rolled back" }, ], task: async (input) => { const response = await myAgent.run(input); return response; }, scorers: [ async ({ output, expected }) => ({ score: output.toLowerCase().includes(expected.toLowerCase()) ? 1.0 : 0.0, }), ], threshold: 0.8, }); ``` ## Tasks Tasks process inputs and return outputs. Two formats are supported: ```javascript // Simple: just return a string const task = async (input) => "response"; // With tool tracking: return a TaskResult const task = async (input) => ({ result: "response", toolCalls: [ { name: "search", arguments: { query: "..." }, result: {...} } ] }); ``` ## Test Data Each test case requires an `input` field. Use `name` to give tests a descriptive label: ```javascript data: async () => [ { name: "simple deploy", input: "Deploy to staging" }, { name: "deploy with rollback", input: "Deploy to prod, roll back if errors" }, ], ``` Additional fields (like `expected`, `expectedTools`) are passed through to scorers. ## Lifecycle Hooks Use `beforeEach` and `afterEach` for setup and teardown: ```javascript describeEval("agent with database", { beforeEach: async () => { await db.seed(); }, afterEach: async () => { await db.clean(); }, data: async () => [{ input: "Find recent errors" }], task: myAgentTask, scorers: [async ({ output }) => ({ score: output.includes("error") ? 1.0 : 0.0 })], }); ``` ## Scorers Scorers evaluate outputs and return a score (0-1). Use built-in scorers or create your own. ### ToolCallScorer Evaluates if the expected tools were called with correct arguments. ```javascript import { ToolCallScorer } from "vitest-evals"; describeEval("tool usage", { data: async () => [ { input: "Find Italian restaurants", expectedTools: [ { name: "search", arguments: { type: "restaurant" } }, { name: "filter", arguments: { cuisine: "italian" } }, ], }, ], task: myTask, scorers: [ToolCallScorer()], }); // Strict order and parameters scorers: [ToolCallScorer({ ordered: true, params: "strict" })]; // Flexible evaluation scorers: [ToolCallScorer({ requireAll: false, allowExtras: false })]; ``` **Default behavior:** - Strict parameter matching (exact equality required) - Any order allowed - Extra tools allowed - All expected tools required ### StructuredOutputScorer Evaluates if the output matches expected structured data (JSON). ```javascript import { StructuredOutputScorer } from "vitest-evals"; describeEval("query generation", { data: async () => [ { input: "Show me errors from today", expected: { dataset: "errors", query: "", sort: "-timestamp", timeRange: { statsPeriod: "24h" }, }, }, ], task: myTask, scorers: [StructuredOutputScorer()], }); // Fuzzy matching scorers: [StructuredOutputScorer({ match: "fuzzy" })]; // Custom validation scorers: [ StructuredOutputScorer({ match: (expected, actual, key) => { if (key === "age") return actual >= 18 && actual <= 100; return expected === actual; }, }), ]; ``` ### Custom Scorers ```javascript // Inline scorer const LengthScorer = async ({ output }) => ({ score: output.length > 50 ? 1.0 : 0.0, }); // TypeScript scorer with custom options import { type ScoreFn, type BaseScorerOptions } from "vitest-evals"; interface CustomOptions extends BaseScorerOptions { minLength: number; } const TypedScorer: ScoreFn = async (opts) => ({ score: opts.output.length >= opts.minLength ? 1.0 : 0.0, }); ``` ## AI SDK Integration See [`src/ai-sdk-integration.test.ts`](src/ai-sdk-integration.test.ts) for a complete example with the Vercel AI SDK. Transform provider responses to our format: ```javascript const { text, steps } = await generateText({ model: openai("gpt-4o"), prompt: input, tools: { myTool: myToolDefinition }, }); return { result: text, toolCalls: steps .flatMap((step) => step.toolCalls) .map((call) => ({ name: call.toolName, arguments: call.args, })), }; ``` ## Advanced Usage ### Using autoevals For evaluation using the autoevals library: ```javascript import { Factuality, ClosedQA } from "autoevals"; scorers: [ Factuality, ClosedQA.partial({ criteria: "Does the answer mention Paris?", }), ]; ``` ### Skip Tests Conditionally ```javascript describeEval("gpt-4 tests", { skipIf: () => !process.env.OPENAI_API_KEY, // ... }); ``` ### Existing Test Suites For integration with existing Vitest test suites, you can use the `.toEval()` matcher: > **Deprecated**: The `.toEval()` helper is deprecated. Use `describeEval()` instead for better test organization and multiple scorers support. ```javascript import "vitest-evals"; test("capital check", () => { const simpleFactuality = async ({ output, expected }) => ({ score: output.toLowerCase().includes(expected.toLowerCase()) ? 1.0 : 0.0, }); expect("What is the capital of France?").toEval( "Paris", answerQuestion, simpleFactuality, 0.8 ); }); ``` ## Configuration ### Separate Eval Configuration Create `vitest.evals.config.ts`: ```javascript import { defineConfig } from "vitest/config"; import defaultConfig from "./vitest.config"; export default defineConfig({ ...defaultConfig, test: { ...defaultConfig.test, include: ["src/**/*.eval.{js,ts}"], }, }); ``` Run evals separately: ```shell vitest --config=vitest.evals.config.ts ``` ## Development ```shell pnpm install pnpm test ```