Testing
Building evals, metrics, and red-team suites to measure agent reliability
Summary
Agent testing measures statistical performance under non-determinism, not deterministic behavior. This section covers the full lifecycle: evaluation frameworks and metrics (pass@k, tool-routing accuracy), LLM-as-judge calibration, specific platforms (Braintrust, Promptfoo, Vitest), red-teaming for injection/jailbreak, observability, and CI/CD integration. Start with the framework definition, then pick a platform matching your workflow.
- Metrics: pass@k, pass^k, MTTR, tool-routing F1, schema conformance
- LLM-as-judge: bias calibration, when to use
- Platforms: Braintrust (managed), Promptfoo (red-team), Vitest (in-process)
- Red-teaming: injection, jailbreak, OWASP LLM Top 10
- Observability: OpenTelemetry GenAI conventions
- CI integration: PR-gated evals, budget guardrails
Agent testing is fundamentally different from traditional software testing. Traditional testing verifies deterministic behavior — given input X, does output Y occur every time? Agent testing measures statistical performance under non-determinism. An agent may take different paths, call tools in different orders, or succeed on the fifth attempt when it failed on the first. A single success or failure tells you nothing.
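To make the statistical framing concrete, here is a sketch of the two headline metrics, independent of any platform. `passAtK` uses the standard unbiased combinatorial estimator (the chance that at least one of k attempts drawn from n recorded attempts, c of which succeeded, is a success); `passPowK` is a hypothetical name for pass^k, the stricter metric that all k attempts succeed:

```typescript
// Unbiased estimator of pass@k: P(at least one success among k attempts
// sampled without replacement from n attempts, c of which succeeded).
function passAtK(n: number, c: number, k: number): number {
  if (n - c < k) return 1; // fewer than k failures: any k draws include a success
  let failAll = 1;
  for (let i = 0; i < k; i++) {
    failAll *= (n - c - i) / (n - i); // C(n-c, k) / C(n, k) as a running product
  }
  return 1 - failAll;
}

// pass^k: P(all k independent attempts succeed), estimated from the
// observed per-attempt success rate c / n.
function passPowK(n: number, c: number, k: number): number {
  return Math.pow(c / n, k);
}
```

Note how the two diverge: an agent that succeeds on 5 of 10 runs has pass@3 ≈ 0.92 but pass^3 ≈ 0.13. Which one matters depends on whether a human can retry on failure.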
This section covers the full testing lifecycle: from evaluation frameworks and metrics (pass@k, tool-routing accuracy) through LLM-as-judge techniques, specific platforms (Braintrust, Promptfoo, Vitest), red-teaming, observability instrumentation, and CI integration.
Quick reference
- Metrics — /docs/testing/metrics — covers pass@k, pass^k, MTTR, tool-routing F1, schema conformance, hallucination rates
- LLM-as-judge — /docs/testing/llm-as-judge — explains when to use judge models, their biases, and calibration techniques
- Evaluation framework — /docs/testing/evaluation-framework (already here) — dataset, task, scorer, aggregation loop
- Braintrust — /docs/testing/braintrust — end-to-end eval platform with dataset versioning and trace integration
- Promptfoo — /docs/testing/promptfoo — YAML-first evals with strong red-teaming, now part of OpenAI ecosystem
- Vitest harness — /docs/testing/vitest-harness — in-process evals, useful for fast iteration and local debugging
- Red-teaming — /docs/testing/red-teaming — prompt injection, jailbreak, data exfiltration attack templates; OWASP LLM Top 10 mapping
- Observability — /docs/testing/observability — OpenTelemetry GenAI semantic conventions for trace-driven debugging
- CI integration — /docs/testing/ci-integration — PR-gated evals, regression tests, statistical significance, budget guardrails
Start with the evaluation framework, then pick a platform that fits your workflow. If you're running Vitest-based agents, use the Vitest harness. If you want managed experiment tracking, use Braintrust. For quick red-teaming, Promptfoo is the fastest path.
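Whichever platform you choose, the underlying dataset → task → scorer → aggregation loop has the same shape. A minimal TypeScript sketch, where `fakeAgent`, `Case`, and `exactMatch` are illustrative stand-ins rather than any framework's API:

```typescript
type Case = { input: string; expected: string };
type Scorer = (output: string, expected: string) => number; // 0..1

// Stub task for illustration; swap in a real agent invocation.
async function fakeAgent(input: string): Promise<string> {
  return input.toUpperCase();
}

const exactMatch: Scorer = (out, exp) => (out === exp ? 1 : 0);

// Run every case k times and aggregate per-case pass rates plus the mean.
async function evaluate(dataset: Case[], scorer: Scorer, k: number) {
  const perCase: number[] = [];
  for (const c of dataset) {
    let passes = 0;
    for (let i = 0; i < k; i++) {
      passes += scorer(await fakeAgent(c.input), c.expected);
    }
    perCase.push(passes / k);
  }
  const mean = perCase.reduce((a, b) => a + b, 0) / perCase.length;
  return { perCase, mean };
}
```

In a real harness the dataset is versioned, the scorer may be an LLM judge, and the aggregation feeds a CI gate; the loop itself stays this simple.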