Agent eval suite

Evals before trust: golden cases, edge cases, and unsafe-action checks

Every scaffold ships with an eval suite, a mock connector pack, and a runnable eval plan. An execution phase then runs that plan offline and reports pass, fail, or blocked for each case.

1. A check that is declared but never exercised by a case shows as untested. That is a real gap, not just a missing checklist item.

2. If the minimum threshold is missing or unreadable, the suite conservatively assumes a 100% pass requirement rather than skipping the check.

3. Mock execution proves the scaffold's wiring is sound. It does not grade genuine agent reasoning against a live system.
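
The first two rules can be sketched in a few lines of Python. This is a minimal illustration, not the product's implementation: the function names, the `exercises` key on a case, and the choice to express the threshold as a fraction between 0 and 1 are all assumptions made for the example.

```python
from __future__ import annotations

from typing import Any, Iterable, Mapping

# Assumption: thresholds are stored as a fraction, so 1.0 means 100%.
DEFAULT_THRESHOLD = 1.0


def resolve_threshold(raw: Any) -> float:
    """Return the minimum pass rate, falling back to 100% when it is
    missing or unreadable instead of skipping the check."""
    if raw is None or isinstance(raw, bool):
        return DEFAULT_THRESHOLD
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return DEFAULT_THRESHOLD
    # Out-of-range values (and NaN, which fails every comparison)
    # are treated as unreadable.
    if not 0.0 <= value <= 1.0:
        return DEFAULT_THRESHOLD
    return value


def untested_checks(
    declared: Iterable[str],
    cases: Iterable[Mapping[str, Any]],
) -> list[str]:
    """Return declared check ids that no case exercises.

    Each case is assumed to list the checks it exercises under a
    hypothetical "exercises" key.
    """
    exercised: set[str] = set()
    for case in cases:
        exercised.update(case.get("exercises", ()))
    return sorted(set(declared) - exercised)
```

For example, `resolve_threshold("n/a")` falls back to `1.0`, and a suite that declares two checks but exercises only one reports the other as untested.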

What the eval suite contains

  • Golden cases — clean, expected-path examples seeded from real records where available.
  • Edge cases — partial data, new entities, mismatches, duplicates, and other exceptions.
  • Pass/fail rubric — how a case result is judged.
  • Unsafe-action checks — tied to every write/action tool, and checked for actual test coverage, not just declaration.
  • Minimum pilot threshold — the pass rate required before the pilot is trusted.
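
One way to picture these five parts is as a small data model. The sketch below is illustrative only: every class and field name is hypothetical, and the rubric shown (exact match on the expected fields) is just one simple choice among many.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class EvalCase:
    case_id: str
    kind: str                      # "golden" or "edge"
    inputs: dict[str, Any]
    expected: dict[str, Any]
    # Ids of unsafe-action checks this case exercises (hypothetical field).
    exercises: list[str] = field(default_factory=list)


@dataclass
class UnsafeActionCheck:
    check_id: str
    tool: str                      # the write/action tool the check is tied to
    description: str


@dataclass
class EvalSuite:
    cases: list[EvalCase]
    unsafe_action_checks: list[UnsafeActionCheck]
    # Pass rate required before the pilot is trusted, as a fraction.
    min_pilot_threshold: float = 1.0


def judge(case: EvalCase, actual: dict[str, Any]) -> bool:
    """Pass/fail rubric for this sketch: every expected field must be
    present in the actual output with the same value."""
    return all(
        key in actual and actual[key] == value
        for key, value in case.expected.items()
    )
```

Keeping `exercises` on the case, rather than on the check, is what lets coverage be computed from the cases that actually run instead of from the declarations.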

Mock execution verdicts

Verdict | What it means
Mock eval passed | Every executable case passed and the threshold was met
Mock eval failed | The suite ran, but the pass rate fell short of the threshold
Blocked | The validation wasn't ready to run, had no cases, or the mock connector pack wasn't ready
Unsafe to run | A write/action tool has an unsafe-action check that no case actually exercises
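
The verdicts above can be read as a small decision function. The sketch below is a guess at how the rows might combine, not a description of the real logic. The order of precedence (unsafe first, then blocked, then the pass-rate comparison), the parameter names, and the treatment of per-case "blocked" results as non-executable are all assumptions. It also assumes that meeting the threshold is enough for a pass even when the threshold is below 100% and some cases fail, a situation the table leaves open.

```python
from enum import Enum
from typing import Sequence


class Verdict(Enum):
    PASSED = "Mock eval passed"
    FAILED = "Mock eval failed"
    BLOCKED = "Blocked"
    UNSAFE = "Unsafe to run"


def mock_verdict(
    results: Sequence[str],            # per-case: "pass", "fail", or "blocked"
    threshold: float,                  # required pass rate as a fraction
    *,
    plan_ready: bool = True,
    connectors_ready: bool = True,
    untested_write_checks: Sequence[str] = (),
) -> Verdict:
    # Assumed precedence: an unexercised unsafe-action check on a
    # write/action tool overrides everything else.
    if untested_write_checks:
        return Verdict.UNSAFE

    # Only cases that actually ran count toward the pass rate.
    executable = [r for r in results if r in ("pass", "fail")]
    if not plan_ready or not connectors_ready or not executable:
        return Verdict.BLOCKED

    pass_rate = executable.count("pass") / len(executable)
    return Verdict.PASSED if pass_rate >= threshold else Verdict.FAILED
```

With a threshold of `1.0`, a single failing case is enough to produce "Mock eval failed", which matches the conservative default described earlier.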
