v2.3.9

December 9, 2025

Operationalize model quality with Agent-as-Judge evaluations

A new built-in evaluation system lets you automate LLM quality checks with binary and numeric scoring, background execution, post-hooks, and customizable evaluator agents. This makes it easier to standardize evals, gate releases, and compare models — without bolting on external systems.
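For illustration, here is a minimal Python sketch of the binary-plus-numeric scoring idea. The names used (EvalResult, judge_response) are stand-ins for the built-in primitives, not the shipped API:

```python
from dataclasses import dataclass

# Illustrative sketch only: EvalResult and judge_response are hypothetical
# names standing in for the built-in evaluation primitives.

@dataclass
class EvalResult:
    passed: bool      # binary score: did the output meet the bar?
    score: float      # numeric score, e.g. 0.0-1.0 from the judge
    rationale: str    # judge's explanation, useful for audits

def judge_response(prompt: str, response: str) -> EvalResult:
    """Stand-in for an LLM-judge call that returns both score types."""
    # A real run would have the evaluator agent grade `response` against a
    # rubric; a deterministic placeholder keeps this sketch self-contained.
    score = min(len(response) / 100.0, 1.0)
    return EvalResult(passed=score >= 0.7, score=score,
                      rationale="placeholder rationale from the judge")

result = judge_response("Summarize the release notes.", "Agent-as-Judge adds ...")
print(result.passed, round(result.score, 2))
```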

Details

  • Run evaluations in the background to keep pipelines responsive
  • Use post-hooks to persist metrics, trigger alerts, or update dashboards
  • Create custom evaluator agents to encode domain-specific criteria (see the sketch after this list)
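
A sketch of how background execution, a post-hook, and a custom evaluator agent could fit together. Again, the names here (ConcisenessEvaluator, persist_metrics) are illustrative assumptions rather than the actual interface:

```python
from concurrent.futures import ThreadPoolExecutor, Future

# Hypothetical custom evaluator agent: a domain-specific rubric expressed as
# a callable. The shipped evaluator-agent interface may look different.
class ConcisenessEvaluator:
    def __call__(self, prompt: str, response: str) -> dict:
        score = 1.0 if len(response.split()) <= 50 else 0.4
        return {"passed": score >= 0.7, "score": score,
                "rationale": "penalizes answers longer than 50 words"}

def persist_metrics(future: Future) -> None:
    """Post-hook: fires once the evaluation finishes (persist, alert, update)."""
    result = future.result()
    print(f"[post-hook] passed={result['passed']} score={result['score']:.2f}")

# Background execution: submit the eval so the calling pipeline stays responsive.
with ThreadPoolExecutor(max_workers=2) as executor:
    evaluator = ConcisenessEvaluator()
    future = executor.submit(evaluator, "Explain post-hooks.",
                             "They run after an evaluation completes.")
    future.add_done_callback(persist_metrics)  # post-hook wiring
```

The post-hook in this sketch is just a completion callback; in practice it is the place to persist metrics, trigger alerts, or update dashboards once the judged result is available.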

Who this is for: AI platform teams, ML engineers, and QA leads who need consistent, auditable evaluation workflows at scale.