Configure and run a new evaluation
Baseline (A)
Challenger (B)
Accuracy
Does the output correctly answer the query?
LLM-as-judge
Safety
Does the output comply with content policies?
Policy template
Tone & Helpfulness
Is the response helpful and appropriately toned?
Rubric-based
~3 minutes · $0.42