Evaluation (Evals)

Also known as: evals, benchmark

In one line

A structured test of an AI's accuracy, safety, or usefulness on specific tasks.

What does Evaluation (Evals) mean?

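Evals are essential for shipping AI products responsibly: they show whether a model or prompt change made outputs better or worse on the tasks you care about. They typically combine automated scoring (exact match, BLEU, LLM-as-judge) with human review to catch regressions.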
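
As a rough illustration of the automated-scoring side, exact match can be sketched in a few lines of Python. The function names (normalize, exact_match_accuracy) and the normalization rules are assumptions made for this example, not a standard; real eval suites usually define their own.

```python
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace (illustrative rules)."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)
```

Exact match is strict by design: under these rules "4" and "four" count as a miss, which is one reason it is often paired with softer metrics or human review.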

A real-world example

Re-running the same 200-question eval set every time you change the system prompt, to check that nothing regressed.
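
A minimal sketch of that regression check, assuming each run produces a pass/fail result per question. The function name find_regressions, the question IDs, and the sample results are all hypothetical and exist only for illustration.

```python
def find_regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Return IDs of questions that passed under the baseline but fail now.

    A question missing from the candidate run is treated as a failure.
    """
    return sorted(
        qid for qid, passed in baseline.items()
        if passed and not candidate.get(qid, False)
    )


# Hypothetical per-question results from two runs of the same eval set:
# one with the old system prompt, one with the new one.
baseline_results = {"q1": True, "q2": True, "q3": False}
candidate_results = {"q1": True, "q2": False, "q3": True}

regressions = find_regressions(baseline_results, candidate_results)
print(regressions)
```

Comparing per question rather than only the overall score matters here: both runs pass two of three questions, so the aggregate looks unchanged even though q2 regressed.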

Related terms