Techniques

Evaluation (Evals)

Also known as: evals, benchmarks

In one line

A structured test of an AI's accuracy, safety, or usefulness on specific tasks.

What does Evaluation (Evals) mean?

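An eval runs a model against a fixed set of inputs and scores its outputs, so that quality can be compared before and after a change to the model, prompt, or surrounding code. Evals typically mix automated scoring (exact match, BLEU, LLM-as-judge) with human review to catch regressions, which makes them essential for shipping AI products responsibly.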
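As a rough illustration of the simplest automated scorer mentioned above, exact match, here is a minimal sketch. The function names (normalize, exact_match, accuracy) and the normalization rules (lowercasing, stripping punctuation, collapsing whitespace) are assumptions chosen for the example, not a standard; real eval suites often define their own normalization.

```python
import string


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace.

    These rules are an assumption for this sketch; pick whatever
    normalization matches your task.
    """
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> bool:
    """True when the normalized prediction equals the normalized reference."""
    return normalize(prediction) == normalize(reference)


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return hits / len(predictions)
```

Exact match is cheap and deterministic, but it only suits tasks with a single short correct answer. Open-ended outputs usually need an overlap metric such as BLEU, an LLM-as-judge, or human review.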

A real-world example

Running a 200-question eval set every time you change the system prompt, to check that nothing has regressed.
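The workflow in this example can be sketched as a small regression check. Everything here is hypothetical: EVAL_SET stands in for the 200-question set, fake_model stands in for a real model call, and run_eval and has_regressed are illustrative helper names. The scoring is a plain string comparison, which is only one of several possible choices.

```python
from typing import Callable

# Tiny stand-in for a real eval set of (question, expected answer) pairs.
EVAL_SET = [
    ("What is 2 + 2?", "4"),
    ("Capital of France?", "Paris"),
    ("Opposite of hot?", "cold"),
]


def run_eval(
    model: Callable[[str, str], str],
    system_prompt: str,
    eval_set: list[tuple[str, str]],
) -> float:
    """Return the fraction of questions the model answers exactly right."""
    passed = sum(
        model(system_prompt, question).strip() == expected
        for question, expected in eval_set
    )
    return passed / len(eval_set)


def has_regressed(baseline: float, candidate: float, tolerance: float = 0.0) -> bool:
    """True when the candidate score falls below the baseline by more than tolerance."""
    return candidate < baseline - tolerance


def fake_model(system_prompt: str, question: str) -> str:
    """Hypothetical model stub: a 'verbose' prompt makes it pad its answers."""
    answers = {q: a for q, a in EVAL_SET}
    answer = answers[question]
    if "verbose" in system_prompt:
        return f"The answer is {answer}."
    return answer


baseline_score = run_eval(fake_model, "Be concise.", EVAL_SET)
candidate_score = run_eval(fake_model, "Be verbose.", EVAL_SET)

if has_regressed(baseline_score, candidate_score):
    print(f"Regression: {baseline_score:.2f} -> {candidate_score:.2f}")
```

In practice the same check would run automatically whenever the prompt changes, with the baseline score stored from the last accepted version and a small tolerance to absorb run-to-run noise.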

Related terms

Large Language Model (LLM)

A neural network trained on huge text collections to predict the next word — the engine behind ChatGPT, Claude and Gemini.

Training

The expensive process of teaching a model by adjusting its weights on huge amounts of data.

Alignment

The problem of making AI do what humans actually want — safely and helpfully.