How to Actually Evaluate an LLM Feature

Traditional software has a comforting property: the same input produces the same output, and a test either passes or fails. Generative AI throws that out. The same prompt can yield different answers, “correct” is a spectrum rather than a boolean, and a change that fixes one case quietly breaks three others you never see. Evaluation is the discipline that replaces the certainty you lost — and most teams ship without it.

Vibes do not scale

Early on, everyone evaluates by hand: try a few prompts, eyeball the answers, ship if they look good. This works for exactly as long as the system is a toy. The moment you have real users and real changes, manual checking collapses. You cannot remember how the model handled an edge case last week, you cannot compare two prompt versions fairly, and you certainly cannot tell whether today’s “improvement” was a regression somewhere else. The first real engineering step is to write your evaluation down as cases and checks you can rerun after every change.

Build a golden set

Start with a dataset of representative inputs paired with what a good output looks like — a golden set. It does not need to be huge; fifty to a few hundred well-chosen cases covering your common paths and known failure modes beat ten thousand random ones. Mine it from real usage: the questions users actually asked, the inputs that broke, the edge cases support flagged. Every bug you fix should leave a new test behind so it can never return silently.
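As a minimal sketch of what a golden set can look like on disk, the example below stores one case per line as JSON and reruns all of them against the system under test. The field names (`case_id`, `input_text`, `must_contain`, `source`) and the `generate` callable are illustrative assumptions, not a standard format; substring matching is only the simplest possible notion of "a good output".

```python
import json
from dataclasses import dataclass, field


@dataclass
class GoldenCase:
    """One golden-set entry. All field names are hypothetical."""
    case_id: str
    input_text: str
    # Phrases a good answer is expected to include (a deliberately crude check).
    must_contain: list[str] = field(default_factory=list)
    # Where the case came from, e.g. "real_usage", "bug_fix", "support_flag".
    source: str = "real_usage"


def load_golden_set(jsonl_text: str) -> list[GoldenCase]:
    """Parse one JSON object per line; blank lines are skipped."""
    return [
        GoldenCase(**json.loads(line))
        for line in jsonl_text.splitlines()
        if line.strip()
    ]


def run_golden_set(cases, generate):
    """Run every case through `generate` and return the ids of failing cases."""
    failures = []
    for case in cases:
        answer = generate(case.input_text).lower()
        if not all(phrase.lower() in answer for phrase in case.must_contain):
            failures.append(case.case_id)
    return failures
```

Under this layout, fixing a bug means appending one more line to the file with `"source": "bug_fix"`, so the case is rerun on every later change.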

Three ways to grade, in increasing cost

Deterministic checks are free and fast where they apply: did the output parse as valid JSON, contain the required fields, stay under the length limit, avoid the banned phrase? Use them everywhere you can; they catch a surprising share of failures.
Model-graded evaluation uses a strong model as a judge: given the input, the answer, and a rubric, is this response faithful, relevant, and complete? It scales to subjective quality that rules cannot capture, but the judge has biases of its own (a preference for longer answers is a commonly reported one), so validate it against human ratings before you trust it.
Human review remains the gold standard for the cases that matter most. It is slow and expensive, so spend it deliberately: on a sample, on high-stakes outputs, and on calibrating your automated judges.
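The cheapest and the most expensive tiers can meet in a few lines of code. The sketch below assumes the feature is supposed to return a JSON object; the check names, the parameters, and the pass/fail label format are hypothetical. `deterministic_checks` covers the four checks named above, and `judge_agreement` is one simple way to compare an automated judge's verdicts with human ratings on the same sample.

```python
import json


def deterministic_checks(output, required_fields, max_chars, banned_phrases):
    """Return a list of failure reasons; an empty list means every check passed."""
    failures = []
    if len(output) > max_chars:
        failures.append("too_long")
    lowered = output.lower()
    for phrase in banned_phrases:
        if phrase.lower() in lowered:
            failures.append(f"banned_phrase:{phrase}")
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        failures.append("invalid_json")
        return failures
    if not isinstance(parsed, dict):
        failures.append("not_an_object")
        return failures
    for name in required_fields:
        if name not in parsed:
            failures.append(f"missing_field:{name}")
    return failures


def judge_agreement(judge_labels, human_labels):
    """Fraction of cases where the judge's pass/fail label matches the human's."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)
```

Raw agreement is a blunt measure: if almost every answer passes, a judge that always says "pass" scores well too. It is still a useful first number to look at before relying on a judge without human spot checks.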

Measure what users feel, not what is convenient

It is tempting to track metrics that are easy to compute rather than ones that matter. Faithfulness — does every claim trace back to a source? — usually matters more than fluency. For a support assistant, resolution rate beats response length. Pick a small set of metrics tied to the outcome you actually care about, and resist the urge to drown in dashboards that measure everything and inform nothing.

Close the loop in production

Offline evaluation tells you whether a change is safe to ship. Online signals tell you whether it worked. Capture thumbs-up and thumbs-down, track when users rephrase a question (a quiet signal the first answer failed), and watch the rate of “I don’t know” responses. Feed what you learn back into the golden set. The teams that win with generative AI are not the ones with the best prompts — they are the ones with the fastest loop from “this answer was wrong” to “this can never happen again.”
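The three online signals mentioned above can be computed from an ordinary interaction log. The following is a rough sketch under stated assumptions: each turn is a dict with hypothetical keys `session`, `question`, `answer`, and `feedback`, turns arrive in chronological order, and a rephrase is approximated as a question that closely resembles the previous one in the same session. The similarity threshold and the "I don't know" markers are guesses that would need tuning on real traffic.

```python
from difflib import SequenceMatcher

# Assumed phrasings; real answers may word this differently or use curly quotes.
IDK_MARKERS = ("i don't know", "i do not know", "i'm not sure")


def is_rephrase(previous_question, next_question, threshold=0.6):
    """Heuristic: the next question is textually close to the previous one."""
    a = previous_question.lower().strip()
    b = next_question.lower().strip()
    return SequenceMatcher(None, a, b).ratio() >= threshold


def online_signals(turns):
    """Summarize thumbs-down, "I don't know", and rephrase rates for a log."""
    total = len(turns)
    if total == 0:
        return {"thumbs_down_rate": 0.0, "idk_rate": 0.0, "rephrase_rate": 0.0}

    rated = [t for t in turns if t.get("feedback") in ("up", "down")]
    thumbs_down = sum(t["feedback"] == "down" for t in rated)
    idk = sum(any(m in t["answer"].lower() for m in IDK_MARKERS) for t in turns)

    rephrases = 0
    last_question = {}  # session id -> most recent question in that session
    for t in turns:
        previous = last_question.get(t["session"])
        if previous is not None and is_rephrase(previous, t["question"]):
            rephrases += 1
        last_question[t["session"]] = t["question"]

    return {
        "thumbs_down_rate": thumbs_down / len(rated) if rated else 0.0,
        "idk_rate": idk / total,
        "rephrase_rate": rephrases / total,
    }
```

Turns that trip any of these signals are natural candidates for review and, once a human has confirmed what a good answer looks like, for new golden-set cases.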
