Fight Evils with Evals!
Benchmarks measure benchmarks. Your system needs its own measures.
Every new model arrives wearing a tuxedo of benchmarks.
MMLU: 92.4%. HumanEval: 87.2%. LLeMU: 88.7%. MATH: 73.6%. AGI: 127%!
Yet for 99% of businesses building products and processes with AI, none of it matters.
What matters? How are YOUR workloads doing? Getting better or worse? The only sane way to know that is to write Evals (tests for LLMs) that reflect the specific tasks, data, and failure modes of your system.
The benchmarks are not lying. They are answering someone else’s question.
What “Vibes-Based Evaluation” Actually Costs
The standard approach: ship a model change, watch the complaint channels, roll back if the room gets loud.
That misses almost everything interesting:
You only catch loud failures. Users who get a confidently wrong answer and don’t realize it? Silent. Users who get a worse answer and abandon the feature? Silent. Support tickets and error rates capture only a fraction of quality regression.
You can’t distinguish regressions from improvements. If the new model is better at task A and worse at task B, complaints about B look identical to generic “the AI got worse” feedback. You don’t know what to fix.
You’re using your users as test infrastructure. They didn’t sign up for that.
The Eval Spectrum (and Where Most Teams Get It Wrong)
Evaluation approaches sit on a spectrum from “fast but flimsy” to “expensive but valid.”
LLM-as-judge is the current darling: ask a powerful model to grade another model’s outputs. Fast, scalable, cheap. The problem: it bakes in the grader model’s biases, can be gamed, and creates a circular dependency. If you use GPT-5 to grade GPT-5’s outputs, you’re measuring something like “how much does GPT-5 agree with GPT-5.” That’s not nothing, but it’s not what you think.
Human eval is the gold standard everyone tries to skip. Getting humans to evaluate outputs is expensive, slow, inconsistent across evaluators, and annoying to schedule. But it is the only thing that validates whether your system is useful to real humans.
Task-specific automated checks are where most teams should spend more time. They are not glamorous, but they are fast, deterministic, and tied to what matters in your system.
What Actually Works
1. Define Failure Before You Ship
Before changing a model or prompt, write down what bad looks like. Specifically.
Not “the output should be accurate.” That’s not a test. More like:
- Structured JSON output must parse without errors
- All citations in the response must appear verbatim in the retrieved context
- Responses must not mention competitor product names
- SQL queries must be syntactically valid and reference only tables that exist in the schema
- Sentiment classification must not flip from positive to negative more than 3% of the time on the existing test set
You can check these programmatically. No judge model required.
Eval harness: deterministic checks
```ts
// Minimal context passed to each check; extractCitations is assumed to be
// defined elsewhere in your codebase.
type EvalContext = { retrievedChunks: string[] };
type EvalResult = { passed: boolean; reason?: string };

const evals: Record<string, (output: string, context: EvalContext) => EvalResult> = {
  // JSON must parse
  validJson: (output) => {
    try {
      JSON.parse(output);
      return { passed: true };
    } catch (e) {
      return { passed: false, reason: `Invalid JSON: ${(e as Error).message}` };
    }
  },

  // No hallucinated citations — every claim must appear in context
  groundedCitations: (output, { retrievedChunks }) => {
    const claims = extractCitations(output);
    const ungrounded = claims.filter(
      (claim) => !retrievedChunks.some((chunk) => chunk.includes(claim))
    );
    return ungrounded.length === 0
      ? { passed: true }
      : { passed: false, reason: `Ungrounded claims: ${ungrounded.join(', ')}` };
  },

  // Response length sanity check — catch truncation or runaway generation
  reasonableLength: (output) => {
    const words = output.split(/\s+/).length;
    return words >= 10 && words <= 2000
      ? { passed: true }
      : { passed: false, reason: `Word count ${words} out of bounds` };
  },
};
```
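The same pattern extends to the other checks in the list above. A hedged sketch of two more entries, assuming you keep a competitor list and a known-table set somewhere in config; the names and values here are placeholders, not real ones:

```ts
// Two more checks in the same shape. The competitor list and schema tables
// below are placeholder values you would swap for your own.
const FORBIDDEN_NAMES = ['AcmeCompetitor', 'RivalCo'];
const KNOWN_TABLES = new Set(['users', 'orders', 'invoices']);

const moreEvals: typeof evals = {
  // Responses must not mention competitor product names
  noCompetitorMentions: (output) => {
    const hits = FORBIDDEN_NAMES.filter((name) =>
      output.toLowerCase().includes(name.toLowerCase())
    );
    return hits.length === 0
      ? { passed: true }
      : { passed: false, reason: `Mentions competitor(s): ${hits.join(', ')}` };
  },

  // SQL must reference only tables that exist in the schema
  // (naive regex over FROM/JOIN clauses; a real SQL parser is more robust)
  knownTablesOnly: (output) => {
    const referenced = [...output.matchAll(/\b(?:from|join)\s+([a-z_][a-z0-9_]*)/gi)]
      .map((m) => m[1].toLowerCase());
    const unknown = referenced.filter((t) => !KNOWN_TABLES.has(t));
    return unknown.length === 0
      ? { passed: true }
      : { passed: false, reason: `Unknown tables: ${unknown.join(', ')}` };
  },
};
```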
2. Build a Golden Set From Your Worst Days
Your best evaluation data is the embarrassing stuff: the outputs that made someone file a ticket, screenshot a hallucination, or quietly stop using the feature.
Every time a user reports a bad output, flags a hallucination, or you notice a failure manually, add it to your golden set: the input, the context, and the correct behavior. Keep 50-100 cases and run them on every model change.
This feels manual at first. After six months, you have a test suite no public benchmark can game, because every case came from your own failure history.
Golden case shape
```ts
interface GoldenCase {
  id: string;
  input: string;
  context: Record<string, unknown>;
  expectedBehavior: {
    mustContain?: string[];
    mustNotContain?: string[];
    structureCheck?: (output: string) => boolean;
    minSimilarityToReference?: number; // cosine similarity to a reference answer
  };
  sourceIncident?: string; // link back to the bug report or ticket
}
```
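Two pieces are worth sketching around that shape: the `runEvals` helper the comparison code below relies on, and one concrete golden case. Both are illustrative; the incident, strings, and ticket reference are invented, and the embedding-similarity check is left out to keep it short.

```ts
// Sketch of the runEvals helper assumed by the comparison code further down.
// Returns a single pass/fail; minSimilarityToReference would need an
// embedding call and is omitted here.
function runEvals(output: string, expected: GoldenCase['expectedBehavior']): boolean {
  const containsAll = (expected.mustContain ?? []).every((s) => output.includes(s));
  const containsNone = (expected.mustNotContain ?? []).every((s) => !output.includes(s));
  const structureOk = expected.structureCheck ? expected.structureCheck(output) : true;
  return containsAll && containsNone && structureOk;
}

// One golden case built from an invented incident: the model once cited a
// refund window that does not appear anywhere in the docs.
const refundHallucination: GoldenCase = {
  id: 'refund-window-hallucination',
  input: 'How long do I have to request a refund?',
  context: { retrievedChunks: ['Refunds are available within 30 days of purchase.'] },
  expectedBehavior: {
    mustContain: ['30 days'],
    mustNotContain: ['60 days', '90 days'],
    structureCheck: (output) => output.trim().length > 0,
  },
  sourceIncident: 'TICKET-1234', // hypothetical ticket reference
};
```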
3. Regression Testing, Not Just Acceptance Testing
Most teams run evals only when considering a model change. That’s acceptance testing: “is this new thing good enough?”
You also need regression testing: “did this break something that used to work?”
Run your golden set on every prompt change, not just model changes. A prompt that was working fine can silently degrade when you add a new tool, change a RAG retrieval strategy, or update your context template. You won’t know without a baseline. Tools like Langfuse attach eval scores to production traces so regression shows up in dashboards, not just in incident reports.
Eval harness: baseline vs candidate comparison
```ts
async function compareModelVersions(
  goldenCases: GoldenCase[],
  baselinePipeline: Pipeline,
  candidatePipeline: Pipeline
) {
  const results = await Promise.all(
    goldenCases.map(async (tc) => {
      const [baseline, candidate] = await Promise.all([
        baselinePipeline.run(tc.input, tc.context),
        candidatePipeline.run(tc.input, tc.context),
      ]);

      const baselinePassed = runEvals(baseline, tc.expectedBehavior);
      const candidatePassed = runEvals(candidate, tc.expectedBehavior);

      return {
        id: tc.id,
        baselinePassed,
        candidatePassed,
        regression: baselinePassed && !candidatePassed,
        improvement: !baselinePassed && candidatePassed,
      };
    })
  );

  const regressions = results.filter((r) => r.regression);
  const improvements = results.filter((r) => r.improvement);

  console.log(`Regressions: ${regressions.length} / ${goldenCases.length}`);
  console.log(`Improvements: ${improvements.length} / ${goldenCases.length}`);

  if (regressions.length > 0) {
    console.error('Blocking regressions found:');
    regressions.forEach((r) => console.error(`  - ${r.id}`));
  }

  return { regressions, improvements };
}
```
If a candidate regresses on known failures, the upgrade conversation gets wonderfully specific: which cases improved, which cases broke, and whether the trade is worth it.
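To make the comparison actually block a bad change, wire it into CI. A sketch under assumptions: `goldenSet` and `buildPipeline` are stand-ins for however you load cases and construct pipelines in your own codebase.

```ts
// Sketch of a CI gate: compare baseline vs candidate over the golden set and
// fail the build if anything that used to pass now fails.
import { goldenSet } from './golden-set';    // hypothetical module exporting GoldenCase[]
import { buildPipeline } from './pipelines'; // hypothetical pipeline factory

async function ciGate(): Promise<void> {
  const baseline = buildPipeline({ model: 'current-production-model' });
  const candidate = buildPipeline({ model: 'candidate-model' });

  const { regressions } = await compareModelVersions(goldenSet, baseline, candidate);
  if (regressions.length > 0) {
    process.exit(1); // non-zero exit blocks the deploy
  }
}

ciGate();
```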
4. Use LLM-as-Judge for Exactly One Thing
LLM-as-judge is useful for open-ended outputs where there is no deterministic right answer: “is this response helpful?”, “does this summary preserve the key points?”, “is this explanation right for a beginner?”
Use it there. Don’t use it for anything a deterministic check can verify. When you do use it, make the grading rubric explicit:
Eval harness: rubric-based judge
```ts
async function judgeHelpfulness(
  userQuery: string,
  modelResponse: string
): Promise<{ score: number; reasoning: string }> {
  const judgePrompt = `You are evaluating a customer support response.

User question: ${userQuery}
Response: ${modelResponse}

Rate the response on a scale of 1-5:
5 = Directly answers the question with accurate, actionable information
4 = Answers the question but could be more specific or actionable
3 = Partially addresses the question; key information is missing
2 = Tangentially related but doesn't answer the question
1 = Off-topic, factually wrong, or harmful

Respond with JSON: {"score": <number>, "reasoning": "<one sentence>"}`;

  const result = await judgeModel.generate(judgePrompt);
  return JSON.parse(result);
}
```
An explicit rubric reduces evaluator variance, gives you interpretable output, and makes it easier to audit when the judge is wrong. Libraries like Autoevals and Braintrust ship prebuilt rubrics for common tasks — worth stealing before writing your own from scratch.
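Judge scores also wobble from run to run, so treat them as a distribution rather than a verdict. A sketch that runs the rubric judge over the golden set and flags low scorers for human review; the pipeline wiring and the cut-off of 3 are assumptions, not recommendations.

```ts
// Run the rubric judge over every golden case and flag low scorers for
// human review. Pipeline wiring and the score cut-off are assumptions.
async function judgeGoldenSet(cases: GoldenCase[], pipeline: Pipeline) {
  const scored = await Promise.all(
    cases.map(async (tc) => {
      const response = await pipeline.run(tc.input, tc.context);
      const { score, reasoning } = await judgeHelpfulness(tc.input, response);
      return { id: tc.id, score, reasoning };
    })
  );

  const mean = scored.reduce((sum, s) => sum + s.score, 0) / scored.length;
  const lowScorers = scored.filter((s) => s.score <= 3); // arbitrary cut-off

  console.log(`Mean helpfulness: ${mean.toFixed(2)} across ${scored.length} cases`);
  lowScorers.forEach((s) => console.log(`  review ${s.id}: ${s.score} (${s.reasoning})`));

  return { mean, lowScorers };
}
```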
Tools Worth Knowing
You don’t have to build all of this from scratch. Several tools have made serious progress on the eval infrastructure problem:
Braintrust — Full eval platform with experiment tracking, dataset management, and scoring functions. Organizes eval runs by prompt, model, and deployment so you can diff quality over time, not just across releases. Pairs well with their open-source Autoevals library, which ships prebuilt model-graded scoring functions for common tasks (factual accuracy, helpfulness, toxicity, semantic similarity).
Langfuse — Open-source LLM observability that sits between your app and your models. Traces every call, attaches eval scores (human or automated) to individual spans, and surfaces quality trends over production traffic. Good choice if you want observability and evals in the same tool rather than a separate eval harness.
Evalite — TypeScript-native eval framework by Matt Pocock. Low ceremony: define a task, define a scorer, run it in your existing test setup. Targets teams who want evals that feel like unit tests rather than a separate ML experiment platform.
promptfoo — CLI-first eval runner focused on prompt comparison and red-teaming. Easy to configure via YAML, integrates with most model providers, and has built-in support for detecting prompt injection and other adversarial inputs.
deepeval — Python eval framework with a large library of built-in metrics (G-Eval, RAG faithfulness, answer relevancy, hallucination detection). Useful for RAG pipelines where you want specific grading for retrieval quality, not just generation quality.
The right tool depends on your stack and where you’re starting from. What matters more than the choice of framework is the discipline of running evals at all — consistently, on every significant change.
The Uncomfortable Part
Most teams skip this because it asks an irritating question early: what would “good” look like here?
That is genuinely hard for a new AI feature. It is also non-optional if you care about reliability. Teams that ship trustworthy AI are doing the same thing they’d do for any critical code path: define expected behavior, test it, and run those tests continuously.
The benchmarks are not lying. They are answering someone else’s question. Stop reading them as product roadmaps and start writing tests that match your system.
Your users will notice before your dashboards do. Build the test suite first.