Watch this happen: a team starts with "we need to measure quality." Totally reasonable. Then someone says "we should measure by use case." Still good. Then "we should track fairness across demographics." Okay. Then "actually, we need custom embeddings because our domain is too specific." Then "wait, shouldn't we account for latency too?" Then "what if we build an LLM-as-judge system that's domain-aware?" Suddenly it's month fourteen, nothing's shipped, and the team's arguing about whether to add Bayesian confidence intervals to its metrics.
Meanwhile, a company down the street deployed rough-and-ready evals in month two, shipped three model iterations, and is already on its second generation of evals. The "perfect" team is still building.
This is what we mean by the perfection trap. And it costs companies millions.
How Good Intentions Turn Into Endless Work
It starts so innocently. "We should measure hallucination." Yes. "We should measure it across different model temperatures." Sure. "We should measure it across user personas to catch fairness issues." Okay, getting bigger. "We need real-time monitoring on every inference." Now you're building infrastructure. "We should build domain-specific embeddings because our use case is special." Now you're hiring. "Oh, and we need Bayesian confidence intervals so we're statistically rigorous." It's month twelve and nothing's shipped.
Here's the thing: each of these decisions, in isolation, is defensible. They're not bad ideas. Your team isn't being lazy or incompetent. They're being thorough. But thoroughness is the enemy of shipping. And in evals, shipping is what matters.
We call this scope creep, but it's actually worse than normal scope creep. Every additional metric genuinely does make your evals better, so your team can always justify one more feature. One more validation step. One more refinement. The work is never "obviously unnecessary." It's just... always slightly more work.
The Math of Diminishing Returns
Pareto's principle absolutely applies here. You're probably getting 80% of your eval's actual value from 20% of your metrics. The remaining 20% of value? That eats the other 80% of your engineering time.
Here's how it typically breaks down:
| Eval Component | Value Generated (% of total) | Engineering Effort (% of total) | Value/Effort Ratio |
|---|---|---|---|
| Basic quality metrics (BLEU, ROUGE, F1) | 40% | 5% | 8.0x |
| Domain-specific heuristics | 25% | 15% | 1.67x |
| LLM-as-judge evaluations | 20% | 25% | 0.8x |
| Custom embeddings + fairness metrics | 10% | 40% | 0.25x |
| Real-time inference evals | 5% | 15% | 0.33x |
Look at those ratios. You get 40% of the value from the easiest 5% of the work, while that last custom-embeddings slice is 32 times less cost-effective (0.25x against 8.0x). Put dollars on it: if the full build runs $500K, the first two rows cost about $100K and capture 65% of the value; the remaining 35% of value eats the other $400K. Teams pick the expensive option anyway. Why? Because nobody's watching the clock.
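If you want to sanity-check that arithmetic, here's a quick sketch that walks the table's rows and accumulates value against effort. The percentages are the illustrative numbers from the table above, nothing more.

```python
# Back-of-envelope walk through the table above: cumulative value captured
# versus cumulative engineering effort spent. The percentages are the
# illustrative figures from the table, not measurements.
components = [
    ("Basic quality metrics",        40, 5),
    ("Domain-specific heuristics",   25, 15),
    ("LLM-as-judge evaluations",     20, 25),
    ("Custom embeddings + fairness", 10, 40),
    ("Real-time inference evals",     5, 15),
]

cum_value = cum_effort = 0
for name, value, effort in components:
    cum_value += value
    cum_effort += effort
    print(f"{name:<30} -> {cum_value:>3}% of value for {cum_effort:>3}% of effort")

# Prints the curve in miniature: 65% of the value arrives by 20% of the
# effort, while the last 15% of value consumes the final 55% of the effort.
```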
The Curve That Kills Projects
Perfect evals don't exist. The closer you get to one, the slower everything moves. That's the diminishing-returns curve, and it explains why eval projects eat themselves.
The shape is brutal: going from 70% to 80% rigor takes almost as much time as going from 0% to 70%. Getting from 80% to 90% costs more than everything that came before it. And 95%? You never really get there. You're just burning time.
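There's no chart here, so picture it with a toy model instead. The cost function below is purely illustrative, chosen because it behaves roughly the way the paragraph describes; it isn't derived from any real project data.

```python
# Toy model of the diminishing-returns curve: assume the total effort to reach
# a rigor level r grows like the odds r / (1 - r). Purely illustrative -- the
# shape is the point, not the constants.
def effort(rigor: float) -> float:
    return rigor / (1.0 - rigor)

previous = 0.0
for level in (0.70, 0.80, 0.90, 0.95):
    step = effort(level) - previous
    previous = effort(level)
    print(f"reaching {level:.0%} rigor: +{step:.1f} units of effort (cumulative {effort(level):.1f})")

# 0 -> 70% costs ~2.3 units; 70 -> 80% adds ~1.7 more; 80 -> 90% adds ~5.0;
# 90 -> 95% adds another ~10.0. The last two steps each cost more than the
# entire build that came before them.
```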
Teams don't plan to end up on the steep end of this curve. They plan to ship at 70%, which is smart. But then they see they're "so close" to 80%, so they add one more metric. Then another. Then they're at 85% with nothing to show for five extra months of work, and they've already burned the goodwill they needed to actually ship.
What Actually Works: The Ship-First Approach
Smart teams do this instead. They define "good enough" and move on.
Start with three questions. Can it detect when your model gets worse? Can you compare v1 to v2? Does it catch your biggest failure mode? If all three are yes, ship it. Seriously.
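For a sense of what that bar looks like in code, here's a minimal sketch. Everything in it is a placeholder: `call_model` stands in for your inference call, the eval set is two toy cases, and the substring checks stand in for your real quality score and top failure mode.

```python
# Minimal "good enough" eval: a small fixed prompt set with expected answers,
# a crude quality score, a regression check between two model versions, and a
# count of the single failure mode you care about most. call_model() is a stub.

EVAL_SET = [
    {"prompt": "What is 2 + 2?", "expected": "4"},
    {"prompt": "What is the capital of France?", "expected": "Paris"},
]

def call_model(version: str, prompt: str) -> str:
    raise NotImplementedError("swap in your real inference call")

def evaluate(version: str) -> dict:
    hits = failures = 0
    for case in EVAL_SET:
        output = call_model(version, case["prompt"])
        hits += case["expected"].lower() in output.lower()   # crude quality check
        failures += "i'm not sure" in output.lower()         # stand-in for your top failure mode
    n = len(EVAL_SET)
    return {"accuracy": hits / n, "failure_rate": failures / n}

# Ship gate: compare two versions; block the release if quality drops or the
# known failure mode climbs.
# v1, v2 = evaluate("model-v1"), evaluate("model-v2")
# ship = v2["accuracy"] >= v1["accuracy"] - 0.02 and v2["failure_rate"] <= 0.05
```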
Then add a couple of secondary metrics, but only based on what you actually saw fail in production. Not what you predicted would fail. What actually did. Refine the dashboard a little.
Next, add the domain-specific checks. Automate the boring ones. Set up alerts. You've got real data now, so you know what matters.
After that, only build new eval machinery if it ties directly to something business-relevant. If you can't say why you're building it, don't. Then integrate the whole thing into your deployment pipeline.
Notice what's not here: "achieve perfection." Because you're never going to. You ship at 60-70%, you learn what matters, and you build from data instead of imagination.
The Cost of Waiting: Two Teams Side By Side
The "Let's Do This Right" Team
- Months 1-2: Design the framework. Long debates about what metrics matter. Decide on 12 core metrics.
- Months 2-4: Build infrastructure. Realize they need a database. Then a pipeline. Then real-time aggregation.
- Months 4-6: Code the metrics. First four work. Eight more expose problems. Redesign.
- Months 6-9: Test and validate. Edge cases keep appearing. Domain experts get involved.
- Months 9-12: "Almost there." Someone suggests fairness metrics. Someone else says they need user-specific context. More logging.
- Month 12+: Still not shipped. Leadership's mad. Team's burned out. Nothing's been delivered.
Their product team? Stuck. Can't ship improvements without evals. Operating completely blind on how good their model actually is.
The "Ship Now, Iterate Later" Team
- Weeks 1-2: What's the main success metric? What's the biggest way this breaks? Answer both questions.
- Weeks 2-4: Implement in Python. GPT-4 as judge. Standard metrics library. Done, in roughly 200 lines of code (a sketch of the judge loop follows below).
- Week 4: Deploy it. Run evals. Get baseline numbers.
- Weeks 5-8: Look at actual failure data. What do users actually complain about? Fix that.
- Week 8: Ship the better model. Measure the impact.
- Weeks 9-12: Build iteration 2 of the evals. Use real production data to decide what to add.
- Month 4+: Evals are working. You've shipped three model iterations. You're shipping 2x faster than everyone else.
One month to working evals versus twelve. By the time the "right" team launches anything, this team has already shipped three model iterations.
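For a sense of scale, the judge loop at the core of that "GPT-4 as judge" eval can look something like this. A hedged sketch, assuming the `openai` Python client (v1.x); the model name, rubric wording, and score parsing are placeholders to adapt, not a prescription.

```python
# Sketch of an LLM-as-judge scoring loop, assuming the openai Python client
# (v1.x). Rubric wording, model name, and score parsing are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """Rate the response to the user's request on a 1-5 scale
(5 = fully correct and helpful, 1 = wrong or unhelpful). Reply with only the number.

Request: {prompt}
Response: {response}"""

def judge(prompt: str, response: str) -> int:
    result = client.chat.completions.create(
        model="gpt-4o",  # any capable judge model works here
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(prompt=prompt, response=response),
        }],
    )
    return int(result.choices[0].message.content.strip())

def run_eval(cases: list[dict]) -> float:
    """cases: [{'prompt': ..., 'response': ...}] -> mean judge score (1-5)."""
    scores = [judge(c["prompt"], c["response"]) for c in cases]
    return sum(scores) / len(scores)
```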
The Dollar Cost: Why Perfection Destroys Companies
Let's just do the math here.
The fast team's eval build was roughly one engineer-month of work: call it $30K fully loaded. The perfect team burned more than a year of a dedicated team on the same problem: call it $900K, or 30x more, to deliver nothing. The fast team shipped a working product months earlier, and by month 12 it's so far ahead that the perfect team is left playing catch-up before it has even launched.
Good Enough Isn't Garbage. It's Just Honest.
We're not saying build garbage. We're saying don't optimize for precision when you should optimize for speed. "Good enough" means:
- Focused: You're measuring the 2-3 things that'll actually move the needle. Not everything possible.
- Repeatable: Run it twice, get basically the same answer. Not perfect precision, but consistent (a quick way to check this is sketched just below).
- Useful: When it fails, you know what to do. If it doesn't tell you anything actionable, delete it.
- Running: This matters most. The eval has to actually exist. A perfect eval that never ships is worth less than nothing.
A 70%-rigorous eval that runs every day beats a 95%-rigorous eval that ships next year. By a lot.
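The "repeatable" bar is the easiest one to check mechanically: run the same eval a few times on the same inputs and look at the spread. A small sketch, assuming a `run_eval()` like the judge-loop example earlier that returns one aggregate score; the run count and tolerance are arbitrary placeholders.

```python
# Quick repeatability check: score the same cases several times and make sure
# the spread is small relative to the differences you actually care about.
# Assumes run_eval(cases) returns a single aggregate score, as sketched above.
import statistics

def is_repeatable(run_eval, cases, runs: int = 3, tolerance: float = 0.05) -> bool:
    scores = [run_eval(cases) for _ in range(runs)]
    spread = max(scores) - min(scores)
    print(f"scores={scores} mean={statistics.mean(scores):.3f} spread={spread:.3f}")
    return spread <= tolerance  # noisier than the effects you care about? fix the eval first
```

If the spread is bigger than the quality differences you're trying to detect, fix the eval's noise before adding anything else.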
The Trap Disguises Itself as Responsibility
Here's why this trap is so sneaky. Your ML engineer says "we should test fairness across demographics." That's right. Your PM says "we need to handle all our use cases." Also right. Your CEO asks "what about edge cases?" Legitimate. Each individual decision makes sense. But together they add six months to your timeline.
At some point you've got to ship. An 80% eval that's live is worth infinitely more than a 95% eval that doesn't exist.
What actually works: ship the minimum viable thing. Run it in production. See what breaks. Build iteration 2 based on reality instead of imagination. Your real failure modes are never what you predicted. So let production data guide you instead of theorizing.
The One Question That Saves Months
When your team pitches a new eval metric, ask: "What decision breaks if we don't have this?" If the answer is "honestly, not sure" or "we could infer it from other metrics," don't build it. Only build evals where not having them means you ship something broken.
That question is the only discipline you need. And it saves months. Ship evals while the company next door is still debating whether it needs fairness metrics. Iterate faster. Win.