Engineering Efficiency
Vikesh · February 18, 2025 · 13 min read

The Cost of Chasing Perfect Evals

Watch this happen: a team starts with "we need to measure quality." Totally reasonable. Then someone says "we should measure by use case." Still good. Then "we should track fairness across demographics." Okay. Then "actually, we need custom embeddings because our domain is too specific." Then "wait, shouldn't we account for latency too?" Then "what if we build an LLM-as-judge system that's domain-aware?" Suddenly it's month fourteen and nothing's shipped and the team's arguing about whether they should add Bayesian confidence intervals to their metrics.

Meanwhile, a company down the street deployed "garbage" evals in month two, shipped three model iterations, and is already on its second generation. The "perfect" team is still building.

This is what we mean by the perfection trap. And it costs companies millions.

How Good Intentions Turn Into Endless Work

It starts so innocently. "We should measure hallucination." Yes. "We should measure it across different model temperatures." Sure. "We should measure it across user personas to catch fairness issues." Okay, getting bigger. "We need real-time monitoring on every inference." Now you're building infrastructure. "We should build domain-specific embeddings because our use case is special." Now you're hiring. "Oh, and we need Bayesian confidence intervals so we're statistically rigorous." It's month twelve and nothing's shipped.

Here's the thing: each of these decisions, in isolation, is defensible. They're not bad ideas. Your team isn't being lazy or incompetent. They're being thorough. But thoroughness is the enemy of shipping. And in evals, shipping is what matters.

We call this scope creep, but it's actually worse than normal scope creep. Every additional metric genuinely does make your evals better, so your team can always justify one more feature. One more validation step. One more refinement. The work is never "obviously unnecessary." It's just... always slightly more work.

The Math of Diminishing Returns

The Pareto principle absolutely applies here. You're probably getting 80% of your eval's actual value from 20% of your metrics. The other 20% of marginal value? That eats 80% of your engineering time.

Look at how this actually breaks down:

Eval Component                              Value Generated   Engineering Effort   Ratio (Value/Effort)
Basic quality metrics (BLEU, ROUGE, F1)     40%               5%                   8.0x
Domain-specific heuristics                  25%               15%                  1.67x
LLM-as-judge evaluations                    20%               25%                  0.8x
Custom embeddings + fairness metrics        10%               40%                  0.25x
Real-time inference evals                   5%                15%                  0.33x

Look at those ratios. You get 40% of your value from the easiest 5% of the work. But chasing the last 10% of value costs 32 times as much per unit of value as that first tranche (0.25x versus 8.0x). You could spend $100K and get 65% of the value. Or you could spend $400K and get 75%. Teams always pick the expensive option. Why? Because nobody's watching the clock.
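
If you'd rather compute the cliff than eyeball it, here's a throwaway Python sketch that walks the table cumulatively. The percentages are the table's own illustrative numbers, not measurements from any particular project:

    # Walk the eval-component table cumulatively and print each row's ROI.
    # Numbers are the illustrative percentages from the table above.
    components = [
        ("Basic quality metrics (BLEU, ROUGE, F1)", 40, 5),
        ("Domain-specific heuristics",              25, 15),
        ("LLM-as-judge evaluations",                20, 25),
        ("Custom embeddings + fairness metrics",    10, 40),
        ("Real-time inference evals",                5, 15),
    ]

    cum_value = cum_effort = 0
    for name, value, effort in components:
        cum_value += value
        cum_effort += effort
        print(f"{name:42s} {value / effort:5.2f}x ROI | "
              f"cumulative: {cum_value}% of value for {cum_effort}% of effort")

    # The first two rows buy 65% of the value for 20% of the effort.
    # The last two buy 15% of the value for 55% of it.

The cumulative column is the whole argument: everything past the second row is the danger zone in the next section.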

The Curve That Kills Projects

Here's the uncomfortable truth: perfect evals don't exist. The closer you get, the slower everything moves. This is the diminishing returns curve, and it explains why eval projects eat themselves:

[Figure: the diminishing-returns curve. X axis: engineering effort and time. Y axis: eval value and quality. A high-ROI "ship here" region runs up to the 80% value target; beyond it lies the danger zone of diminishing returns. Milestones on the curve: week 4 = 40%, week 12 = 70%, week 20 = 80%, week 36 = 85%, week 52 = 88%.]

The curve tells the story. Going from 70% to 80% takes almost as much time as going from 0% to 70%. Pushing from 80% to 90% costs exponentially more. And 95%? That's not realistically achievable. You're just burning time.

Teams don't plan to end up in the danger zone. They plan to ship at 70%, which is smart. But then they see they're "so close" to 80%, so they add one more metric. Then another. Then they're at 85% with nothing to show for five extra months of work, and they've already burned the goodwill they needed to actually ship.

What Actually Works: The Ship-First Approach

Smart teams do this instead. They define "good enough" and move on.

Phase 1: Get Something Out There
Can it detect when your model gets worse? Can you compare v1 to v2? Can you catch your biggest failure mode? If all three are yes, ship it. Seriously. (There's a minimal sketch of what that can look like after the phases below.)
Phase 2 (Weeks 4-8 After Launch)
Add a couple secondary metrics. But only based on what you actually saw fail in production. Not what you predicted would fail. What did. Refine the dashboard a little.
Phase 3 (Weeks 8-16 After Launch)
Now add domain-specific stuff. Automate the boring checks. Set up alerts. You've got real data now so you know what matters.
Phase 4+ (Month 4+)
Only build new eval stuff if it directly ties to something business-relevant. If you don't know why you're building it, don't. Integrate into your deployment pipeline.
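
To make Phase 1 concrete, here's a deliberately minimal Python sketch of that kind of harness. The model-call functions, scoring rule, and example cases are placeholders invented for illustration, not anyone's real API; the only point is that it answers the three Phase 1 questions and nothing more.

    # Minimal Phase 1 eval harness -- a sketch, not a framework.
    # It answers three questions: can we measure quality, can we compare v1 to v2,
    # and does our single biggest known failure mode show up?

    def generate_v1(prompt: str) -> str:
        """Placeholder: call your current production model here."""
        return "stub answer"

    def generate_v2(prompt: str) -> str:
        """Placeholder: call the candidate model here."""
        return "stub answer"

    def quality(output: str, expected: str) -> float:
        """Crude scoring: exact match. Swap in F1, ROUGE, or a judge model later."""
        return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

    def worst_failure(output: str) -> bool:
        """Detect your single biggest known failure mode, e.g. bogus refusals."""
        return output.lower().startswith("i can't help")

    # A few dozen hand-picked cases is enough to start; grow the set from production.
    EVAL_SET = [
        {"prompt": "What is the refund window?", "expected": "30 days"},
        {"prompt": "Summarize this ticket in one line.", "expected": "stub answer"},
    ]

    def run_eval(generate):
        outputs = [generate(case["prompt"]) for case in EVAL_SET]
        return {
            "quality": sum(quality(o, c["expected"]) for o, c in zip(outputs, EVAL_SET)) / len(EVAL_SET),
            "failure_rate": sum(worst_failure(o) for o in outputs) / len(EVAL_SET),
        }

    if __name__ == "__main__":
        v1, v2 = run_eval(generate_v1), run_eval(generate_v2)
        print("v1:", v1)
        print("v2:", v2)
        if v2["quality"] < v1["quality"] or v2["failure_rate"] > v1["failure_rate"]:
            raise SystemExit("v2 regressed -- do not ship")

The regression gate is a plain comparison, not a confidence interval. The statistical rigor can come later, if it ever earns its place.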

Notice what's not here: "achieve perfection." Because you're never going to. You ship at 60-70%, you learn what matters, and you build from data instead of imagination.

The Cost of Waiting: Two Teams Side By Side

The "Let's Do This Right" Team

Twelve months in, their product team is stuck. They can't ship improvements without evals to validate them, so they're operating completely blind on how good their model actually is.

The "Ship Now, Iterate Later" Team

One month to working evals versus twelve. By the time the "do it right" team launches its eval system, this team has already shipped several rounds of model improvements.

The Dollar Cost: Why Perfection Destroys Companies

Let's just do the math here.

Perfect evals: 12 months to get 85% quality
Fast evals: 4 weeks to get 60% quality, then iterate

Assumptions:
- Each month of eval delay = a month of delayed model improvements
- Delayed iteration = $50K+ lost revenue/month
- 2% model improvement = $200K new revenue

Perfect approach cost:
- 2 engineers × 12 months: $400K in salary
- Lost revenue from 12 months of no improvements: $600K
- Total sunk: $1M+ for a system that isn't even live yet

Fast approach cost:
- 2 engineers × 4 weeks: ~$30K
- Evals live week 4, improvements ship week 8
- By month 12, you've shipped 6 iterations
- Total: ~$30K to be months ahead of the competition

The perfect team spent more than 30 times the money and has nothing live to show for it. The fast team spent $30K and shipped a working system months earlier. And by month 12, they're so far ahead that the "perfect" team starts out playing catch-up.

Good Enough Isn't Garbage. It's Just Honest.

We're not saying build garbage. We're saying don't optimize for precision when you should optimize for speed. "Good enough" means:

A 70%-rigorous eval that runs every day beats a 95%-rigorous eval that ships next year. By a lot.

The Trap Disguises Itself as Responsibility

Here's why this trap is so sneaky. Your ML engineer says "we should test fairness across demographics." That's right. Your PM says "we need to handle all our use cases." Also right. Your CEO asks "what about edge cases?" Legitimate. Each individual decision makes sense. But together they add six months to your timeline.

At some point you've got to ship. An 80% eval that's live is worth infinitely more than a 95% eval that doesn't exist.

What actually works: ship the minimum viable thing. Run it in production. See what breaks. Build iteration 2 based on reality instead of imagination. Your real failure modes are never what you predicted. So let production data guide you instead of theorizing.
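
One lightweight way to let production do the guiding, sketched in Python: assume you already log responses that users or reviewers flag, with a rough failure tag attached (the log shape and tag names below are made up for illustration), and just count.

    # Tally observed failure modes from production so the next eval metric is
    # chosen by observed frequency, not by prediction. Log format and tags are
    # hypothetical placeholders.
    from collections import Counter

    flagged_responses = [
        {"id": 101, "tag": "hallucinated_citation"},
        {"id": 102, "tag": "wrong_tone"},
        {"id": 103, "tag": "hallucinated_citation"},
        {"id": 104, "tag": "truncated_answer"},
        {"id": 105, "tag": "hallucinated_citation"},
    ]

    counts = Counter(r["tag"] for r in flagged_responses)
    for tag, n in counts.most_common():
        print(f"{tag:25s} {n}")

    # Build the next metric for the top row, not for the failure you predicted
    # in a design doc six months ago.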

The One Question That Saves Months

When your team pitches a new eval metric, ask: "What decision breaks if we don't have this?" If the answer is "honestly, not sure" or "we could infer it from other metrics," don't build it. Only build an eval when not having it means you'd ship something broken.

That question is the only discipline you need. And it saves months. Ship evals while your competitors are still debating whether they need fairness metrics. Iterate faster. Win.

Ready to Ship Evals Instead of Building Them?

We help companies define their Minimum Viable Eval and deploy it in weeks, not months. Skip the perfectionism trap and start iterating on what matters.

Let's Talk Pragmatism