Product Strategy
Vikesh · February 18, 2025 · 13 min read

System-Level Evals and Revenue Protection

Here's the thing most teams miss: they're testing their models in isolation. Does this LLM write good summaries? Does this classifier categorize stuff correctly? Does this recommendation engine work? All good questions. But they're the wrong questions because they're looking at pieces, not the whole thing.

The real disasters happen when those pieces talk to each other. This is where evals fail silently, and where revenue quietly disappears. Two components that both pass tests can absolutely tank the system when they're connected.

Components pass. System fails. And that's the gap where all the money drains out.

The Gap: What Component Tests Miss

Take a recommendation system. Let's say you have an e-commerce site and you're using AI to suggest products:

Component 1: Embeddings (LLM-based)
- Test: How good are the embeddings?
- Result: ✓ PASS (they're accurate)

Component 2: Preference Classifier (tuned model)
- Test: Does it classify correctly?
- Result: ✓ PASS (93% accuracy)

Component 3: Ranker (rule-based)
- Test: Does it improve click rates?
- Result: ✓ PASS (5% CTR boost in testing)

Component 4: Business Logic (rules)
- Test: Do the constraints work?
- Result: ✓ PASS (all rules enforced)

End-to-End System Test: None
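To make the gap concrete, here's a minimal sketch of what that suite tends to look like as pytest-style checks. Every function name and threshold below is a stand-in, not a real API; the point is that each assertion is scoped to a single piece, and nothing exercises the pipeline end to end.

```python
# A minimal sketch of a component-only eval suite (pytest style).
# The eval helpers below are placeholders; in a real repo they would
# load golden data and score each component on its own.

def evaluate_embeddings() -> float:
    return 0.88   # placeholder: semantic-similarity score on a labeled set

def classifier_accuracy() -> float:
    return 0.93   # placeholder: held-out accuracy

def ranker_ctr_lift() -> float:
    return 0.05   # placeholder: CTR lift vs. the baseline ranker

def business_rules_hold() -> bool:
    return True   # placeholder: constraint checks on synthetic carts


def test_embeddings_quality():
    assert evaluate_embeddings() >= 0.85    # Component 1: passes

def test_preference_classifier():
    assert classifier_accuracy() >= 0.93    # Component 2: passes

def test_ranker():
    assert ranker_ctr_lift() >= 0.05        # Component 3: passes

def test_business_logic():
    assert business_rules_hold()            # Component 4: passes

# Conspicuously missing: a test that runs real sessions through
# embeddings -> classifier -> ranker -> business logic and asserts
# on revenue per user, conversion, or churn.
```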

All of it passes. So you ship. And here's what actually happens: revenue per user doesn't budge, churn goes up, and the dashboards all look fine. The system is technically working. But it's broken.

Why Component Tests Are Incomplete

Component tests optimize locally. System tests optimize globally. They're measuring different things:

| Dimension | Component Tests | System Tests |
|---|---|---|
| What's being tested | Does this piece work? | Does the whole thing create value? |
| What gets measured | Technical metrics (accuracy, precision, recall) | Business metrics (revenue, churn, satisfaction) |
| What data | Clean, handpicked test data | Real, messy user data |
| What breaks | Bugs in individual pieces | Failures in how pieces interact |
| Cost | $20K per piece | $80K-150K for the whole system |

The trap: teams build component tests and call it done. They assume that if all pieces pass, the whole thing works. That's rarely true when you've got multiple ML layers talking to each other.

Building System-Level Tests That Protect Revenue

Here's the right architecture:

End-to-End Eval Architecture

Component layer (each piece has its own eval):
- Embedding model: semantic quality eval
- Classifier: accuracy eval (93%)
- Ranker: CTR improvement eval

Integration gaps (not measured):
- Embedding → Classifier: 5% mismatch drift
- Classifier → Ranker: rank inflation in the top 10%

System layer:
- End-to-end system eval on real data: user satisfaction, CTR, conversion, revenue per user
- Test setup: 10K real user sessions, measured on business outcomes

Failures caught at the system level only:
- Failure mode A: high-ranking recommendations are technically good but contextually bad (already owned by the user, out of stock). Business impact: -3% AOV
- Failure mode B: recommendations don't account for seasonal preferences, so winter recommendations show up in summer. Business impact: -12% retention

Revenue impact:
- Without system evals: failures discovered by customers, six weeks to fix, $300K+ in lost revenue (3% AOV × 12 weeks)
- With system evals: failures caught before shipping, one to two weeks to fix, roughly $250K in failures prevented

Look at that breakdown. Component tests measure the wrong thing; they can all pass while the system tanks. System-level tests measure what counts: real business outcomes on real data.
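For contrast, here's a rough sketch of what a system-level eval can look like in Python: it replays logged sessions through the whole pipeline and fails on business outcomes. The `run_pipeline(session)` entry point and the `SessionResult` fields are assumptions about your stack, not a prescribed API.

```python
# Sketch of a system-level eval: replay logged user sessions through the
# full pipeline and score business outcomes, not component metrics.
# `run_pipeline` and the SessionResult fields are assumptions.

from dataclasses import dataclass
from statistics import mean


@dataclass
class SessionResult:
    clicked: bool
    converted: bool
    order_value: float  # 0.0 if the session ended without a purchase


def run_system_eval(sessions, run_pipeline, baseline, tolerance=0.02):
    """Replay every logged session and compare business metrics
    against the current production baseline."""
    results = [run_pipeline(s) for s in sessions]

    metrics = {
        "ctr": mean(r.clicked for r in results),
        "conversion": mean(r.converted for r in results),
        "revenue_per_session": mean(r.order_value for r in results),
    }

    # The eval fails on business regressions, not on accuracy drops.
    failures = {
        name: (value, baseline[name])
        for name, value in metrics.items()
        if value < baseline[name] * (1 - tolerance)
    }
    return metrics, failures
```

The detail that matters is the failure condition: it is written in terms of conversion and revenue, with a small tolerance for noise, rather than in any component metric.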

Three Ways System Tests Catch What Component Tests Miss

1. Pieces Don't Play Together Well

Two things that both work individually can completely break when they're connected. Picture: a classifier that's accurate but way too confident, plus a ranker that trusts classifier scores completely. Result: your top recommendations are consistently wrong for specific groups of customers.

Component tests never catch this. The tests are separate. You only see it when everything's wired together.
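One way to catch this earlier is an explicit check at the boundary between the two components. The sketch below measures calibration error on the classifier's probabilities, since a ranker that multiplies raw scores will amplify any overconfidence; the bin count and threshold are illustrative, not prescriptive.

```python
# Sketch of an integration check at the classifier -> ranker boundary.
# Accuracy can stay high while predicted probabilities drift out of
# calibration, and a ranker that trusts raw scores amplifies that error.

import numpy as np


def expected_calibration_error(probs, labels, n_bins=10):
    """Average gap between predicted confidence and observed outcome rate."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    bin_idx = np.minimum((probs * n_bins).astype(int), n_bins - 1)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            gap = abs(probs[mask].mean() - labels[mask].mean())
            ece += gap * mask.mean()  # weight by the bin's share of traffic
    return ece


def test_classifier_ranker_boundary(probs, labels):
    # The ranker trusts these scores verbatim, so calibration drift here
    # translates directly into mis-ranked recommendations.
    assert expected_calibration_error(probs, labels) < 0.05
```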

2. Real Data Isn't Like Test Data

A system might work great on average but fail when it hits real life. Example: recommendations work fine for popular products but choke on obscure ones. Or they crush it in summer but fail in winter.

Component tests run on clean, balanced data. Reality is skewed, seasonal, messy. System tests measure performance on actual user behavior.
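A practical guard is to never report a single aggregate number: run the same system eval per slice (season, product popularity, user segment) and fail on the worst slice. A sketch, with the slice key, result fields, and floor all as assumptions:

```python
# Sketch: slice the system eval instead of reporting one average.
# An aggregate number can look healthy while an entire slice
# (long-tail products, winter sessions) quietly fails.

from collections import defaultdict
from statistics import mean


def eval_by_slice(sessions, run_pipeline, slice_key):
    """Group sessions by a slice key and compute conversion per slice."""
    buckets = defaultdict(list)
    for session in sessions:
        result = run_pipeline(session)
        buckets[slice_key(session)].append(result.converted)
    return {name: mean(outcomes) for name, outcomes in buckets.items()}


def assert_no_weak_slice(slice_metrics, floor=0.02):
    # Fail on the worst slice, not on the average.
    worst = min(slice_metrics, key=slice_metrics.get)
    assert slice_metrics[worst] >= floor, (
        f"slice '{worst}' converts at {slice_metrics[worst]:.1%}"
    )
```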

3. You're Optimizing for the Wrong Thing

The system hits all your metrics. CTR is up. But users aren't staying. The ranker optimizes for clicks, but people who click don't buy, so they leave. You've optimized the system perfectly at the component level. But you're optimizing the wrong goal.

This is the sneaky one. Everything's working as designed. It's just designed badly.
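A simple counter is to score clicks and purchases side by side for the same recommendations, so the eval itself flags when the optimization target and the business outcome drift apart. A sketch using a hypothetical logging schema:

```python
# Sketch: track CTR and downstream conversion for the same recommendations.
# A ranker tuned purely for clicks can raise CTR while revenue falls;
# keeping both numbers in one eval makes the divergence visible.

from statistics import mean


def click_vs_purchase_report(impressions):
    """Each impression is a dict with 'clicked', 'purchased', 'order_value'."""
    return {
        "ctr": mean(i["clicked"] for i in impressions),
        "conversion": mean(i["purchased"] for i in impressions),
        "revenue_per_impression": sum(i["order_value"] for i in impressions)
        / len(impressions),
    }


def test_objective_alignment(candidate, baseline):
    # A CTR win only counts if conversion and revenue hold up too.
    assert candidate["ctr"] >= baseline["ctr"]
    assert candidate["conversion"] >= baseline["conversion"] * 0.98
    assert (
        candidate["revenue_per_impression"]
        >= baseline["revenue_per_impression"] * 0.98
    )
```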

How Much Does This Cost? The Real Numbers

Want to know the actual impact? Track it after you deploy system tests:

Starting point (no system tests): $50 revenue per user/month

Build system tests → find three problems:
- Recommending out-of-stock products
- Missing seasonal patterns
- No context awareness

Fix based on test results:
- Add inventory data to the ranker
- Seasonal adjustments to the embeddings
- User context awareness

Real impact:
- Revenue per user: $51.50 (+3%)
- Churn: down 1.2%
- Average order value: +2.8%

For 1M users:
- Revenue lift: $1.5M per year
- System test cost: $80K per year
- ROI: 1,875%

These aren't made-up numbers. We've seen these patterns consistently. Here's why: almost every AI system optimizes for the wrong thing at the component level. System tests expose that misalignment.

When Should You Actually Build System Tests?

System tests cost more than component tests. When are they worth it?

The short answer: as soon as the system touches revenue. Most teams shipping AI into production are already past that point. If that's you, system tests belong on the roadmap.

How to Build System Tests (Start With Business, Work Backward)

When you're setting up system tests, don't start with technical metrics. Start with business:

Step 1: Define what actually matters
- Main: revenue per user, churn, customer happiness
- Secondary: engagement, clicks, feature usage
- Technical: speed, errors (lowest priority)

Step 2: Instrument to measure it
- Track how decisions flow through the system
- Measure what happens after each decision
- Connect the results back to business metrics

Step 3: Build test journeys
- 100 realistic scenarios
- Run the system against each one
- Calculate what the revenue impact would be

Step 4: Run the full system test
- Real data, real conditions
- Everything measured
- Fail if business metrics drop

This way you're testing what's actually important.
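Wired together, those four steps can live in one gating test. A minimal sketch, assuming you can replay recorded journeys through a `pipeline.replay()` call and that each journey returns the business metrics defined in Step 1 (both assumptions about your setup):

```python
# Sketch: one gating test that follows the four steps above.
# Business metrics come first, each journey is instrumented, realistic
# scenarios are replayed, and the test fails on any business regression.

PRIMARY_METRICS = ("revenue_per_user", "churn_risk", "satisfaction")


def run_journey(journey, pipeline):
    """Step 2: replay one recorded journey and return its business
    outcomes as a dict keyed by PRIMARY_METRICS."""
    return pipeline.replay(journey)


def test_system_protects_revenue(journeys, pipeline, baseline, tolerance=0.02):
    # Step 3: replay ~100 realistic scenarios end to end.
    outcomes = [run_journey(j, pipeline) for j in journeys]

    # Step 4: aggregate and fail if any primary business metric regresses.
    for metric in PRIMARY_METRICS:
        value = sum(o[metric] for o in outcomes) / len(outcomes)
        if metric == "churn_risk":  # lower is better for churn
            assert value <= baseline[metric] * (1 + tolerance), metric
        else:
            assert value >= baseline[metric] * (1 - tolerance), metric
```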

Why System Failures Cost Way More

Here's the thing: component failures are bad, but system failures are worse. A broken component usually gets fixed or rolled back quickly. A broken system means investigation, debugging, and extended downtime.

Component failure: $10-50K. System failure: $100K-1M. For any serious system, system tests cost less than one failure.

The Bottom Line: You Need Both Layers

Component tests? Necessary. But not sufficient. They test the parts, not the whole. And the whole breaks when pieces interact badly, when real data doesn't match test assumptions, and when the optimization target doesn't match the business goal.

System tests measure what matters: does this system actually create value? That's where revenue lives. That's where you find problems before customers do. And that's what most teams skip.

If you're testing components but not the system, you're blind on the one metric that actually matters: whether your AI makes money.

Is Your System Missing Critical Evals?

We can audit your AI system architecture and identify gaps between component-level quality and system-level performance. Most teams are surprised by what they find.

Schedule a System Audit