Product Strategy
Vikesh · February 18, 2025 · 13 min read

System-Level Evals and Revenue Protection

Here's the thing most teams miss: they're testing their models in isolation. Does this LLM write good summaries? Does this classifier categorize stuff correctly? Does this recommendation engine work? All good questions. But they're the wrong questions because they're looking at pieces, not the whole thing.

The real disasters happen when those pieces talk to each other. This is where evals fail silently, and where revenue quietly disappears. Two components that both pass tests can absolutely tank the system when they're connected.

Components pass. System fails. And that's the gap where all the money drains out.

The Gap: What Component Tests Miss

Take a recommendation system. Let's say you have an e-commerce site and you're using AI to suggest products:

Component 1: Embeddings (LLM-based)
- Test: How good are the embeddings?
- Result: ✓ PASS (they're accurate)

Component 2: Preference Classifier (tuned model)
- Test: Does it classify correctly?
- Result: ✓ PASS (93% accuracy)

Component 3: Ranker (rule-based)
- Test: Does it improve click rates?
- Result: ✓ PASS (5% CTR boost in testing)

Component 4: Business Logic (rules)
- Test: Do the constraints work?
- Result: ✓ PASS (all rules enforced)

End-to-End System Test: None
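To make the gap concrete, here's a minimal sketch of what that suite tends to look like as pytest-style checks. Every function name and threshold below is a stand-in, not a real API; the point is that each assertion is scoped to a single piece, and nothing exercises the pipeline end to end.

```python
# A minimal sketch of a component-only eval suite (pytest style).
# The eval helpers below are placeholders; in a real repo they would
# load golden data and score each component on its own.

def evaluate_embeddings() -> float:
    return 0.88   # placeholder: semantic-similarity score on a labeled set

def classifier_accuracy() -> float:
    return 0.93   # placeholder: held-out accuracy

def ranker_ctr_lift() -> float:
    return 0.05   # placeholder: CTR lift vs. the baseline ranker

def business_rules_hold() -> bool:
    return True   # placeholder: constraint checks on synthetic carts


def test_embeddings_quality():
    assert evaluate_embeddings() >= 0.85    # Component 1: passes

def test_preference_classifier():
    assert classifier_accuracy() >= 0.93    # Component 2: passes

def test_ranker():
    assert ranker_ctr_lift() >= 0.05        # Component 3: passes

def test_business_logic():
    assert business_rules_hold()            # Component 4: passes

# Conspicuously missing: a test that runs real sessions through
# embeddings -> classifier -> ranker -> business logic and asserts
# on revenue per user, conversion, or churn.
```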

All of it passes. So you ship. And here's what actually happens: revenue per user doesn't budge, churn goes up, and the dashboards all look fine. The system is technically working. But it's broken.

Why Component Tests Are Incomplete

Component tests optimize locally. System tests optimize globally. They're measuring different things:

| Dimension | Component Tests | System Tests |
|---|---|---|
| What's being tested | Does this piece work? | Does the whole thing create value? |
| What gets measured | Technical metrics (accuracy, precision, recall) | Business metrics (revenue, churn, satisfaction) |
| What data | Clean, handpicked test data | Real, messy user data |
| What breaks | Bugs in individual pieces | Failures in how pieces interact |
| Cost | $20K per piece | $80K-150K for the whole system |

The trap: teams build component tests and call it done. They assume that if all pieces pass, the whole thing works. That's rarely true when you've got multiple ML layers talking to each other.

Building System-Level Tests That Protect Revenue

Here's the right architecture:

End-to-End Eval Architecture

Component layer (each piece has its own eval):
- Embedding model: semantic quality eval
- Classifier: accuracy eval (93%)
- Ranker: CTR improvement eval

Integration gaps (not measured):
- Embedding → Classifier: 5% mismatch drift
- Classifier → Ranker: rank inflation in the top 10%

System layer:
- End-to-end system eval on real data: user satisfaction, CTR, conversion, revenue per user
- Test setup: 10K real user sessions, measured on business outcomes

Failures caught at the system level only:
- Failure mode A: high-ranking recommendations are technically good but contextually bad (already owned by the user, out of stock). Business impact: -3% AOV
- Failure mode B: recommendations don't account for seasonal preferences, so winter recommendations show up in summer. Business impact: -12% retention

Revenue impact:
- Without system evals: failures discovered by customers, six weeks to fix, $300K+ in lost revenue (3% AOV × 12 weeks)
- With system evals: failures caught before shipping, one to two weeks to fix, roughly $250K in failures prevented

Look at that breakdown. Component tests measure the wrong thing; they can all pass while the system tanks. System-level tests measure what counts: real business outcomes on real data.
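For contrast, here's a rough sketch of what a system-level eval can look like in Python: it replays logged sessions through the whole pipeline and fails on business outcomes. The `run_pipeline(session)` entry point and the `SessionResult` fields are assumptions about your stack, not a prescribed API.

```python
# Sketch of a system-level eval: replay logged user sessions through the
# full pipeline and score business outcomes, not component metrics.
# `run_pipeline` and the SessionResult fields are assumptions.

from dataclasses import dataclass
from statistics import mean


@dataclass
class SessionResult:
    clicked: bool
    converted: bool
    order_value: float  # 0.0 if the session ended without a purchase


def run_system_eval(sessions, run_pipeline, baseline, tolerance=0.02):
    """Replay every logged session and compare business metrics
    against the current production baseline."""
    results = [run_pipeline(s) for s in sessions]

    metrics = {
        "ctr": mean(r.clicked for r in results),
        "conversion": mean(r.converted for r in results),
        "revenue_per_session": mean(r.order_value for r in results),
    }

    # The eval fails on business regressions, not on accuracy drops.
    failures = {
        name: (value, baseline[name])
        for name, value in metrics.items()
        if value < baseline[name] * (1 - tolerance)
    }
    return metrics, failures
```

The detail that matters is the failure condition: it is written in terms of conversion and revenue, with a small tolerance for noise, rather than in any component metric.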

Three Ways System Tests Catch What Component Tests Miss

1. Pieces Don't Play Together Well

Two things that both work individually can completely break when they're connected. Picture: a classifier that's accurate but way too confident, plus a ranker that trusts classifier scores completely. Result: your top recommendations are consistently wrong for specific groups of customers.

Component tests never catch this. The tests are separate. You only see it when everything's wired together.
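One way to catch this earlier is an explicit check at the boundary between the two components. The sketch below measures calibration error on the classifier's probabilities, since a ranker that multiplies raw scores will amplify any overconfidence; the bin count and threshold are illustrative, not prescriptive.

```python
# Sketch of an integration check at the classifier -> ranker boundary.
# Accuracy can stay high while predicted probabilities drift out of
# calibration, and a ranker that trusts raw scores amplifies that error.

import numpy as np


def expected_calibration_error(probs, labels, n_bins=10):
    """Average gap between predicted confidence and observed outcome rate."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    bin_idx = np.minimum((probs * n_bins).astype(int), n_bins - 1)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            gap = abs(probs[mask].mean() - labels[mask].mean())
            ece += gap * mask.mean()  # weight by the bin's share of traffic
    return ece


def test_classifier_ranker_boundary(probs, labels):
    # The ranker trusts these scores verbatim, so calibration drift here
    # translates directly into mis-ranked recommendations.
    assert expected_calibration_error(probs, labels) < 0.05
```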

2. Real Data Isn't Like Test Data

A system might work great on average but fail when it hits real life. Example: recommendations work fine for popular products but choke on obscure ones. Or they crush it in summer but fail in winter.

Component tests run on clean, balanced data. Reality is skewed, seasonal, messy. System tests measure performance on actual user behavior.
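A practical guard is to never report a single aggregate number: run the same system eval per slice (season, product popularity, user segment) and fail on the worst slice. A sketch, with the slice key, result fields, and floor all as assumptions:

```python
# Sketch: slice the system eval instead of reporting one average.
# An aggregate number can look healthy while an entire slice
# (long-tail products, winter sessions) quietly fails.

from collections import defaultdict
from statistics import mean


def eval_by_slice(sessions, run_pipeline, slice_key):
    """Group sessions by a slice key and compute conversion per slice."""
    buckets = defaultdict(list)
    for session in sessions:
        result = run_pipeline(session)
        buckets[slice_key(session)].append(result.converted)
    return {name: mean(outcomes) for name, outcomes in buckets.items()}


def assert_no_weak_slice(slice_metrics, floor=0.02):
    # Fail on the worst slice, not on the average.
    worst = min(slice_metrics, key=slice_metrics.get)
    assert slice_metrics[worst] >= floor, (
        f"slice '{worst}' converts at {slice_metrics[worst]:.1%}"
    )
```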

3. You're Optimizing for the Wrong Thing

The system hits all your metrics. CTR is up. But users aren't staying. The ranker optimizes for clicks, but people who click don't buy, so they leave. You've optimized the system perfectly at the component level. But you're optimizing the wrong goal.

This is the sneaky one. Everything's working as designed. It's just designed badly.
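A simple counter is to score clicks and purchases side by side for the same recommendations, so the eval itself flags when the optimization target and the business outcome drift apart. A sketch using a hypothetical logging schema:

```python
# Sketch: track CTR and downstream conversion for the same recommendations.
# A ranker tuned purely for clicks can raise CTR while revenue falls;
# keeping both numbers in one eval makes the divergence visible.

from statistics import mean


def click_vs_purchase_report(impressions):
    """Each impression is a dict with 'clicked', 'purchased', 'order_value'."""
    return {
        "ctr": mean(i["clicked"] for i in impressions),
        "conversion": mean(i["purchased"] for i in impressions),
        "revenue_per_impression": sum(i["order_value"] for i in impressions)
        / len(impressions),
    }


def test_objective_alignment(candidate, baseline):
    # A CTR win only counts if conversion and revenue hold up too.
    assert candidate["ctr"] >= baseline["ctr"]
    assert candidate["conversion"] >= baseline["conversion"] * 0.98
    assert (
        candidate["revenue_per_impression"]
        >= baseline["revenue_per_impression"] * 0.98
    )
```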

How Much Does This Cost? The Real Numbers

Want to know the actual impact? Track it after you deploy system tests:

Starting point (no system tests): $50 revenue per user/month

Build system tests → find three problems:
- Recommending out-of-stock products
- Missing seasonal patterns
- No context awareness

Fix based on test results:
- Add inventory data to the ranker
- Seasonal adjustments to the embeddings
- User context awareness

Real impact:
- Revenue per user: $51.50 (+3%)
- Churn: down 1.2%
- Average order value: +2.8%

For 1M users:
- Revenue lift: $1.5M per year
- System test cost: $80K per year
- ROI: 1,875%

These aren't made-up numbers. We've seen these patterns consistently. Here's why: almost every AI system optimizes for the wrong thing at the component level. System tests expose that misalignment.

When Should You Actually Build System Tests?

System tests cost more than component tests. When are they worth it?

The short answer: as soon as the system touches revenue. Most teams shipping AI into production are already past that point. If that's you, system tests belong on the roadmap.

How to Build System Tests (Start With Business, Work Backward)

When you're setting up system tests, don't start with technical metrics. Start with business:

Step 1: Define what actually matters
- Main: revenue per user, churn, customer happiness
- Secondary: engagement, clicks, feature usage
- Technical: speed, errors (lowest priority)

Step 2: Instrument to measure it
- Track how decisions flow through the system
- Measure what happens after each decision
- Connect the results back to business metrics

Step 3: Build test journeys
- 100 realistic scenarios
- Run the system against each one
- Calculate what the revenue impact would be

Step 4: Run the full system test
- Real data, real conditions
- Everything measured
- Fail if business metrics drop

This way you're testing what's actually important.
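Wired together, those four steps can live in one gating test. A minimal sketch, assuming you can replay recorded journeys through a `pipeline.replay()` call and that each journey returns the business metrics defined in Step 1 (both assumptions about your setup):

```python
# Sketch: one gating test that follows the four steps above.
# Business metrics come first, each journey is instrumented, realistic
# scenarios are replayed, and the test fails on any business regression.

PRIMARY_METRICS = ("revenue_per_user", "churn_risk", "satisfaction")


def run_journey(journey, pipeline):
    """Step 2: replay one recorded journey and return its business
    outcomes as a dict keyed by PRIMARY_METRICS."""
    return pipeline.replay(journey)


def test_system_protects_revenue(journeys, pipeline, baseline, tolerance=0.02):
    # Step 3: replay ~100 realistic scenarios end to end.
    outcomes = [run_journey(j, pipeline) for j in journeys]

    # Step 4: aggregate and fail if any primary business metric regresses.
    for metric in PRIMARY_METRICS:
        value = sum(o[metric] for o in outcomes) / len(outcomes)
        if metric == "churn_risk":  # lower is better for churn
            assert value <= baseline[metric] * (1 + tolerance), metric
        else:
            assert value >= baseline[metric] * (1 - tolerance), metric
```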

Why System Failures Cost Way More

Here's the thing: component failures are bad, but system failures are worse. A broken component usually gets fixed or rolled back quickly. A broken system means investigation, debugging, and extended downtime.

Component failure: $10-50K. System failure: $100K-1M. For any serious system, system tests cost less than one failure.

The Bottom Line: You Need Both Layers

Component tests? Necessary. But not sufficient. They test the parts, not the whole. And the whole breaks when pieces interact badly, when real data doesn't match test assumptions, and when the optimization target doesn't match the business goal.

System tests measure what matters: does this system actually create value? That's where revenue lives. That's where you find problems before customers do. And that's what most teams skip.

If you're testing components but not the system, you're blind on the one metric that actually matters: whether your AI makes money.

Is Your System Missing Critical Evals?

We can audit your AI system architecture and identify gaps between component-level quality and system-level performance. Most teams are surprised by what they find.

Schedule a System Audit