Defining "Good" AI in Terms of Business Cost
Your data scientist walks into the room and says "94% F1 score." You nod like you understand. But deep down, you're thinking: does that actually make us money?
Here's the uncomfortable truth. F1 scores don't mean anything to your CFO. Neither do precision, recall, or AUC. These are academic theater. They're fine if you're in a PhD program comparing algorithms. But in a business? They're noise. A model that hits 87% F1 while costing $2M a year is objectively worse than one that scores 79% F1 but makes you $8M. Yet most teams can't tell you which one they'd pick.
The real issue isn't the metrics themselves. It's that we've never built a bridge between how data scientists think and how business people think. Your DS team is speaking in accuracy and curves. Your CFO is speaking in dollars per transaction. These languages don't translate. At all.
And that's where the whole thing falls apart. Not because models are bad. They're actually pretty good. But because nobody can actually answer the question: is this AI system making us money? Everyone's measuring something different, so nobody knows.
From F1 Scores to Cost-Per-Error Metrics
So we need to fix this. The solution is actually dead simple: stop asking "how accurate is it?" and start asking "what does it cost when it's wrong?"
I know, I know. It sounds obvious. But most teams don't do this. Let me walk through the approach we use:
- Identify all the ways it can fail: False positive? False negative? Some errors matter more than others.
- Put a number on each error: How much does it actually cost when the model screws up?
- Calculate total cost per batch of predictions: Run the numbers on 1,000 predictions. What's the real damage?
- Compare to doing nothing: Would you be better off just hiring humans? Or using last year's method?
- Calculate actual ROI: Does this model actually save us money? (There's a minimal sketch of this comparison right after this list.)
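Here's a minimal sketch of that comparison. Every dollar figure and the 5% error rate are placeholders, not recommendations; the point is the shape of the calculation: expected value per item for the model versus the flat cost of the alternative.

```python
# ROI-vs-baseline sketch. All figures are illustrative placeholders.

def net_value_per_item(value_when_right: float, cost_when_wrong: float,
                       error_rate: float) -> float:
    """Expected dollar value of one automated decision."""
    return (1 - error_rate) * value_when_right + error_rate * cost_when_wrong

# Model: creates $1.50 when it's right, costs $8.00 when it's wrong,
# and is wrong 5% of the time.
model_value = net_value_per_item(value_when_right=1.50, cost_when_wrong=-8.00, error_rate=0.05)

# Baseline (humans, or last year's method): a flat $1.20 cost per item.
baseline_value = -1.20

items_per_year = 2_000_000
annual_gain = (model_value - baseline_value) * items_per_year

print("model value per item:", round(model_value, 2))
print("baseline value per item:", baseline_value)
print("annual gain vs. baseline:", round(annual_gain))
```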
Let me show you what this looks like in real life. Say you're in customer support and you're trying to automate ticket routing. Here are your error types:
Step 1: Define Error Types
- True Positive: Ticket routed correctly → goes straight to the right team, customer gets helped fast
- False Positive: Ticket goes to the wrong team → they have to re-route it, customer waits longer, everyone's annoyed
- True Negative: Model correctly says "I don't know, send to human" → ticket goes to a real person
- False Negative: Model says "I don't know" when it actually could have handled it → you're paying humans to do work the AI could've done
Step 2: Calculate Cost-Per-Error
Now comes the harder part. Sit down with your finance and ops teams and ask: how much does each one of these actually cost us?
- True Positive (TP): +$0.50/ticket (we save $1.50 on manual handling minus the $1 we pay for the system)
- False Positive (FP): -$8.50/ticket (re-routing costs us $8, and the customer's frustration is worth another $0.50 in goodwill)
- True Negative (TN): -$1.00/ticket (we're paying $1 for human review, which is the safe choice)
- False Negative (FN): -$1.00/ticket (we could've automated it, but we paid a human instead: that's lost efficiency)
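Once those numbers are signed off, encode them somewhere versioned so every evaluation uses the same matrix. A minimal sketch, using the illustrative figures above (the dictionary name is mine):

```python
# The agreed cost matrix for ticket routing, in dollars per ticket.
# These are the illustrative figures from above -- replace them with the
# numbers your finance and ops teams actually signed off on, and keep
# this definition under version control.
COST_PER_TICKET = {
    "TP": +0.50,  # routed correctly: $1.50 saved on manual handling minus $1 system cost
    "FP": -8.50,  # routed to the wrong team: $8 re-routing plus $0.50 goodwill hit
    "TN": -1.00,  # correctly deferred to a human: $1 of manual review
    "FN": -1.00,  # unnecessarily deferred to a human: lost automation efficiency
}
```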
Step 3: Run the Model Through the Cost Matrix
Pull your model's actual performance stats. Let's say it does something like this:
- 85% getting routed correctly (true positives)
- 3% getting routed to the wrong team (false positives)
- 8% flagged correctly for manual review (true negatives)
- 4% unnecessarily flagged for manual review (false negatives)
Now run the numbers on 1,000 actual tickets:
- 850 TPs × $0.50 = $425
- 30 FPs × -$8.50 = -$255
- 80 TNs × -$1.00 = -$80
- 40 FNs × -$1.00 = -$40
You're making $50 per 1,000 tickets. Compare that to the baseline of manually routing everything, which creates no value at all: it's a pure cost center.
And suddenly you can tell your CFO something real: "This model makes us $50 per thousand tickets. We process 2M a year, so that's $100K in annual value." That's a number they understand. That's something you can track month to month.
The Business Cost Function Model
Let me formalize this a bit. What we're doing is building what I call the Business Cost Function: a system for translating model performance into actual dollars.
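In code it's one small function: outcome rates in, dollars out. Here's a minimal sketch using the ticket-routing numbers from above; the function and variable names are mine, not a standard API.

```python
# Business Cost Function: outcome rates + cost matrix -> dollars.
# Rates and costs are the illustrative ticket-routing example from above.
COST_PER_TICKET = {"TP": +0.50, "FP": -8.50, "TN": -1.00, "FN": -1.00}
OUTCOME_RATES = {"TP": 0.85, "FP": 0.03, "TN": 0.08, "FN": 0.04}

def value_per_prediction(rates: dict[str, float], costs: dict[str, float]) -> float:
    """Expected dollar value of a single prediction."""
    return sum(rates[outcome] * costs[outcome] for outcome in costs)

per_prediction = value_per_prediction(OUTCOME_RATES, COST_PER_TICKET)
print(f"value per prediction:    ${per_prediction:.3f}")               # $0.050
print(f"value per 1,000 tickets: ${per_prediction * 1_000:.0f}")       # $50
print(f"annual value at 2M/year: ${per_prediction * 2_000_000:,.0f}")  # $100,000
```

Model comparison then stops being a debate: run Model A's rates and Model B's rates through the same function and pick the bigger number.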
Building an Executive-Friendly Metrics Dashboard
Once you've got your cost matrix nailed down, build a dashboard. Not for data scientists. For executives. Something that updates daily and actually tells the story of whether your AI is working:
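Here's a minimal sketch of the rollup behind a dashboard like that: a daily job that converts the day's prediction outcomes into dollar terms. It assumes the illustrative ticket-routing costs from above; the field names and output shape are my own, not a prescribed schema.

```python
from collections import Counter
from datetime import date

# Illustrative cost matrix from the ticket-routing example; swap in your own.
COST_PER_TICKET = {"TP": +0.50, "FP": -8.50, "TN": -1.00, "FN": -1.00}

def daily_dashboard_row(outcomes: list[str], annual_volume: int) -> dict:
    """Roll one day's outcomes ("TP"/"FP"/"TN"/"FN") into executive-facing dollar figures."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    day_value = sum(counts[o] * COST_PER_TICKET[o] for o in counts)
    return {
        "date": date.today().isoformat(),
        "predictions": total,
        "net_value_today": round(day_value, 2),
        "value_per_1k": round(day_value / total * 1_000, 2),
        "annualized_run_rate": round(day_value / total * annual_volume),
        "false_positive_cost_today": round(counts["FP"] * COST_PER_TICKET["FP"], 2),
    }

# Example: yesterday's 1,000 routed tickets, matching the counts used earlier.
outcomes = ["TP"] * 850 + ["FP"] * 30 + ["TN"] * 80 + ["FN"] * 40
print(daily_dashboard_row(outcomes, annual_volume=2_000_000))
```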
See the difference? A confusion matrix tells you "82.3% precision." A dashboard like that tells you something your CFO actually cares about: say, that you're making $3M a year, it's ticking up slightly, but those false positives are getting worse and you should probably look at that.
Calculating Cost-of-Wrong for Each Prediction Type
Here's where it gets tricky. You need to sit down and actually talk to your business people about what these errors cost. Not guessing. Not "it probably costs something." Actual numbers.
For Each Error Type, Ask:
- What does it cost right now? The immediate, direct cost?
- What happens downstream? Does it create other problems?
- What opportunity are we losing? What money don't we make?
- What's the risk exposure? Regulatory hit? Customer churn? Lawsuits?
Example: False Positive in Fraud Detection
- Direct: Support team spends 30 minutes investigating = $40
- Indirect: Customer has to jump through hoops to prove they're not fraudulent = $20 in friction
- Opportunity: Sometimes the customer just walks away from the transaction = $60 in lost revenue
- Risk: They post on social media about how your system flagged them unfairly, some customers churn = $40 expected value
- Total cost per bad flag: $160
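It helps to keep that breakdown itself in code (or at least somewhere shared), not just the final $160, so the assumptions stay visible and arguable. A minimal sketch:

```python
# Cost of one false-positive fraud flag, broken into the components above.
# Every figure is an estimate to agree with finance and ops, not a measured fact.
FALSE_POSITIVE_FRAUD_COST = {
    "direct_investigation": 40.0,   # ~30 minutes of support time
    "customer_friction": 20.0,      # hoops the customer jumps through to prove themselves
    "lost_transaction": 60.0,       # expected revenue when the customer walks away
    "churn_and_reputation": 40.0,   # expected value of goodwill damage and churn
}

print(f"cost per bad flag: ${sum(FALSE_POSITIVE_FRAUD_COST.values()):.0f}")  # $160
```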
Yeah, it's tedious. And yes, you're going to argue about some of these numbers. But once you've done it, model evaluation gets boring: Does this model cost us less than the alternative? If yes, we deploy it. If no, we don't.
Real-World Metrics That Work
Here's what this looks like in different industries. Notice the pattern: we're not measuring accuracy. We're measuring money:
E-commerce: Revenue from recommendations minus the cost when recommendations are trash
Healthcare: Cost per diagnosis (what you spend treating patients correctly, plus what you lose when you're wrong)
Operations: Cost per decision (what it costs to act, plus what it costs when you act wrong)
Support: Cost per customer interaction (automation savings minus how much bad routing damages relationships)
HR: Cost per hire (good decisions save time, bad decisions cost a ton in turnover and training)
See what's happening here? None of these are about accuracy. They're all about money. Whether the AI is "accurate" is almost irrelevant. The only thing that matters is whether it costs less than the alternative.
Implementation: Getting Started
So how do you actually do this? Here's the path I'd recommend:
Week 1: Get Everyone in a Room and Agree on Costs
Bring your data scientists, your finance people, and your ops people together. You're going to argue. That's fine (that's the point). By the end of the week, you've got a cost matrix everyone has agreed to. Write it down. Get sign-off. This is your truth.
Week 2-3: Throw Out Your Old Metrics
Stop measuring F1 score. Stop measuring precision. Start measuring cost-per-error. When you evaluate models, you're asking one question: does Model A or Model B cost less?
Week 4: Build a Real Dashboard
Something your CFO would want to look at. Daily updates. Is the model making money today? Is it trending up or down? Are we hitting our targets?
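As a sketch of the "is it trending up or down" piece, here's a simple check on the trailing week's value per 1,000 predictions; the target and tolerance are purely illustrative assumptions.

```python
# Flag the model for attention when the trailing week's value per 1,000
# predictions drops below target, or falls sharply versus the prior week.
# The target and tolerance are illustrative assumptions, not recommendations.

def needs_attention(daily_value_per_1k: list[float],
                    target_per_1k: float = 40.0,
                    drop_tolerance: float = 0.15) -> bool:
    """daily_value_per_1k: one entry per day, most recent last, at least 14 days."""
    this_week = sum(daily_value_per_1k[-7:]) / 7
    last_week = sum(daily_value_per_1k[-14:-7]) / 7
    below_target = this_week < target_per_1k
    falling_fast = last_week > 0 and (last_week - this_week) / last_week > drop_tolerance
    return below_target or falling_fast

# Example: value per 1,000 tickets over the last two weeks.
history = [52, 50, 51, 49, 50, 48, 47, 45, 44, 41, 40, 38, 37, 36]
print(needs_attention(history))  # True: value is sliding, investigate or retrain
```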
Ongoing: Revisit Your Cost Matrix Every Quarter
Business changes. Markets shift. Your cost matrix should shift with them. If fraud's gotten more expensive, update it. If your error tolerance has changed, update it. Your models aren't static. Why should your cost function be?
F1 scores are useless. They're not even measuring what you care about. Define what errors actually cost. Then optimize for that. Your CFO will understand in thirty seconds. Your board will ask fewer questions. Your team will stop arguing about metrics nobody cares about.
Why This Matters
Organizations that do this well (the ones that actually connect model performance to money) are the ones whose AI works long-term. They're not deploying models and crossing their fingers. They know exactly what they're making every single day. They catch problems before they become disasters. And when they talk to their CFO, they don't sound like they're speaking a foreign language.
This isn't a technical thing. It's organizational. Your business people and your AI people need to speak the same language. That language is money. It takes discipline to define your cost matrix properly. It takes honesty to admit when your model's losing money. But once you do it, everything else gets simple.
Model selection? Done. Pick whichever one costs less. Explaining AI to the CFO? Done. It makes us $10M a year. Knowing when to retrain? Done. If costs rise, we act. If costs stay stable, we leave it alone.
So here's what I'd do: stop reading this and go define your cost matrix. Actually sit down with your team. Work through each error type. Put numbers on it. That one conversation will change your entire relationship with AI evaluation.
Is Your Team Speaking Different Languages?
We help align data science, engineering, and business teams on unified cost-aware metrics that predict actual ROI.
Schedule a Metrics Workshop