How We Evaluate AI Accuracy on Real Deal Documents

Every AI vendor selling into private equity has a demo where the answers look right. Upload a CIM, ask about revenue, get a confident answer with a citation. The room nods.

I want to explain why that demo tells you almost nothing, and what we do at ReturnCatalyst instead. Our approach is not secret, and the questions in this post are the ones every deal team should be asking any vendor whose numbers end up in a screening model or an IC memo. That includes us.

"The Demo Looked Right" Is Not Evidence

A demo shows that a system can be right. It says nothing about how often it is right, on which kinds of documents, or what its failure modes look like on the four-hundredth document instead of the fourth.

Demos are also curated, sometimes without anyone intending to curate them. The document is one the vendor has processed before. The question is one the system handles well. The person driving knows which phrasings work. None of that is dishonest; it is just how demos naturally evolve. But it means a good demo is compatible with a system that is wrong, say, 15% of the time in ways nobody has measured.

Deal teams do not work on curated inputs. They work on 300-page CIMs assembled by different banks with different conventions, Excel files with merged cells and footnoted units, and data rooms where the "final" model has three successors. If accuracy is not measured on that kind of material, it is not measured.

The only way we know to make an honest claim about accuracy is boring: verify answers against known-correct ground truth, on real documents, continuously, with the results wired into what is allowed to ship.

Golden-Answer Sets on Real Deals

The foundation is the golden set: a bank of questions, each paired with an answer a human has verified against the source pages. Ours are built on real deal documents processed in permissioned environments, not on synthetic benchmarks.

That distinction matters more than it sounds. Synthetic test documents are clean. Real CIMs are not. Real documents have tables that span page breaks, units declared in a footnote three pages from the table, the same metric defined two different ways in two sections, and projections sitting next to reported results with nothing but a column header to distinguish them. The messiness is the test. A system that scores perfectly on synthetic documents has passed a spelling test before a debate.

Building goldens is expensive per question, because a person has to trace each answer to its source and confirm it. That expense is the point. A test that was cheap to build measures cheap things.

Strict Goldens: Exact Figures With Basis Labels

We run two tiers of goldens. The regular tier tolerates paraphrase; if the substance is right, wording differences pass. The strict tier does not negotiate.

A strict golden specifies the exact figure, the correct units, the correct period, and the correct basis label. If the verified answer is "FY2025 revenue: $31.6M (actual)," then "revenue was about $31 million" fails. So does "$31.6M" with no basis label, because in deal documents the same fiscal year routinely exists as both a projection and a reported actual, and a number that does not declare which one it is cannot be checked. We wrote up that specific failure mode separately, because it deserves its own post, "Forecast vs Actual: The $20M Question AI Gets Wrong in CIMs."
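To make the strict tier concrete, here is a minimal sketch of a field-by-field comparison. The Figure type, its field names, and the unit code "USD_M" are invented for illustration; this is one way such a check could be written, not a description of a production implementation.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import List, Optional


@dataclass(frozen=True)
class Figure:
    """One reported number plus everything needed to check it (hypothetical schema)."""
    metric: str            # e.g. "revenue"
    period: str            # e.g. "FY2025"
    value: Decimal         # expressed in `unit`
    unit: str              # e.g. "USD_M" for millions of US dollars
    basis: Optional[str]   # "actual", "projected", or None if undeclared


def strict_mismatches(golden: Figure, answer: Figure) -> List[str]:
    """Return the fields that differ. An empty list is the only passing result."""
    fields = ("metric", "period", "value", "unit", "basis")
    return [f for f in fields if getattr(golden, f) != getattr(answer, f)]


golden = Figure("revenue", "FY2025", Decimal("31.6"), "USD_M", "actual")

rounded = Figure("revenue", "FY2025", Decimal("31"), "USD_M", "actual")
unlabeled = Figure("revenue", "FY2025", Decimal("31.6"), "USD_M", None)
exact = Figure("revenue", "FY2025", Decimal("31.60"), "USD_M", "actual")

print(strict_mismatches(golden, rounded))    # ['value']
print(strict_mismatches(golden, unlabeled))  # ['basis']
print(strict_mismatches(golden, exact))      # []
```

Decimal is used instead of float so that 31.6 and 31.60 compare equal while 31.6 and 31 do not, with no tolerance to tune. Returning the mismatched field names rather than a bare boolean keeps the failure report specific about what was wrong.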

Why be this rigid? Because the downstream use is rigid. Screening multiples, growth rates, and covenant math are computed from these figures. An answer that is "approximately right" produces a multiple that is specifically wrong.
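A small worked example shows how rounding carries through. The enterprise value of $79.0M below is a made-up number chosen only to keep the arithmetic clean; the two revenue figures are the verified $31.6M and the rounded "about $31 million" from the example above.

```python
from decimal import Decimal

ENTERPRISE_VALUE_M = Decimal("79.0")  # hypothetical, for illustration only


def ev_to_revenue(revenue_m: Decimal) -> Decimal:
    """EV / revenue, rounded to two decimal places."""
    return (ENTERPRISE_VALUE_M / revenue_m).quantize(Decimal("0.01"))


verified = ev_to_revenue(Decimal("31.6"))  # figure from the verified answer
rounded = ev_to_revenue(Decimal("31"))     # figure from "about $31 million"

print(f"verified revenue: {verified}x")  # verified revenue: 2.50x
print(f"rounded revenue:  {rounded}x")   # rounded revenue:  2.55x
```

The rounded answer looks harmless as prose, but the multiple computed from it is a different, precise-looking number.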

The Ship Gate: Three Consecutive Perfect Runs

Retrieval systems are not deterministic in practice. Index states change as documents are processed. Ranking can shift between runs. Upstream services have transient behavior. A single green eval run can be luck.

So our rule for changes that touch the answer pipeline is a streak, not a pass: the full strict suite must come back perfect three consecutive times against production-like state before the change ships. One perfect run proves possibility. Three in a row begin to prove reliability.

The streak requirement has a useful side effect: it forces us to treat intermittent failures as bugs rather than noise. When a suite passes twice and fails once, the tempting move is to re-run it and move on. But an eval that flakes is almost always exposing real nondeterminism, and real nondeterminism means some user, eventually, gets the bad run. Chasing those flakes down has surfaced more genuine defects than almost anything else we do.
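The streak rule can be sketched in a few lines. The function names are illustrative, run_suite stands in for whatever actually executes the strict suite, and the probability helper assumes runs are independent, which real pipelines only approximate.

```python
from typing import Callable

REQUIRED_STREAK = 3  # the streak length from the rule above


def streak_gate(run_suite: Callable[[], bool], required: int = REQUIRED_STREAK) -> bool:
    """Pass only if `run_suite` reports a perfect run `required` times in a row.

    There is no retry inside the gate: the first imperfect run fails it.
    """
    for _ in range(required):
        if not run_suite():
            return False
    return True


def slip_through_probability(failure_rate: float, runs: int) -> float:
    """Chance that a defect failing a fraction `failure_rate` of runs still
    passes `runs` consecutive runs, ASSUMING the runs are independent."""
    return (1.0 - failure_rate) ** runs


# A suite that passes twice and fails once does not clear the gate.
flaky_outcomes = iter([True, True, False])
print(streak_gate(lambda: next(flaky_outcomes)))  # False

# A defect that shows up in 1 run out of 5:
print(round(slip_through_probability(0.2, 1), 3))  # 0.8
print(round(slip_through_probability(0.2, 3), 3))  # 0.512
```

Under that independence assumption, a defect that would sail through a single run four times out of five clears a three-run streak only about half the time. That is why three in a row only begins to prove reliability, and why the flakes are worth chasing rather than re-running.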

Eval-Gated CI: Regressions Cannot Deploy

The strict suite is wired into our deployment pipeline. If it is red, the change does not deploy. There is no routine human override.

The reason to automate this instead of relying on judgment is that the most dangerous accuracy regressions come from changes that look unrelated. A chunking adjustment made for speed. A reindexing job tweak. A prompt edit for a different feature that shares a template. Nobody reviewing those changes would think to re-run the full accuracy suite by hand, and honestly, neither would we, every time. The pipeline does not need to think of it. It runs regardless.

This inverts the default failure mode of AI products. Without an eval gate, accuracy degrades silently and you find out from a user, or worse, you never find out. With one, degradation announces itself as a blocked deploy, before anyone outside the building could be affected.
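Mechanically, an eval gate can be as plain as a step whose exit code the pipeline respects. The sketch below is an assumption about shape, not a description of any particular CI system: gate_exit_code and the case names are hypothetical, and treating an empty result set as a failure is a design choice made here for illustration.

```python
import sys
from typing import Mapping


def gate_exit_code(results: Mapping[str, bool]) -> int:
    """Map strict-suite results to a process exit code (0 means the deploy may proceed)."""
    failed = sorted(case for case, ok in results.items() if not ok)
    for case in failed:
        print(f"FAILED strict golden: {case}", file=sys.stderr)
    # A suite that ran nothing proved nothing, so an empty result set also blocks.
    return 0 if results and not failed else 1


all_green = {"fy2025_revenue_actual": True, "fy2025_ebitda_actual": True}
one_red = {"fy2025_revenue_actual": True, "fy2025_ebitda_actual": False}

print(gate_exit_code(all_green))  # 0
print(gate_exit_code(one_red))    # 1
print(gate_exit_code({}))         # 1
```

In a real pipeline the step would end with sys.exit(gate_exit_code(results)), so that a red suite stops the deploy job. The function deliberately takes no override argument.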

Live-Chat Eval Bots in Production

Everything above still shares a weakness: it tests the system before and during deploy. Production is its own environment, with its own index state, its own document corpus, its own permission boundaries and cache lifetimes, and it can drift after a perfectly green release.

So we also run eval bots against production itself. These are dedicated accounts with least-privilege access: they can see exactly one evaluation deal and nothing else. On a schedule, they ask the strict-golden questions through the same chat interface a real user hits, and their answers are checked against the goldens.

This catches the class of failures that only exist in production. A document that reports as processed but silently fell out of the search index. A cache still serving figures from a superseded document version. A fallback path quietly activating after an upstream hiccup and answering from weaker grounding. None of those are visible in CI, because CI does not live in production. The bots do.

The least-privilege part is not optional hygiene. An eval bot is an automated account probing your system on a timer. It should be scoped so that a bug in the bot, or a compromise of it, cannot touch a single byte of client data.
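A probe of this kind could be sketched as follows. Everything named here is hypothetical: the deal identifiers, the ask callable that stands in for the real chat interface, and the exact-string comparison, which is a simplification of a full strict check.

```python
from typing import Callable, Dict, Iterable

EVAL_DEAL_ID = "eval-deal"  # hypothetical identifier for the single evaluation deal


def scope_is_least_privilege(visible_deal_ids: Iterable[str]) -> bool:
    """True only if the account sees the evaluation deal and nothing else."""
    return set(visible_deal_ids) == {EVAL_DEAL_ID}


def run_probe(
    visible_deal_ids: Iterable[str],
    ask: Callable[[str], str],
    goldens: Dict[str, str],
) -> Dict[str, bool]:
    """Verify the bot's scope first, then ask every golden question."""
    if not scope_is_least_privilege(visible_deal_ids):
        raise PermissionError("eval bot scope is not exactly the evaluation deal")
    return {question: ask(question) == expected for question, expected in goldens.items()}


goldens = {"What was FY2025 revenue?": "FY2025 revenue: $31.6M (actual)"}


def degraded_ask(question: str) -> str:
    # Stand-in for a production answer that came from the wrong document version.
    return "FY2025 revenue: $31.6M (projected)"


print(run_probe(["eval-deal"], degraded_ask, goldens))
# {'What was FY2025 revenue?': False}

try:
    run_probe(["eval-deal", "client-deal-17"], degraded_ask, goldens)
except PermissionError as exc:
    print(f"refused to run: {exc}")
```

Checking scope before asking anything means an over-permissioned bot fails loudly instead of quietly probing with access it should not have.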

What This Catches That Review Cannot

Across all of this machinery, the recurring theme is that the failures worth catching are plausible. A silently unindexed document produces an answer from the remaining corpus that reads fine. A precedence bug that answers from an outdated model version produces real numbers, just the wrong ones. A right figure with the wrong basis label parses as perfectly reasonable English.

A human reviewer, however senior, cannot reliably spot plausible-but-wrong by inspection. Only comparison against a verified answer can. That is the whole argument for evals in one sentence.

Questions Worth Asking Any Vendor

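If you are evaluating AI tooling for deal work, ours or anyone's, the useful questions are mechanical. Is accuracy measured against human-verified answers on real deal documents, or on synthetic benchmarks? Does a passing answer require the exact figure, units, period, and basis label? How many consecutive clean runs does a change need before it ships? Can a failing eval block a deploy without anyone choosing to let it? Is anything checking answers in production after release, and what can that account see?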

None of this is proprietary insight. It is discipline, applied to a domain where the cost of a wrong number is measured in basis points and reputations. Our outputs are decision-support for professional review, not investment advice, and the evals are what make that review tractable: page-level citations tell the analyst where to look, and measured accuracy tells the team how much looking is warranted.

If you want to see how this discipline shows up in the product, start with our AI platform for private equity or go straight to CIM analysis and run a document you have already modeled by hand. Comparing against your own numbers is, after all, exactly the methodology we are advocating.