Forecast vs Actual: The $20M Question AI Gets Wrong in CIMs

Here is a scenario that is completely ordinary in a deal process, and quietly fatal for generic AI tooling.

A CIM written in 2023 projects FY2025 revenue at $52 million. It is now 2026. The deal is back in market, and the data room now also contains the company's 2025 financial statements: $32 million, reported. Both numbers describe "2025 revenue." Both are true statements within their own frame. One is what management hoped in 2023; the other is what happened.

Now ask an AI assistant sitting on that data room: "What was 2025 revenue?"

With generic retrieval, the honest answer to "what will the AI say" is: whichever chunk ranks first. Sometimes $52M, sometimes $32M, sometimes a fluent blend of both. That is a $20 million spread on the single most load-bearing number in a screen, delivered with a confident tone and a citation either way.

Why Deal Documents Do This

This is not sloppy documentation. It is how deal documents work.

A CIM has a vintage. Projections are not an appendix to a CIM; they are much of its point. A banker writing in 2023 is selling a trajectory, so the document is dense with forward years presented in the same tabular format as historicals, sometimes distinguished by nothing more than a "P" or an "E" in a column header, sometimes by a footnote, sometimes by convention alone.

Then the data room accumulates. The original CIM stays. Audited financials arrive. A quality-of-earnings report restates EBITDA with its own adjustments. A reforecast lands mid-process. A lender model gets updated twice. By the time a team is deep in diligence, a single fiscal year can exist in the corpus as a management projection, a budget, a reforecast, a preliminary actual, and an audited actual, each in a different document with a different date.

Human analysts handle this instinctively. They check the document's date before trusting its numbers. They read column headers. They know a figure from a 2023 CIM cannot be an actual for 2025. The skill is so automatic that nobody thinks of it as a skill, until they watch software that lacks it.

Why Generic Retrieval Gets It Wrong

The dominant pattern for AI document tooling is retrieval-augmented generation: embed the documents, embed the question, fetch the most similar passages, and let a frontier AI model compose an answer from them.

The failure is in the word "similar." "FY2025 revenue" in a projection table and "FY2025 revenue" in an audited income statement are nearly identical strings. The embedding of a chunk captures what the text is about, not its epistemic status. Nothing in a similarity score encodes "this figure is a forecast made two years before the fact" versus "this figure is a reported result." The ranker is answering "which passage resembles the query?" when the question that matters is "which figure is the current truth about this period?"

So the projection and the actual arrive at the model as peers, and which one leads can flip with index state, chunking boundaries, or phrasing. The model, given both, may pick either one or blend the two into a single answer. And because the output is fluent and cited, it reads as verified. A citation to a real page of a real document is not the same thing as a citation to the right page of the right document.
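A toy illustration of the tie, under loud assumptions: a bag-of-words cosine score stands in for a real embedding model, and the two chunk strings are invented to match the scenario above. Real embeddings will not tie exactly, but the point carries: nothing in the score knows which chunk is the forecast.

```python
import re
from collections import Counter
from math import sqrt

def bag(text: str) -> Counter:
    """Toy stand-in for an embedding: a bag of lowercased alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

query = "What was FY2025 revenue?"
projection_chunk = "FY2025 revenue: $52.0M"  # hypothetical chunk from the 2023 CIM
actual_chunk = "FY2025 revenue: $32.0M"      # hypothetical chunk from the FY2025 statements

def top_hit(index: list[str]) -> str:
    """Return the best-scoring chunk; sorted() is stable, so ties keep index order."""
    return sorted(index, key=lambda c: cosine(bag(query), bag(c)), reverse=True)[0]

# Both chunks score identically against the query, so index order decides the winner.
print(top_hit([projection_chunk, actual_chunk]))
print(top_hit([actual_chunk, projection_chunk]))
```

Reordering the index flips the top result without any change to the documents or the question.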

What This Does to Screening Math

Screening runs on a handful of numbers, and this failure lands on exactly those numbers.

Take the EBITDA version of the same scenario. The 2023 CIM projects FY2025 EBITDA of $10 million. The actual comes in at $6.5 million. The deal is being discussed around a $65 million enterprise value.

On projected EBITDA, that is 6.5x, which screens as sane for the sector. On actual EBITDA, it is 10x, which is a different conversation entirely, and possibly no conversation. Same deal, same data room, same question. The only variable is which basis the number came from.
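The arithmetic is small enough to write out. A minimal sketch using the figures from the scenario; the function and variable names are ours, for illustration only.

```python
def ev_to_ebitda(enterprise_value_musd: float, ebitda_musd: float) -> float:
    """Enterprise value divided by EBITDA, both in $M."""
    if ebitda_musd <= 0:
        raise ValueError("an EV/EBITDA multiple is not meaningful for non-positive EBITDA")
    return enterprise_value_musd / ebitda_musd

ENTERPRISE_VALUE_MUSD = 65.0

# Same deal, same enterprise value; only the basis of the EBITDA figure changes.
multiple_on_forecast = ev_to_ebitda(ENTERPRISE_VALUE_MUSD, 10.0)  # FY2025 per the 2023 CIM
multiple_on_actual = ev_to_ebitda(ENTERPRISE_VALUE_MUSD, 6.5)     # FY2025 as reported

print(f"on projected EBITDA: {multiple_on_forecast:.1f}x")
print(f"on actual EBITDA:    {multiple_on_actual:.1f}x")
```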

The second-order loss is worse. The gap between plan and actual is one of the most information-dense signals in a data room: a company that missed its 2023 revenue plan for FY2025 by roughly 38% ($32 million reported against $52 million projected) has told you something important about management forecasting, demand, or both. A system that cannot tell forecast from actual does not just risk quoting the wrong number. It erases the variance signal entirely, because you cannot compute plan-versus-actual if your tooling thinks they are the same number.
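Plan-versus-actual is a one-line calculation once the two figures are kept apart. A minimal sketch with the scenario's numbers; the helper name is hypothetical.

```python
def plan_variance(plan: float, actual: float) -> float:
    """Signed variance of actual against plan, as a fraction of plan."""
    if plan == 0:
        raise ValueError("variance against a zero plan is undefined")
    return (actual - plan) / plan

revenue_miss = plan_variance(plan=52.0, actual=32.0)  # $M, FY2025 revenue
ebitda_miss = plan_variance(plan=10.0, actual=6.5)    # $M, FY2025 EBITDA

print(f"revenue vs plan: {revenue_miss:+.1%}")  # about -38.5%
print(f"EBITDA vs plan:  {ebitda_miss:+.1%}")   # about -35.0%
```

The calculation needs two distinct inputs with known bases. If both arrive labeled only "FY2025 revenue," there is nothing to subtract.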

The Fix: Four Disciplines

None of this is solved by a better model or a bigger context window. A model reading an unlabeled figure has no more information than the retrieval gave it. The fix is structural, and in our experience it takes four disciplines working together.

1. Basis-first labeling. Every extracted figure carries its basis (actual, forecast, budget, pro forma, or LTM) from extraction all the way into the answer a user reads. An answer of "FY2025 revenue: $32.0M (actual, per FY2025 financial statements)" can be verified in seconds. An answer of "$52M" cannot even be evaluated, because it does not say what claim it is making. Internally, we treat a figure without a basis label as unfinished extraction, not as an answer.

2. Actual-first series selection. When a period exists as both forecast and actual, the reported actual wins by default for any "what was" question. The projection is not discarded; it is demoted and relabeled: "the 2023 CIM projected $52M for FY2025." Crucially, this rule is enforced when the financial series is assembled, not delegated to the language model's judgment one answer at a time. A rule applied at composition is a property of the system. A rule suggested in a prompt is a tendency.

3. As-of-date discipline. Every figure carries two dates: the period it describes and the date it was asserted. A 2023 projection of 2025 is a statement made in 2023, and keeping both dates is what lets the system answer time-based questions correctly. "What did management expect FY2025 to look like at the time of the CIM?" is a legitimate question, a different question, and often a very good one. Without as-of dates, it is unanswerable. With them, it is a filter.

4. Canonical financial caches, refreshed on document change. Re-deriving core financials from raw retrieval on every question re-rolls the ranking dice every time. Instead, maintain one canonical financial series per deal, built with the three disciplines above, with basis and as-of on every cell, and answer figure questions from that. Then treat freshness as a hard requirement: a new document, a superseded version, or a deleted file triggers a recompute. Skip that last part and you have traded "wrong chunk wins" for "stale cache wins," which is the same failure with better manners.
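The four disciplines can be sketched together in a few dozen lines. This is an illustration under stated assumptions, not a description of any production system: the class and function names, the file names, the document dates, and the hand-entered figures inside the placeholder `extract_figures` are all hypothetical.

```python
import hashlib
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Callable, Optional

class Basis(Enum):
    ACTUAL = "actual"
    FORECAST = "forecast"
    BUDGET = "budget"
    PRO_FORMA = "pro forma"
    LTM = "LTM"

@dataclass(frozen=True)
class Figure:
    metric: str
    period: str        # the period the figure describes
    value_musd: float
    basis: Basis       # discipline 1: the basis travels with the number
    as_of: date        # discipline 3: the date the figure was asserted
    source: str

    def label(self) -> str:
        return (f"{self.period} {self.metric}: ${self.value_musd:.1f}M "
                f"({self.basis.value}, per {self.source})")

def select(figures: list[Figure], metric: str, period: str,
           as_of: Optional[date] = None) -> Optional[Figure]:
    """Disciplines 2 and 3: actual first, optionally as seen on a past date."""
    pool = [f for f in figures
            if f.metric == metric and f.period == period
            and (as_of is None or f.as_of <= as_of)]
    if not pool:
        return None
    # An actual outranks any other basis; within a basis, the latest assertion wins.
    return max(pool, key=lambda f: (f.basis is Basis.ACTUAL, f.as_of))

def fingerprint(documents: dict[str, bytes]) -> str:
    """Hash of file names and contents; changes on add, edit, or delete."""
    h = hashlib.sha256()
    for name in sorted(documents):
        h.update(name.encode("utf-8") + b"\0")
        h.update(hashlib.sha256(documents[name]).digest())
    return h.hexdigest()

class CanonicalSeries:
    """Discipline 4: one cached series per deal, recomputed when documents change."""
    def __init__(self, extract: Callable[[dict[str, bytes]], list[Figure]]):
        self._extract = extract
        self._fingerprint: Optional[str] = None
        self._figures: list[Figure] = []
        self.rebuilds = 0

    def figures(self, documents: dict[str, bytes]) -> list[Figure]:
        current = fingerprint(documents)
        if current != self._fingerprint:   # stale or empty cache: recompute
            self._figures = self._extract(documents)
            self._fingerprint = current
            self.rebuilds += 1
        return self._figures

def extract_figures(documents: dict[str, bytes]) -> list[Figure]:
    """Placeholder for real extraction: hand-entered figures, illustrative dates."""
    known = {
        "cim_2023.pdf": Figure("revenue", "FY2025", 52.0, Basis.FORECAST,
                               date(2023, 6, 30), "2023 CIM"),
        "fy2025_financials.pdf": Figure("revenue", "FY2025", 32.0, Basis.ACTUAL,
                                        date(2026, 3, 31), "FY2025 financial statements"),
    }
    return [fig for name, fig in known.items() if name in documents]

data_room = {"cim_2023.pdf": b"cim", "fy2025_financials.pdf": b"financials"}
series = CanonicalSeries(extract_figures)
current = select(series.figures(data_room), "revenue", "FY2025")
at_cim = select(series.figures(data_room), "revenue", "FY2025", as_of=date(2023, 12, 31))
print(current.label())
print(at_cim.label())
```

The default call returns the reported actual with its label attached. The same call with an as-of cutoff in 2023 returns the CIM projection, because the actual did not exist yet. Precedence lives in `select`, where the series is assembled, so no prompt wording can change it.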

What Stays Human

Everything ReturnCatalyst produces is decision-support for professional review, not investment advice, and this is a case where that framing has teeth. The goal is not an AI that decides whether 6.5x or 10x is the real multiple. The goal is that the analyst receives figures that arrive labeled, dated, and sourced to a page, so their review time goes into judgment about the deal rather than forensic accounting on their own tooling.

We also hold these paths to the same standard we described in how we evaluate AI accuracy on real deal documents: strict, figure-level golden answers where the basis label is part of the pass criteria, gating what ships.
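What "the basis label is part of the pass criteria" can mean in code, as a minimal sketch. The record type, the tolerance, and the function name are assumptions made for illustration, not a description of our eval harness.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FigureAnswer:
    value_musd: float
    basis: str  # e.g. "actual", "forecast"

def passes(answer: FigureAnswer, golden: FigureAnswer,
           tolerance_musd: float = 0.05) -> bool:
    """An answer passes only if both the number and its basis label match the golden."""
    value_ok = abs(answer.value_musd - golden.value_musd) <= tolerance_musd
    return value_ok and answer.basis == golden.basis

golden = FigureAnswer(32.0, "actual")  # FY2025 revenue, from the scenario

print(passes(FigureAnswer(32.0, "actual"), golden))    # right number, right basis
print(passes(FigureAnswer(32.0, "forecast"), golden))  # right number, wrong basis
print(passes(FigureAnswer(52.0, "forecast"), golden))  # the projection, honestly labeled
```

Only the first answer passes. The second fails even though the number is correct, which is the point of grading the label alongside the figure.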

If your team screens deals from CIMs, the test is simple. Take a data room where you know a period exists as both plan and actual, and ask the question. See what comes back, and whether it tells you which one it gave you. That is the whole evaluation.

To see how basis-aware extraction works in practice, start with CIM analysis or see how labeled figures flow into AI financial modeling.