Doubling a Coin Flip: How Health Systems Keep Misreading the AI Accuracy Claim

A major academic health system announced this month that its new AI tool for clot-risk screening had doubled the accuracy of the manual questionnaire it replaced. The story traveled the way these stories always travel: fast, positive, and stripped of the one number that matters.

The manual tool, by the system's own description, was right "fifty-ish percent of the time."

Sit with that. The baseline the new model doubled was, functionally, a coin flip. Doubling a coin flip is a genuine improvement. It is also a sentence that should generate questions, not applause.

The accuracy frame is a marketing instrument

I want to be careful here, because this is not a story about a bad tool. A model that reads the chart for deep vein thrombosis risk and drug interactions faster than a questionnaire clinicians rarely complete in full is a real clinical gain. The problem is not the technology. The problem is the language we use to evaluate it.

"Doubled accuracy" is a relative claim with no absolute numbers attached. Doubled from what to what? Taken literally, doubling fifty percent means one hundred, which would be remarkable and almost certainly overstated. If the word is being used loosely, fifty to seventy is meaningful and plausible. Fifty to sixty is real and still leaves a tool wrong four times in ten. The press release does not say, and the absence is the tell.

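The arithmetic is worth doing explicitly. What follows is a minimal Python sketch, assuming "accuracy" means the plain fraction of correct calls and treating the "fifty-ish" baseline as exactly 0.50. The function name `restate_claim` is hypothetical, and the three endpoints are the scenarios discussed above, not published results.

```python
def restate_claim(baseline: float, new: float) -> dict:
    """Restate a before/after accuracy pair in absolute terms.

    Both inputs are fractions of correct calls (an assumed definition).
    """
    if not (0 < baseline <= 1 and 0 <= new <= 1):
        raise ValueError("accuracies must be fractions, with baseline above zero")
    return {
        # 2.0 here is what a literal "doubled" would require
        "relative_gain": round(new / baseline, 2),
        # change in percentage points, the number a board can govern
        "absolute_gain_points": round((new - baseline) * 100, 1),
        # wrong calls per 1,000 patients screened at the new accuracy
        "wrong_per_1000": round((1 - new) * 1000),
    }


BASELINE = 0.50  # assumption: "fifty-ish" read as exactly fifty percent
for new in (1.00, 0.70, 0.60):
    print(f"{BASELINE:.0%} -> {new:.0%}:", restate_claim(BASELINE, new))
```

Only the first scenario is a literal doubling. The other two are relative gains of 1.4x and 1.2x, which is why the absolute pair belongs in the sentence.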
This is the pattern I have watched repeat with every wave of clinical technology for two decades. The capability gets announced in the vocabulary of marketing. The evidence gets evaluated, if it gets evaluated, in the much slower vocabulary of clinical validation. The gap between those two timelines is where patients get hurt and where health systems take on risk they never priced.

The questions that never make the announcement

When a clinical leader is handed an accuracy claim, four questions separate a governance process from a purchasing impulse.

Four questions every accuracy claim should answer

Doubled against what ground truth? An accuracy number means nothing without the reference standard it was measured against. Confirmed imaging results are a different bar than a retrospective chart review, which is a different bar than agreement with the prior tool. If the comparator is weak, the improvement is theater.
Validated in which population? A model trained and tested on one health system's patients does not automatically perform the same on yours. Age distribution, comorbidity burden, and documentation habits all move the numbers. Treat vendor-reported accuracy as the ceiling on what you will see at the bedside, not the floor.
Who reviews the output when the model is wrong? A tool that is right sixty percent of the time is wrong forty percent of the time. That is not a defect; it is a specification. The governance question is whether a named human reviews the flag, on what cadence, with what authority to override.
What is the escalation path when it misses? Every clinical AI tool will miss. The safe systems are not the ones with the best models. They are the ones that wrote down, in advance, what happens when the model is wrong on a Tuesday afternoon and the person who deployed it is out of the building.
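
The first two questions have an arithmetic core: one headline accuracy figure can describe very different tools depending on how common the condition is in the population being screened. The sketch below uses the standard definitions of sensitivity, specificity, and positive predictive value. Every number in it (0.80 sensitivity and specificity, 20 percent versus 2 percent prevalence) is made up for illustration, and `flag_profile` is a hypothetical helper, not any vendor's metric.

```python
def flag_profile(sensitivity: float, specificity: float,
                 prevalence: float, patients: int = 1000) -> dict:
    """Expected screening outcomes for a cohort of the given size."""
    tp = sensitivity * prevalence * patients              # true cases flagged
    fn = (1 - sensitivity) * prevalence * patients        # true cases missed
    fp = (1 - specificity) * (1 - prevalence) * patients  # false alarms
    tn = specificity * (1 - prevalence) * patients        # correctly cleared
    flagged = tp + fp
    return {
        "accuracy": round((tp + tn) / patients, 3),
        # share of flags that are real cases
        "ppv": round(tp / flagged, 3) if flagged else None,
        "missed_cases": round(fn, 1),
    }


# Illustrative only: same tool characteristics, two different populations.
validation_cohort = flag_profile(0.80, 0.80, prevalence=0.20)
local_cohort = flag_profile(0.80, 0.80, prevalence=0.02)
print("validation:", validation_cohort)
print("local:     ", local_cohort)
```

Under these assumed inputs the accuracy figure is identical in both cohorts, while the share of flags that are real cases drops sharply in the lower-prevalence one. That is the sense in which an accuracy number does not transfer between populations on its own.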

A health system leader at CommonSpirit framed the same principle plainly this month: AI's value is scaling the screening work and cutting administrative load, but you cannot leave things entirely up to the model. The human stays in the loop by design, not by accident. That is the correct read, and it is the opposite of "doubled accuracy, deployed, done."

What governance actually looks like here

The fix is not to slow down adoption. It is to put the accuracy claim through a process before it becomes a purchase order.

Require the absolute numbers, not the relative ones. "Fifty to sixty-eight percent on confirmed imaging in a population like ours" is a sentence a board can govern. "Doubled accuracy" is not.
Require the validation population, and ask whether it resembles yours. If it does not, require a local validation step before full deployment, not after.
Name the human in the loop, the review cadence, and the override authority, and put all three in the governance policy, not the vendor's dashboard.
Write the failure protocol before go-live. If you cannot describe what happens when the tool is wrong, you are not ready to turn it on.
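
One way to make the four requirements above hard to skip is to record them as a structured checklist that refuses to read as complete while anything is blank. This is a rough sketch, not a standard: the class name, field names, and blocker wording are all hypothetical, and the 0.50 to 0.68 pair is the illustrative sentence from this section, not a measured result.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class AccuracyClaimReview:
    # Hypothetical fields; each maps to one requirement in the text.
    baseline_accuracy: Optional[float] = None    # absolute "from"
    claimed_accuracy: Optional[float] = None     # absolute "to"
    reference_standard: Optional[str] = None     # e.g. confirmed imaging
    validation_population: Optional[str] = None
    resembles_local_population: Optional[bool] = None
    local_validation_done: bool = False
    named_reviewer: Optional[str] = None         # the human in the loop
    review_cadence: Optional[str] = None
    override_authority: Optional[str] = None
    failure_protocol: Optional[str] = None       # written before go-live

    def blockers(self) -> List[str]:
        """Return every reason this claim is not ready to become a purchase order."""
        found = [f"missing: {f.name}" for f in fields(self)
                 if getattr(self, f.name) is None]
        # A dissimilar validation population requires local validation first.
        if self.resembles_local_population is False and not self.local_validation_done:
            found.append("local validation required before full deployment")
        return found


review = AccuracyClaimReview(
    baseline_accuracy=0.50,
    claimed_accuracy=0.68,
    reference_standard="confirmed imaging",
)
for item in review.blockers():
    print(item)
```

An empty list from `blockers()` is the only state in which the claim would move forward. Where the record lives matters as much as its shape: per the text, it belongs in the governance policy, not the vendor's dashboard.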

None of this is anti-innovation. It is the architecture that lets innovation survive contact with a real patient population and a real malpractice environment. The systems that build it will deploy AI faster in the long run, because they will not have to stop and rebuild trust after the first miss makes the local news.

The boardroom read

The next time an accuracy claim crosses a leadership table, the most valuable person in the room is the one who asks the unglamorous question: from what, to what, measured how, and who checks it when it is wrong.

That question does not generate a press release. It is the difference between a health system that governs its AI and one that is governed by its vendors' marketing.

Clinical AI Governance Assessment

Surgery-trained, currently practicing internal medicine (charity care). Advisory practice focused on clinical AI governance, vendor evaluation, and implementation strategy for health systems and health-tech companies.

The Sarah Matt Briefing