A major academic health system announced this month that its new AI tool for clot-risk screening had doubled the accuracy of the manual questionnaire it replaced. The story traveled the way these stories always travel: fast, positive, and stripped of the one number that matters.
The manual tool, by the system's own description, was right "fifty-ish percent of the time."
Sit with that. The baseline the new model doubled was, functionally, a coin flip. Doubling a coin flip is a genuine improvement. It is also a sentence that should generate questions, not applause.
I want to be careful here, because this is not a story about a bad tool. A model that reads the chart for deep vein thrombosis risk and drug interactions faster than a questionnaire clinicians rarely complete in full is a real clinical gain. The problem is not the technology. The problem is the language we use to evaluate it.
"Doubled accuracy" is a relative claim with no denominator. Doubled from what to what? Fifty percent to one hundred would be remarkable and almost certainly overstated. Fifty to seventy is meaningful and plausible. Fifty to sixty is real and still leaves a tool wrong four times in ten. The press release does not say, and the absence is the tell.
This is the pattern I have watched repeat with every wave of clinical technology for two decades. The capability gets announced in the vocabulary of marketing. The evidence gets evaluated, if it gets evaluated, in the much slower vocabulary of clinical validation. The gap between those two timelines is where patients get hurt and where health systems take on risk they never priced.
"The capability gets announced in the vocabulary of marketing. The evidence gets evaluated in the much slower vocabulary of clinical validation. The gap between those two timelines is where patients get hurt."
When a clinical leader is handed an accuracy claim, there are four questions that separate a governance process from a purchasing impulse.
A health system leader at CommonSpirit framed the same principle plainly this month: AI's value is scaling the screening work and cutting administrative load, but you cannot leave things entirely up to the model. The human stays in the loop by design, not by accident. That is the correct read, and it is the opposite of "doubled accuracy, deployed, done."
The fix is not to slow down adoption. It is to put the accuracy claim through a process before it becomes a purchase order.
None of this is anti-innovation. It is the architecture that lets innovation survive contact with a real patient population and a real malpractice environment. The systems that build it will deploy AI faster in the long run, because they will not have to stop and rebuild trust after the first miss makes the local news.
The next time an accuracy claim crosses a leadership table, the most valuable person in the room is the one who asks the unglamorous question: from what, to what, measured how, and who checks it when it is wrong.
That question does not generate a press release. It is the difference between a health system that governs its AI and one that is governed by its vendors' marketing.
Surgery-trained, currently practicing internal medicine (charity care). Advisory practice focused on clinical AI governance, vendor evaluation, and implementation strategy for health systems and health-tech companies.
Weekly clinical reality on AI, governance, and what actually works inside health systems.