The Science of Confabulation Elimination
Toward Hallucination-Free AI-Generated Clinical Notes
Introduction
Ambient AI has rapidly become ubiquitous in healthcare workflows.
The rapid adoption of this technology speaks to the tangible benefits experienced by clinicians.
In this paper, we focus on a topic that is top-of-mind for adoption of any AI-powered product: “hallucinations.”

With the accelerating adoption of ambient AI, there is now an opportunity to reduce these errors in clinical documentation, even to the point of near-elimination.
Not all unsupported claims (colloquially called “hallucinations”) are created equal.
We first discuss how we characterize and categorize the degree to which claims are supported by context, and the severity of unsupported claims. Many claims sit at the boundary of “reasonable inferences,” since clinical conversations can leave room for interpretation.
We then reveal the inner workings of our note-generation pipeline to discuss how we approach the challenge of ensuring that every claim is faithful to clinical context. We focus in particular on the guardrails we have developed, supported by over one thousand hours of human validation, to automatically revise AI-generated “first drafts” in real time, preemptively detecting and correcting unsupported claims before a clinician reviews their draft documentation.
Assessing Factuality of AI-Generated Documentation





When considering unsupported claims, it is essential to provide clear definitions to distinguish among distinct phenomena. Discussing “hallucinations” in black-and-white terms can be misleading and uninformative, especially when conducting fine-grained evaluations of systems, and developing auxiliary AI models that can precisely detect and correct unsupported claims.
For example, text models often draw inferences that are plausible given the context but never explicitly stated: Consider an encounter where the conversation included a discussion of continuing a patient's dose of metformin and monitoring HbA1c levels; the generated note then references “diabetes,” even though the word “diabetes” is never mentioned in the conversation.


Drawing a precise line between reasonable inference and unsubstantiated extrapolation can be complicated and subjective; for example, not all clinicians may agree on whether a particular inference is reasonable in context.
Yet, some cases are clearly incorrect: Consider a claim that is factually contradicted by the conversation, with implications for care (e.g., “patient denies chest pain” when the patient did in fact mention their chest pain). These cases are unambiguously problematic and warrant more severe concern.
With these distinctions in mind, we present in this document our current internal guidelines for categorizing unsupported claims.
Support: This axis assesses whether the content is explicitly affirmed by the transcript, contradicted by the transcript, or somewhere in between.
Severity: This axis evaluates the gravity of including unsupported claims in the clinical note, and it applies only when such claims are present.
For each factual claim in the generated note, we assign one of the following categories to reflect the extent to which the claim is supported by the transcript of the conversation.
Note that these categories are independent of clinical severity, which is assessed in the “severity” axis.





Not all factual errors are equally severe. Unsupported claims that directly impact patient care (e.g., treatment decisions and future diagnostic reasoning) are most severe, particularly when they are subtle and hard to detect (e.g., a slight error in dosage or frequency of a prescription). Other errors that do not directly impact patient care, or which are easily recognized, may be considered less severe.
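To make these two axes concrete, the sketch below shows one way the assessment of a single claim could be represented in code. The category names, severity levels, and fields are illustrative assumptions, not Abridge's internal labels.

```python
from dataclasses import dataclass
from enum import Enum


class SupportLevel(Enum):
    """How strongly the transcript supports a claim (illustrative labels)."""
    EXPLICITLY_SUPPORTED = "explicitly_supported"  # directly affirmed by the transcript
    REASONABLE_INFERENCE = "reasonable_inference"  # plausible in context, but never stated
    UNSUPPORTED = "unsupported"                    # no basis in the transcript
    CONTRADICTED = "contradicted"                  # directly contradicted by the transcript


class Severity(Enum):
    """Clinical gravity of an unsupported claim (illustrative levels)."""
    LOW = "low"            # easily recognized, no direct impact on care
    MODERATE = "moderate"  # could mislead, but unlikely to change care
    HIGH = "high"          # subtle and directly affects treatment or diagnostic reasoning


@dataclass
class ClaimAssessment:
    """Assessment of a single factual claim extracted from a drafted note."""
    claim_text: str
    support: SupportLevel
    severity: Severity | None  # assigned only when the claim is not fully supported
    reasoning: str             # free-text justification for the assessment


example = ClaimAssessment(
    claim_text="Patient denies chest pain.",
    support=SupportLevel.CONTRADICTED,
    severity=Severity.HIGH,
    reasoning="The patient explicitly reported chest pain during the encounter.",
)
```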



Generating a high-quality note requires good context: the transcript and any relevant information from the EHR.
In our previous whitepaper, Pioneering the Science of AI Evaluation, we described in depth how we ensure that our automatic speech recognition (ASR) system generates transcripts that are faithful to the original conversation, far exceeding performance of off-the-shelf models.
Now, we focus on the second component of our system, which goes from a transcript to a draft of the clinical documentation. Just like our ASR system, our note-generation system consists of multiple models. Once a “first draft” of clinical documentation is produced, our system conducts further automatic refinements to convert this “first draft” into a “final draft” for the clinician to review. The system we describe below is one core part of our note generation and refinement pipeline, and we continue to iterate and improve upon it.
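As a rough outline of that flow, the sketch below traces the stages from recorded audio to a clinician-ready draft. The function names are placeholders chosen for illustration; they are not Abridge's actual APIs, and each stage stands in for one or more of the models described in this paper.

```python
def transcribe(encounter_audio: bytes) -> str:
    """Placeholder for the ASR system: audio of the encounter -> faithful transcript."""
    raise NotImplementedError("stands in for the ASR models")


def generate_first_draft(transcript: str, ehr_context: str) -> str:
    """Placeholder for note generation: transcript + relevant EHR context -> first draft."""
    raise NotImplementedError("stands in for the note-generation models")


def refine_draft(first_draft: str, transcript: str, ehr_context: str) -> str:
    """Placeholder for the guardrail stage: detect and correct unsupported claims."""
    raise NotImplementedError("stands in for the guardrail system described below")


def draft_for_clinician_review(encounter_audio: bytes, ehr_context: str) -> str:
    """End-to-end flow; the returned final draft is still reviewed and edited by the clinician."""
    transcript = transcribe(encounter_audio)
    first_draft = generate_first_draft(transcript, ehr_context)
    return refine_draft(first_draft, transcript, ehr_context)
```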
The Abridge system for ensuring accuracy in AI-drafted documentation has two components: a detection model that identifies and assesses unsupported claims in the drafted note, and a correction system that automatically revises those claims so the note remains faithful to the clinical context.





An important component of our guardrail system is an in-house, task-specific language model trained on a curated set of over 50,000 training examples. Our training data combine open-source datasets for general-domain hallucination detection with domain-specific examples (transcripts, EHR context, and example notes) that cover the breadth of scenarios where unsupported claims can arise in practice. These training examples are informed by our hands-on experience (both observations and feedback) building models for real-world clinical applications. This model is designed to detect and assess the level of support for each claim in the note, and to produce detailed reasoning for its assessment.
Once a claim is detected and assessed as unsupported, we correct it using another automated system. This system examines the identified claim and the associated reasoning from the detection system to determine the best way to resolve the confabulation, choosing among several possible resolutions.
For example, the reasoning accompanying a flagged claim might read: “While the patient initially mentioned Prozac, they later corrected themselves and clarified that they have been taking Lexapro instead.”
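To make the detect-then-correct loop concrete, here is a minimal sketch of how the two guardrail components might be wired together. The data structures, function names, and resolution options are assumptions chosen for illustration, not Abridge's actual implementation.

```python
from dataclasses import dataclass
from enum import Enum


class Resolution(Enum):
    """Illustrative ways a flagged claim might be resolved (assumed options)."""
    REWRITE = "rewrite"                  # revise the claim so it matches the transcript
    REMOVE = "remove"                    # drop the claim entirely
    FLAG_FOR_REVIEW = "flag_for_review"  # keep the text, but surface it to the clinician


@dataclass
class Detection:
    """Output of the detection model for one claim in the draft note."""
    claim_text: str
    supported: bool
    reasoning: str  # e.g., "The patient corrected themselves: Lexapro, not Prozac."


def detect_claims(draft_note: str, transcript: str) -> list[Detection]:
    """Placeholder for the task-specific detection model."""
    raise NotImplementedError("stands in for the trained detection model")


def correct_claim(detection: Detection, transcript: str) -> tuple[Resolution, str]:
    """Placeholder for the correction system: uses the claim and the detector's
    reasoning to choose a resolution and produce replacement text."""
    raise NotImplementedError("stands in for the correction system")


def apply_guardrails(draft_note: str, transcript: str) -> str:
    """Run the detect-then-correct loop over a first-draft note."""
    revised_note = draft_note
    for detection in detect_claims(revised_note, transcript):
        if detection.supported:
            continue
        resolution, replacement = correct_claim(detection, transcript)
        if resolution is Resolution.REMOVE:
            revised_note = revised_note.replace(detection.claim_text, "")
        elif resolution is Resolution.REWRITE:
            revised_note = revised_note.replace(detection.claim_text, replacement)
        # FLAG_FOR_REVIEW leaves the text unchanged and relies on clinician review.
    return revised_note
```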
In addition to system-level automation, we also believe in clinician review. All clinical notes generated by Abridge are reviewed and edited (if necessary) by the clinicians who conducted the patient encounter before being entered into the EHR. No guardrail is perfect, and clinical conversations can contain legitimate sources of ambiguity: for instance, cases where the “correct” note relies on clinical judgments that were not verbalized during the encounter.
To make clinician review as efficient as possible, every user has access to proprietary Abridge capabilities developed as part of our foundational commitment to trust and verification of AI-generated outputs (including Linked Evidence, which enables clinicians to see the evidence behind each AI-generated summary). These capabilities empower clinicians to perform their own review and validation of the clinical document.
The combination of AI guardrails with clinician review ensures factual accuracy for Abridge notes entered into the EHR.
The automated components of our system—catching and correcting unsupported claims in “first draft” clinical documentation—are backed by careful curation of challenging internal benchmarks as well as over a thousand hours of annotation and validation by board-certified physicians. For instance, one of our internal benchmark datasets contains over 10,000 realistic clinical encounters, including examples of realistic claims in clinical notes as well as modified versions of those claims that are unsupported to varying degrees.
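As an illustration of what such a benchmark record could contain, the sketch below pairs a realistic, supported claim with a perturbed, unsupported version of the same claim. The schema and example values are hypothetical, not the actual structure or contents of our internal datasets.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkExample:
    """One hypothetical record in a confabulation-detection benchmark."""
    transcript: str     # (de-identified) encounter transcript or excerpt
    claim: str          # claim as it appears in the note under test
    is_supported: bool  # gold label used to score a detector's prediction
    perturbation: str   # how an originally supported claim was modified, if at all


supported = BenchmarkExample(
    transcript="...let's keep the lisinopril at 10 milligrams once a day...",
    claim="Continue lisinopril 10 mg daily.",
    is_supported=True,
    perturbation="none",
)

unsupported = BenchmarkExample(
    transcript="...let's keep the lisinopril at 10 milligrams once a day...",
    claim="Continue lisinopril 20 mg daily.",
    is_supported=False,
    perturbation="dose changed from 10 mg to 20 mg",
)
```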
These internal datasets also allow us to compare our system against standard off-the-shelf solutions. On the benchmark discussed above, our technology outperforms GPT-4o by a wide margin, catching 97% of the confabulations, while GPT-4o catches only 82%.
In other words, a standard off-the-shelf model misses six times as many confabulations as the Abridge system. These results highlight the value of a purpose-built model trained on carefully curated, realistic data for this task.
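The “six times” figure follows directly from the miss rates implied by those catch rates, as the short calculation below shows.

```python
abridge_catch_rate = 0.97  # share of confabulations caught by the Abridge guardrails
gpt4o_catch_rate = 0.82    # share caught by off-the-shelf GPT-4o on the same benchmark

abridge_miss_rate = 1 - abridge_catch_rate  # 3% of confabulations missed
gpt4o_miss_rate = 1 - gpt4o_catch_rate      # 18% of confabulations missed

print(round(gpt4o_miss_rate / abridge_miss_rate, 1))  # 6.0 -> six times as many misses
```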
Conclusion
Our vision for ambient AI is not just about faster and easier documentation: it is also an opportunity to make documentation more accurate.
A 2024 review of 48 research studies that examined medical records for errors found that errors were present in 47 of the 48 studies, suggesting that errors in medical documentation are common. These mistakes are not harmless; from the review: “Various studies have shown that poor documentation contributes to medical errors, malpractice claims, and even patient mortality.”
We launched Abridge in 2018 to power deeper understanding in healthcare. Understanding must start with accuracy, and trust to deliver accuracy must be earned—over and over again.

That’s why it’s imperative that the deployment of ambient AI does not just maintain the status quo, but actually improves the accuracy of documentation—both to engender trust from clinicians, patients, and other stakeholders, as well as to ensure that we introduce meaningful progress when we add new technology to workflows.
By sharing how we understand and address the problem of “hallucinations,” we hope to set a higher standard for transparency and trust.
We also hope to spark a broader conversation on measuring progress toward the goal of perfect accuracy: Not all “hallucinations” are the same, and understanding the subtle-but-important differences in type and severity is crucial to our long-term aim of reducing inaccuracies in clinical documentation to near zero. As discussed above, our current system is equipped with purpose-built guardrails that already outperform standard off-the-shelf solutions. Our ambition is to continue to measure, adjust, and build AI technology that draws us ever closer to clinical notes free of documentation errors.