The Science of Confabulation Elimination
Toward Hallucination-Free AI-Generated Clinical Notes
Introduction
Ambient AI has rapidly become ubiquitous in healthcare workflows.
The rapid adoption of this technology speaks to the tangible benefits experienced by clinicians.
In this paper, we focus on a topic that is top-of-mind for adoption of any AI-powered product: “hallucinations.”

With the accelerating adoption of ambient AI, there is now an opportunity to reduce these errors in clinical documentation, even to the point of near-elimination.
Not all unsupported claims (colloquially called “hallucinations”) are created equal.
We first discuss how we characterize and categorize the degree to which claims are supported by context, and the severity of unsupported claims. Many claims sit at the boundary of “reasonable inferences,” since clinical conversations can leave room for interpretation.
We then reveal the inner workings of our note-generation pipeline to discuss how we approach the challenge of ensuring that every claim is faithful to clinical context. We focus in particular on the guardrails we have developed, supported by over one thousand hours of human validation, to automatically revise AI-generated “first drafts” in real time, preemptively detecting and correcting unsupported claims before a clinician reviews their draft documentation.
Assessing Factuality of AI-Generated Documentation





When considering unsupported claims, it is essential to provide clear definitions to distinguish among distinct phenomena. Discussing “hallucinations” in black-and-white terms can be misleading and uninformative, especially when conducting fine-grained evaluations of systems, and developing auxiliary AI models that can precisely detect and correct unsupported claims.
For example, text models often draw inferences that are plausible given the context but never explicitly stated: Consider an encounter where the conversation included a discussion of continuing a patient's dose of metformin and monitoring HbA1c levels; the generated note then references “diabetes,” even though the word “diabetes” is never mentioned in the conversation.


Drawing a precise line between reasonable inference and unsubstantiated extrapolation can be complicated and subjective; for example, not all clinicians may agree on whether a particular inference is reasonable in context.
Yet, some cases are clearly incorrect: Consider a claim that is factually contradicted by the conversation, with implications for care (e.g., “patient denies chest pain” when the patient did in fact mention their chest pain). These cases are unambiguously problematic and warrant more severe concern.
With these distinctions in mind, we present in this document our current internal guidelines for categorizing unsupported claims.
Support: This axis assesses whether the content is explicitly affirmed by the transcript, contradicted by the transcript, or somewhere in between.
Severity: This axis evaluates the gravity of including unsupported claims in the clinical note, and it applies only when such claims are present.
For each factual claim in the generated note, we assign one of the following categories to reflect the extent to which the claim is supported by the transcript of the conversation.
Note that these categories are independent of clinical severity, which is assessed in the “severity” axis.





Not all factual errors are equally severe. Unsupported claims that directly impact patient care (e.g., treatment decisions and future diagnostic reasoning) are most severe, particularly when they are subtle and hard to detect (e.g., a slight error in dosage or frequency of a prescription). Other errors that do not directly impact patient care, or which are easily recognized, may be considered less severe.
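To make these two axes concrete, the sketch below shows one way the assessment of a single claim could be represented in code. The category names, severity levels, and fields are illustrative assumptions, not Abridge's internal labels.

```python
from dataclasses import dataclass
from enum import Enum


class SupportLevel(Enum):
    """How strongly the transcript supports a claim (illustrative labels)."""
    EXPLICITLY_SUPPORTED = "explicitly_supported"  # directly affirmed by the transcript
    REASONABLE_INFERENCE = "reasonable_inference"  # plausible in context, but never stated
    UNSUPPORTED = "unsupported"                    # no basis in the transcript
    CONTRADICTED = "contradicted"                  # directly contradicted by the transcript


class Severity(Enum):
    """Clinical gravity of an unsupported claim (illustrative levels)."""
    LOW = "low"            # easily recognized, no direct impact on care
    MODERATE = "moderate"  # could mislead, but unlikely to change care
    HIGH = "high"          # subtle and directly affects treatment or diagnostic reasoning


@dataclass
class ClaimAssessment:
    """Assessment of a single factual claim extracted from a drafted note."""
    claim_text: str
    support: SupportLevel
    severity: Severity | None  # assigned only when the claim is not fully supported
    reasoning: str             # free-text justification for the assessment


example = ClaimAssessment(
    claim_text="Patient denies chest pain.",
    support=SupportLevel.CONTRADICTED,
    severity=Severity.HIGH,
    reasoning="The patient explicitly reported chest pain during the encounter.",
)
```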



Generating a high-quality note requires good context: the transcript and any relevant information from the EHR.
In our previous whitepaper, Pioneering the Science of AI Evaluation, we described in depth how we ensure that our automatic speech recognition (ASR) system generates transcripts that are faithful to the original conversation, far exceeding performance of off-the-shelf models.
Now, we focus on the second component of our system, which goes from a transcript to a draft of the clinical documentation. Just like our ASR system, our note-generation system consists of multiple models. Once a “first draft” of clinical documentation is produced, our system conducts further automatic refinements to convert this “first draft” into a “final draft” for the clinician to review. The system we describe below is one core part of our note generation and refinement pipeline, and we continue to iterate and improve upon it.
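As a rough outline of that flow, the sketch below traces the stages from recorded audio to a clinician-ready draft. The function names are placeholders chosen for illustration; they are not Abridge's actual APIs, and each stage stands in for one or more of the models described in this paper.

```python
def transcribe(encounter_audio: bytes) -> str:
    """Placeholder for the ASR system: audio of the encounter -> faithful transcript."""
    raise NotImplementedError("stands in for the ASR models")


def generate_first_draft(transcript: str, ehr_context: str) -> str:
    """Placeholder for note generation: transcript + relevant EHR context -> first draft."""
    raise NotImplementedError("stands in for the note-generation models")


def refine_draft(first_draft: str, transcript: str, ehr_context: str) -> str:
    """Placeholder for the guardrail stage: detect and correct unsupported claims."""
    raise NotImplementedError("stands in for the guardrail system described below")


def draft_for_clinician_review(encounter_audio: bytes, ehr_context: str) -> str:
    """End-to-end flow; the returned final draft is still reviewed and edited by the clinician."""
    transcript = transcribe(encounter_audio)
    first_draft = generate_first_draft(transcript, ehr_context)
    return refine_draft(first_draft, transcript, ehr_context)
```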
The Abridge system for ensuring accuracy in AI-drafted documentation has two components: a detection model that identifies and assesses unsupported claims in the drafted note, and a correction system that automatically revises those claims so the note remains faithful to the clinical context.





An important component of our guardrail system is an in-house, task-specific language model trained on a curated set of over 50,000 training examples. Our training data combine open-source datasets for general-domain hallucination detection with domain-specific examples (transcripts, EHR context, and example notes) that cover the breadth of scenarios where unsupported claims can arise in practice. These training examples are informed by our hands-on experience (both observations and feedback) building models for real-world clinical applications. This model is designed to detect and assess the level of support for each claim in the note, and to produce detailed reasoning for its assessment.
Once a claim is detected and assessed as unsupported, we correct it using another automated system. This system examines the identified claim and the associated reasoning from the detection system to determine the best way to resolve the confabulation, choosing among several possible resolutions.
For example, the reasoning accompanying a flagged claim might read: “While the patient initially mentioned Prozac, they later corrected themselves and clarified that they have been taking Lexapro instead.”
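To make the detect-then-correct loop concrete, here is a minimal sketch of how the two guardrail components might be wired together. The data structures, function names, and resolution options are assumptions chosen for illustration, not Abridge's actual implementation.

```python
from dataclasses import dataclass
from enum import Enum


class Resolution(Enum):
    """Illustrative ways a flagged claim might be resolved (assumed options)."""
    REWRITE = "rewrite"                  # revise the claim so it matches the transcript
    REMOVE = "remove"                    # drop the claim entirely
    FLAG_FOR_REVIEW = "flag_for_review"  # keep the text, but surface it to the clinician


@dataclass
class Detection:
    """Output of the detection model for one claim in the draft note."""
    claim_text: str
    supported: bool
    reasoning: str  # e.g., "The patient corrected themselves: Lexapro, not Prozac."


def detect_claims(draft_note: str, transcript: str) -> list[Detection]:
    """Placeholder for the task-specific detection model."""
    raise NotImplementedError("stands in for the trained detection model")


def correct_claim(detection: Detection, transcript: str) -> tuple[Resolution, str]:
    """Placeholder for the correction system: uses the claim and the detector's
    reasoning to choose a resolution and produce replacement text."""
    raise NotImplementedError("stands in for the correction system")


def apply_guardrails(draft_note: str, transcript: str) -> str:
    """Run the detect-then-correct loop over a first-draft note."""
    revised_note = draft_note
    for detection in detect_claims(revised_note, transcript):
        if detection.supported:
            continue
        resolution, replacement = correct_claim(detection, transcript)
        if resolution is Resolution.REMOVE:
            revised_note = revised_note.replace(detection.claim_text, "")
        elif resolution is Resolution.REWRITE:
            revised_note = revised_note.replace(detection.claim_text, replacement)
        # FLAG_FOR_REVIEW leaves the text unchanged and relies on clinician review.
    return revised_note
```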
In addition to system-level automation, we also believe in clinician review. All clinical notes generated by Abridge are reviewed and edited (if necessary) by the clinicians who conducted the patient encounter before being entered into the EHR. No guardrail is perfect, and clinical conversations can contain legitimate sources of ambiguity: for instance, cases where the “correct” note relies on clinical judgments that were not verbalized during the encounter.
To make clinician review as efficient as possible, every user has access to proprietary Abridge capabilities developed as part of our foundational commitment to trust and verification of AI-generated outputs (including Linked Evidence, which enables clinicians to see the evidence behind each AI-generated summary). These capabilities empower clinicians to perform their own review and validation of the clinical document.
The combination of AI guardrails with clinician review ensures factual accuracy for Abridge notes entered into the EHR.
The automated components of our system—catching and correcting unsupported claims in “first draft” clinical documentation—are backed by careful curation of challenging internal benchmarks as well as over a thousand hours of annotation and validation by board-certified physicians. For instance, one of our internal benchmark datasets contains over 10,000 realistic clinical encounters, including examples of realistic claims in clinical notes as well as modified versions of those claims that are unsupported to varying degrees.
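As an illustration of what such a benchmark record could contain, the sketch below pairs a realistic, supported claim with a perturbed, unsupported version of the same claim. The schema and example values are hypothetical, not the actual structure or contents of our internal datasets.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkExample:
    """One hypothetical record in a confabulation-detection benchmark."""
    transcript: str     # (de-identified) encounter transcript or excerpt
    claim: str          # claim as it appears in the note under test
    is_supported: bool  # gold label used to score a detector's prediction
    perturbation: str   # how an originally supported claim was modified, if at all


supported = BenchmarkExample(
    transcript="...let's keep the lisinopril at 10 milligrams once a day...",
    claim="Continue lisinopril 10 mg daily.",
    is_supported=True,
    perturbation="none",
)

unsupported = BenchmarkExample(
    transcript="...let's keep the lisinopril at 10 milligrams once a day...",
    claim="Continue lisinopril 20 mg daily.",
    is_supported=False,
    perturbation="dose changed from 10 mg to 20 mg",
)
```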
These internal datasets also allow us to compare our system against standard off-the-shelf solutions. On the benchmark discussed above, our technology outperforms GPT-4o by a wide margin, catching 97% of the confabulations, while GPT-4o catches only 82%.
In other words, a standard off-the-shelf model misses six times as many confabulations as the Abridge system. These results highlight the value of a purpose-built model trained on carefully curated, realistic data for this task.
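The “six times” figure follows directly from the miss rates implied by those catch rates, as the short calculation below shows.

```python
abridge_catch_rate = 0.97  # share of confabulations caught by the Abridge guardrails
gpt4o_catch_rate = 0.82    # share caught by off-the-shelf GPT-4o on the same benchmark

abridge_miss_rate = 1 - abridge_catch_rate  # 3% of confabulations missed
gpt4o_miss_rate = 1 - gpt4o_catch_rate      # 18% of confabulations missed

print(round(gpt4o_miss_rate / abridge_miss_rate, 1))  # 6.0 -> six times as many misses
```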
Conclusion
Our vision for ambient AI is not just about faster and easier documentation: it is also an opportunity to make documentation more accurate.
A 2024 review of 48 research studies that examined medical records for errors found that errors were present in 47 of the 48 studies, suggesting that errors in medical documentation are common. These mistakes are not harmless; from the review: “Various studies have shown that poor documentation contributes to medical errors, malpractice claims, and even patient mortality.”
We launched Abridge in 2018 to power deeper understanding in healthcare. Understanding must start with accuracy, and trust to deliver accuracy must be earned—over and over again.

That’s why it’s imperative that the deployment of ambient AI does not just maintain the status quo, but actually improves the accuracy of documentation—both to engender trust from clinicians, patients, and other stakeholders, as well as to ensure that we introduce meaningful progress when we add new technology to workflows.
By sharing how we understand and address the problem of “hallucinations,” we hope to set a higher standard for transparency and trust.
We also hope to spark a broader conversation on measuring progress toward the goal of perfect accuracy: Not all “hallucinations” are the same, and understanding the subtle-but-important differences in type and severity is crucial to our long-term aim of reducing inaccuracies in clinical documentation to near zero. As discussed above, our current system is equipped with purpose-built guardrails that already outperform standard off-the-shelf solutions. Our ambition is to continue to measure, adjust, and build AI technology that draws us ever closer to clinical notes free of documentation errors.