Whitepaper

Pioneering the Science of AI Evaluation

Written by

Michael Oberst, Davis Liang, Zachary C. Lipton

Published

September 19, 2024

Last updated

August 7, 2025

Introduction

quality and monitoring

performance

Conclusion

"Pajama time" is now commonplace across all clinical specialties

2 hours/day

time spent after clinical hours to catch up on clinical documentation

63%

of physicians report burnout

Ambient documentation systems (or “AI scribes”) reduce documentation burden by drafting clinical notes based on raw audio of the clinical encounter. Clinicians edit and verify these notes before signing them in the Electronic Health Record (EHR). However, evaluation of these systems, and the quality of the documentation they draft, is complicated by the free-form nature of generated text and the various uses of clinical documentation.

This whitepaper describes our process for evaluating AI-generated documentation at Abridge, quantitative performance measures for many of our key systems, and our process for conducting holistic clinician-in-the-loop studies prior to green-lighting updated AI components into production. These evaluations have informed the development and deployment of our AI systems; systems that already process more than a million clinical encounters every week, across 150+ health systems. By sharing our approach, we hope to foster greater transparency and facilitate dialogue among stakeholders in this fast-moving space.

Executive summary

Commitment
and expertise

As a clinician-led company with roots in academia, we are committed to building trustworthy AI systems, and have the expertise to do so.

Quality and
monitoring process

For assessing the quality of complex free-text documents, human judgment remains the gold standard, and our evaluation process reflects the need to keep humans in the loop.

Performance

Our systems already outperform off-the-shelf clinical and open models. In this report, we provide some deep dives on particular aspects of performance we have found critical to having an impact with our health system partners.

Looking ahead

The state of evaluation in generative AI remains a rapidly developing area with novel ideas arising every week.

quality and monitoring

At the highest level, we can think of our clinical documentation engine as being comprised of two primary components: a world-class medically tailored speech recognition system and a note-generation system that transforms raw transcripts into drafted clinical notes.

Automated Speech Recognition (ASR)

The first component is our Automated Speech Recognition (ASR) system, which takes raw clinical audio and produces a transcript of the encounter.

Note Generation

The second component is our note-generation system which uses the transcript to produce a draft of the clinical documentation.

Each component uses a variety of underlying models. These are evaluated both individually and in an end-to-end fashion. We take a deliberate approach to evaluating and releasing upgrades to our core components:

quality and monitoring

Model development guided by automated metrics and clinician spot-checks

Automated metrics as a screening tool for model development

We rely on many automated metrics to guide early model development, using a large internal benchmark dataset containing clinical audio, gold standard transcripts, human-written reference notes, and rich metadata on patient characteristics.

For our automatic speech recognition system, we use canonical metrics, including word error rate and medically tailored metrics (e.g., recall of medical terms), alongside more targeted analyses (e.g., capture of newly minted medication names). For our note-generation system, we compute automated metrics of quality that compare AI-generated and human-written summaries, including variants of precision and recall (e.g., of medical concepts), which serve as proxies for factuality and completeness. Throughout, we perform stratified analyses to assess performance across diverse patient subpopulations.

Clinician-driven spot-checks

Throughout model development, Abridge clinicians spot-check notes generated on a curated set of encounters covering a range of clinical scenarios, providing both a coarse signal on difficult-to-measure aspects of quality and helping our modeling team develop intuition. This feedback is particularly important when a new model upgrade is meant to address a more subjective concern, such as incorporating certain stylistic preferences into note generation.

quality and monitoring

Validation via blinded head-to-head trials Adjudicated by licensed clinicians

While automated metrics and clinician spot-checks are helpful to guide development of upgrades to our overall system, these metrics capture only a portion of the many salient dimensions of quality. Certifying a new system for deployment requires more than informal spot-checks.

Before deploying models, we perform blinded, head-to-head evaluations with licensed clinicians as evaluators.

To this end, we developed a software platform that presents notes (one from the current system, the other from candidate systems) side-by-side, with clinical reviewers blinded to the system that authored each.

To allow for early stopping when the results are conclusive, we use anytime-valid sequential hypothesis testing¹, providing guarantees on the false positive rate that hold regardless of how long we run the trial. This approach provides strong statistical evidence for whether a proposed system is better (or worse) than its predecessor.

1. Ramdas A. Foundations of Large-Scale Sequential Experimentation. In: KDD '19: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. doi:10.1145/3292500.3332282

quality and monitoring

Staged release process

Even once our models survive a rigorous blinded test of clinician judgment, we take a careful staged-release approach to productionizing our models, verifying performance in vivo on selected cohorts prior to approving a broader release.

The first stage is an “alpha” release with limited rollouts to specially trained early adopters. These clinicians are in frequent contact with Abridge staff and are trained to be especially vigilant about editing and comprehensive in their feedback. Only once we are satisfied that these users are better served by the improved model do we consider rolling out to wider audiences.

At every stage of release, including full deployment, we continually collect both active feedback in the form of comments and star ratings, and passive feedback in the form of edits required to finalize the note, similar to the ongoing metrics we track during deployment as discussed below.

quality and monitoring

Ongoing post-deployment monitoring

Quantitative feedback at scale to track performance

First, we capture edits made

to the AI-generated note

before they are finalized shared

‍in the medical record.

Clinician-generated edits are inherently scalable as a feedback mechanism, as editing is already a natural workflow. These edits provide us with a rich signal for modeling purposes and enable the computation of high-level metrics to gauge how much editing is required. In addition, users have the option to provide quantitative ratings of note quality within the note-editing interface.

These scalable quantitative metrics not only allow us to track ongoing real-world performance, but also provide us with sample sizes required to assess performance across diverse patient and provider subpopulations. Stratified evaluations require high-quality metadata (e.g., on patient demographics) that are not reliably captured in conversations, but can in many cases be pulled (with varying degrees of accuracy²) via integration with the EHR.

We also develop model-driven proxies for meta-data that may not be captured in the medical record. For example, Abridge documents visits in a variety of languages, including visits that span multiple languages. Performing relevant language-stratified analyses requires that we can accurately identify which languages are spoken in each encounter.

2. Johnson JA, Moore B, Hwang EK, Hickner A, Yeo H. The accuracy of race & ethnicity data in US based healthcare databases: A systematic review. The American Journal of Surgery. 10:463-70. doi:10.1016/j.amjsurg.2023.05.011.

Qualitative feedback for uncovering blind spots

Clinicians frequently provide free-text feedback, which plays an equally important role. Given the diversity of clinical encounters and open-endedness of note-drafting, blind spots are inevitable. Ensuring that concerns expressed in qualitative feedback are heard by model developers is essential for finding these blind spots and developing new tests to catch them in the future.

To illustrate, consider the following real-world examples of issues that have been identified through open-ended feedback:

New medications are continuously being released

Does the documentation correctly spell medications that the models have not encountered previously?

Many visits include multiple speakers

Given an infant patient, a parent, and a clinician, does the documentation appropriately distinguish between the parent and the (non-speaking) patient?

Clinicians and patients do not always agree

If a patient misstates their own diagnosis, and a clinician corrects them, is the conclusion appropriately reflected in the final note?

Customer feedback helps to identify these types of problems, and tracking of problem categories provides another mechanism for evaluating the efficacy of deployed patches.

Qualitative feedback is also useful for eliciting stylistic preferences—one clinician may prefer that certain elements appear in the History of Present Illness (HPI) section, while another may prefer them in the Assessment & Plan section.

Audits via dedicated tools

Ongoing audits of note quality are essential for building and maintaining trust. However, audits can be inefficient—to evaluate the appropriateness of a single note sentence, an evaluator must parse a long transcript of a conversation that they have never seen before. To aid in efficient review, we outfit all of our note viewers with a tool for pinpointing the relevant transcript excerpt.

This same feature is available to our clinician users, helping them to verify the output of our system during the editing process.

Highlighting a section of the generated note (left) surfaces relevant sections of the transcript (right), streamlining the process of validating factuality.

Note viewer for evaluating an individual clinical note, with the associated transcript. Highlighting a section of the generated note (left) surfaces relevant sections of the transcript (right), streamlining the process of validating factuality.

Conclusion

Evaluation of AI-driven ambient documentation is an evolving challenge, requiring more than simply determining the right set of metrics, or doing a single evaluation at a given point in time.

Equally important is the process by which human feedback is incorporated into model development, auditing, and the construction of novel evaluation criteria and test cases.

Performance

In this section we provide selected deep dives on particular aspects of performance we have found critical to having an impact with our health system partners. While these examples do not cover every dimension of performance that we monitor, each example illustrates how targeted evaluation helps drive clinically meaningful improvements both over current off-the-shelf models, and in our own system over time.

3.1

Measuring the clinical fidelity of automatic speech recognition

3.2

Assessing and improving performance across multilingual settings

Performance

Measuring the clinical fidelity of automatic speech recognition

One of the core components of our product is our Automatic Speech Recognition (ASR) system, which converts patient-clinician conversations into a transcript, which in turn, constitutes the key input to our note-generation system.

Accurate transcription is critical to ensuring the quality of the final output, as well as an accurate record of the conversation that the clinician can reference later.

Our ASR models take the audio from a conversation as input, and produce several secondary outputs in addition to the transcript, including alignment of text to audio timestamps, and assignment of phrases to the relevant speaker (a process known as “diarization”). Here, we focus on our evaluation of the main output, the transcript itself.

Curating diverse data for evaluation

In order to appropriately evaluate our transcription model, we need diverse data on clinical conversations, including audio and gold standard transcripts. To that end, we use several benchmark datasets, including:

Over 10,000 hours of medical conversations

Internal (non-public) datasets of patient-clinician encounters

Over 10,000 hours of clinical conversations with associated audio, gold-standard reference transcripts, and human annotations.

Curated “challenge datasets''

Designed by our team to pose difficulties and probe weaknesses of clinical ASR models, including clinical conversations laden with new medication names.

Standard research benchmark datasets

Such as the Librispeech benchmark dataset.

Over 10,000 hours of clinical conversations with associated audio, gold standard reference transcripts, and human annotations.

Designed by our team to pose difficulties and probe weaknesses of clinical ASR models, including clinical conversations laden with new medication names.

Such as the Librispeech benchmark dataset.

PRIMER

Metrics for assessing quality of clinical transcription

To evaluate the quality of a transcript produced by an ASR system, we typically compare it to a gold standard reference transcript, written by a human listener.

TWO IMPORTANT METRICS ARE:

WORD ERROR RATE

Word error rate (WER) is calculated as the minimum number of word-level edits (substitutions, deletions, and insertions) to transform the reference transcript into the generated transcript, divided by the length of the reference transcript.

This calculation is illustrated in Example 1, where the transcript missed three words, and misunderstood one word (taken vs taking). Here, the minimum number of edits to correct the transcript is four: Adding back the three missing words, and correcting the misunderstood word. WER is calculated by dividing the number of edits by the length of the reference transcript, which contains 24 words, for a WER of 4 / 24 = 0.167.

Number of edits

Length of reference transcript

Word error rate

0.167

However, as shown in the example above, not all errors are created equal. The patient mentions feeling pain in their back and the use of Advil. In general, correctly transcribing these medical terms is a top priority for an ASR system specialized for clinical text.

Example 1 transcript

MEDICAL TERM RECALL RATE

Medical term recall rate (MTR) serves the purpose of tracking more clinically meaningful errors, and is a complementary metric. MTR tracks the fraction of medical terms in the reference that are captured in the generated transcript. We continue with the example above to illustrate how this metric is calculated.

In this case, the words “feeling,” “pain,” “back,” and “Advil” are all medical terms, and since the generated transcript captured three out of four terms, the MTR would be 75%.

Example 1 (continued): MTR calculation with a simple example of a reference and generated transcript.

Example 1 transcript continued

Medical terms highlighted

Comparing to off-the-shelf transcription models

Since medical conversations have substantial overlap with generic conversations, we benchmark our ASR models on public datasets to ensure that we have comparable performance in the general conversation setting. On generic speech corpora (e.g., the public Librispeech benchmark dataset), our ASR system performs comparably to state-of-the-art ASR models like Whisper v3 from OpenAI.

The true value of our speech recognition system is revealed on medical conversation transcription, where we significantly outperform off-the-shelf ASR models that are specifically designed for medical conversations  (e.g., Google Medical Conversations).

For instance, on one of our internal medical conversation benchmarks, we compare ourselves against several public ASR models:

WOrd Error rate

Medical term recall

OUR MODEL

12.7%

97.0%

Google Medical Conversations

16.6%

96.4%

AWS Transcribe Medical

17.1%

92.9%

AssemblyAI’s Universal-2

17.9%

94.6%

IN OTHER WORDS

Our model has a 24% relative reduction in word error rate

Compared to Google Medical Conversation ASR with slightly higher rate of medical term recall, and exhibits high medical term recall overall.

Other models we compared against have higher error rates

As well as substantially lower rates of medical term recall.

Our model has a 83% relative reduction in error on new medications

On a curated challenge dataset laden with new medication names, our system demonstrates a 83% relative reduction in error on new medications compared to Google Medical Conversations ASR.

Our model has a 81% relative reduction in error on new medications

Compared to OpenAI’s Whisper v3 model.

Our model has a 16% relative reduction in word error rate

Compared to Google Medical Conversation ASR with slightly higher rate of medical term recall, and exhibits high medical term recall overall.

Other models we compared against have higher error rates

As well as substantially lower rates of medical term recall.

A 45% relative reduction in error on new medications

On a curated challenge dataset laden with new medication names, our system demonstrates a 45% relative reduction in error on new medications compared to Google Medical conversation ASR.

A 40% relative reduction in error

Compared to OpenAI’s Whisper v3 model.

Performance

Assessing and improving performance across multilingual settings

In this section, we consider real-world performance of our end-to-end system as judged by our users, and illustrate how we assess performance across a diverse set of users who speak multiple languages.

Internal multilingual evaluation of our ASR and note-generation system

We start by discussing the evaluations we have used to benchmark our multilingual performance internally, and which have formed the basis for model improvements that drive our real-world multilingual performance, described at the end of this deep dive.

First, our approved set of languages is based on the observed performance of our ASR system on audio and associated gold standard transcripts from a variety of languages. One of the datasets we use in evaluation is FLEURS, a parallel speech dataset in 102 languages built on top of the machine translation FLoRes-101 benchmark. Here we have access to gold standard transcripts, which allows us to establish ASR performance by measuring WER.

Measured on this dataset, for three of the non-English languages that we support, our WER is even lower in that language, compared to English:

Remaining Languages

1% lower

Word error rate

than current state-of-the-art, on average across these remaining languages

English

6.2%

WoRd error rate

Spanish

3.1%

Word error rate

For our note-generation system, we compare the performance of our system using the original English transcript against performance using a translated version of the transcript (into e.g., Spanish). Here, we observe comparable performance, even when the transcript is not in English.

The recall rate of relevant medical terms in the final note, using non-English transcripts, is >80% of the recall rate on the original English transcripts.

Multilingual performance in the wild

The primary metric for the end-to-end evaluation of our system is quantitative ratings of note quality (or “star ratings”) that users provide within the note editing interface, giving a rating from 1 to 5. This feedback lends itself to straightforward metrics (e.g., tracking the average star rating over time), and tracking performance across categories of encounters.

For reference, over the last three months (May-July), we collected tens of thousands of star ratings, with an average rating of 4.3 out of 5 for English-language encounters. Meanwhile, our performance in our most frequent non-English languages is nearly as strong over the same time period.

English

SpaniSh

Tagalog

Vietnamese

0.0

Conclusion

The feedback flywheel

Evaluation is not just a set of guardrails, but a compass. Continuous evaluation of our product is not just designed to catch or prevent issues, but to drive improvements of our product at scale. As feedback from the field continues to suggest avenues for improving our product, our evaluations expand to cover more aspects of quality beyond the basic requirements of correctness. Rapid product improvements (in response to feedback) create a virtuous cycle—when users see that their feedback is taken seriously, they are more likely to provide feedback in the future.

As the product improves, the goalposts move.

For example:

Accented English

At Abridge, we’re committed to building a speech understanding system that works for everyone. In response to user feedback about transcription quality for accented English, we launched a focused effort that led to a 15.8% relative improvement in transcription accuracy on a challenge dataset spanning a wide range of global accents—including South and East Asian, European, South African, and Middle Eastern.

Missing problems

In the early days of building our product, users would give feedback about missing problems in the problem-based Assessment & Plan.

Presentation

Once we shipped improvements to our system that more reliably captured the relevant problems, feedback tended to focus more on how we presented that information, with clinicians in different specialties wanting different ordering of problems.

Improvements

Rigorous evaluation is not a one-time exercise on a static dataset, but a continuous and ever-evolving process that makes large-scale improvement possible.

Abridge is trusted by 150+ health systems to produce high-quality documentation, and we take that trust seriously. Setting standards for the measurement of note quality is not a unilateral effort performed by a single research team, but an exercise done in partnership with our users and the broader community. We intend to continue participating in the broader conversation on how to evaluate systems like ours, and hope to work toward a common industry framework for apples-to-apples comparisons.

Download
whitepaper

Thank you

Keep an eye out for an email from us—we will be in touch with next steps.

See our case studies

Oops! Something went wrong while submitting the form.

Download whitepaper

Report sent

Whitepaper

Pioneering the Science of AI Evaluation

Introduction

Executive summary

Commitment and expertise

Quality and monitoring process

Performance

Looking ahead

quality and monitoring

quality and monitoring

quality and monitoring

quality and monitoring

quality and monitoring

Qualitative feedback for uncovering blind spots

Performance

Performance

Our model has a 24% relative reduction in word error rate

Other models we compared against have higher error rates

Our model has a 83% relative reduction in error on new medications

Our model has a 81% relative reduction in error on new medications

Performance

Conclusion

Downloadwhitepaper

Commitment
and expertise

Quality and
monitoring process

Download
whitepaper