Announcements

Posted April 23, 2026 by Abridge
Early Access Impact Results from OpenAI’s GPT-5.5

By Chaitanya Asawa, Matt Sanders, and Shubh Gupta


Today, OpenAI announced GPT-5.5, their most capable model yet for professional work with a focus on ownership and autonomy. Abridge has been testing this model in early access across core products and workflows (specifically note generation and clinical decision support).

Abridge builds AI that operates within the clinical encounter: generating documentation from patient-clinician conversations, and surfacing evidence-based clinical insights at the point of care, sourced from trusted medical sources. Both serve clinicians who are domain experts, working in an environment where accuracy isn't aspirational—it's a requirement. Building Abridge means addressing hundreds of AI-powered tasks at any given moment. We build many of these features by training models on proprietary data flywheels, while leveraging the strongest capabilities of foundation models, like GPT-5.5, to power others. This constellation of models allows us to stay at the edge of the Pareto frontier.

That mandate requires us to be demanding evaluators of new models, and it makes model quality a genuine consideration for how quickly we build new solutions for clinicians. When a new foundation model becomes available, we don't swap it in and hope. We evaluate rigorously across the dimensions that matter most for our AI models, and only ship what clears the high bar set by our team, in-house clinical staff, and our partners. Here's what we learned while being one of OpenAI's early test partners for GPT-5.5.

How We Think About Evaluation

Our approach to evaluation at Abridge is driven by the reality that AI for healthcare operates in a domain where errors carry real consequences. We've written in depth about our evaluation methodology, specifically about clinical decision support with our multi-layer validation process and the role of clinician-in-the-loop review (in How We Evaluate Clinical Decision Support for Enterprise Readiness).

At a high level, we grade new models noting multiple factors that matter: clinical accuracy, completeness, faithfulness to the source conversation, tool utilization, multi-step reasoning, and response quality. A model that improves on one axis but materially regresses on others doesn't make its way into our technology stack. As our platform continues to expand beyond ambient documentation to areas like revenue cycle, nursing, and clinical decision support, we stress-test against workloads that exercise fundamentally different capabilities—from faithful summarization of long, unstructured conversations to multi-step synthesis of patient context with medical literature.
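As an illustration of that gating principle, a no-regression check might look like the following sketch. All axis names, scores, and the tolerance below are hypothetical, invented for illustration; they are not our actual evaluation dimensions or thresholds.

```python
# Hypothetical sketch of a multi-axis promotion gate: a candidate model is
# only shipped if no evaluation axis materially regresses versus the baseline.
# Axis names, scores, and the tolerance are illustrative, not real values.

BASELINE = {"clinical_accuracy": 0.91, "completeness": 0.84,
            "faithfulness": 0.95, "tool_utilization": 0.72,
            "response_quality": 0.88}

REGRESSION_TOLERANCE = 0.01  # maximum allowed drop on any single axis


def clears_gate(candidate: dict[str, float], baseline: dict[str, float],
                tolerance: float = REGRESSION_TOLERANCE) -> bool:
    """Pass only if every axis stays within `tolerance` of the baseline."""
    return all(candidate[axis] >= score - tolerance
               for axis, score in baseline.items())


candidate = {"clinical_accuracy": 0.93, "completeness": 0.88,
             "faithfulness": 0.95, "tool_utilization": 0.80,
             "response_quality": 0.87}
print(clears_gate(candidate, BASELINE))  # True: no axis dropped beyond tolerance
```

The key property is that a large gain on one axis cannot buy back a material loss on another; the gate is a conjunction, not a weighted average.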

In preparation for the recent launch of our clinical decision support solution, we evaluated each test case against a rubric built by multiple physicians: each physician independently reviewed a real, de-identified encounter, and their judgments were then calibrated into a common evaluation rubric. The resulting evaluation set covers over 1,000 rubrics. Everything we launch to partners must pass this same level of rigor.
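To make the rubric idea concrete, here is a minimal sketch of element-coverage scoring. The naive keyword matching and the example rubric are purely illustrative; in practice, rubric grading relies on clinician and model review rather than string containment.

```python
# Illustrative sketch of rubric-based scoring: each test case has a rubric of
# expected clinical elements, and coverage is the fraction of those elements
# present in the model's response. Keyword matching here is a stand-in for
# the clinician/model grading used in real evaluations.

def element_coverage(response: str, rubric_elements: list[str]) -> float:
    """Return the fraction of rubric elements found in the response."""
    text = response.lower()
    hits = sum(1 for element in rubric_elements if element.lower() in text)
    return hits / len(rubric_elements)


# Hypothetical rubric and response for a diabetes-management question.
rubric = ["metformin", "renal function", "HbA1c"]
response = "Consider continuing metformin; recheck renal function in 3 months."
print(round(element_coverage(response, rubric), 2))  # 0.67 (2 of 3 elements)
```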

Using Early Results to Accelerate Development

Clinical decision support is a fundamentally different problem for AI than documentation. When a clinician asks a detailed medical question mid-encounter, our clinical decision support solution uses the patient's specific context from the conversation to surface the most relevant evidence and tools from trusted sources, giving the clinician well-cited, contextually relevant information to support their clinical decision making.

This is where GPT-5.5 had the most dramatic impact in expediting our development plans.

GPT-5.5 Scorecard vs. GPT-5.2

Clinical quality → 25% relative increase
Conciseness → 30% less verbose
Safety → Increased safety against adversarial cases


Better tool utilization:
GPT-5.5 meaningfully improved our ability to leverage a full set of clinical evidence tools, combining retrieved evidence with the specific patient context from the conversation.

We evaluated the new OpenAI model against GPT-5.2, specifically looking at how well each model uses tools to ground its responses in retrieved medical evidence. When we gave the previous model access to additional retrieval tools, broadened the tool descriptions to encourage more liberal tool calls, and iterated on the prompt to nudge it toward more comprehensive tool-grounded responses, we saw no meaningful movement in element coverage on our rubric-based clinical quality evaluation. When an adjustment improved performance in one area, such as completeness or tool utilization, it often degraded another, like faithfulness to the source conversation or response quality.

When we gave GPT-5.5 access to additional retrieval tools, element coverage meaningfully improved, and that lift came without any corresponding slip in negative compliance. Where GPT-5.2 traded one axis for another every time we expanded the toolbox, GPT-5.5 pushed positive coverage higher while holding its safety profile intact. 

With better tool utilization and stronger reasoning, we saw a 25% relative increase in coverage of elements in our rubric-based clinical quality evaluation, with the model matching rubric elements at greater specificity and more effectively drawing on additional content sources for clinical decision support.

Token Efficiency: Even compared against our best-performing setup for the prior model, GPT-5.5 covered more of the clinical elements clinicians expect to see while producing meaningfully shorter responses. As we expanded the set of tools and evidence sources available to the agent, GPT-5.5 continued to push coverage higher without making its responses longer; the prior model's outputs remained verbose regardless of configuration, and still covered less ground. That combination means less text for clinicians to read mid-encounter, and more clinically relevant information per word they do read. Producing concise yet more informative responses is crucial in a clinical setting, and with GPT-5.5 we achieved higher clinical quality element coverage while being 30% less verbose than with the previous model.
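A coverage-versus-length trade-off like this can be quantified along two simple axes: relative verbosity against a baseline, and coverage per token. The sketch below uses invented numbers purely for illustration; the metric definitions are one reasonable choice, not our exact methodology.

```python
# Hedged sketch of the "more coverage, fewer words" comparison.
# All numbers are hypothetical, not measured results.

def relative_verbosity(candidate_tokens: int, baseline_tokens: int) -> float:
    """Fractional length change vs. baseline; negative means shorter."""
    return (candidate_tokens - baseline_tokens) / baseline_tokens


def coverage_per_token(coverage: float, tokens: int) -> float:
    """Rubric element coverage normalized by response length."""
    return coverage / tokens


baseline = {"coverage": 0.60, "tokens": 400}   # hypothetical prior model
candidate = {"coverage": 0.75, "tokens": 280}  # hypothetical new model

print(relative_verbosity(candidate["tokens"], baseline["tokens"]))  # -0.3, i.e. 30% shorter
print(coverage_per_token(candidate["coverage"], candidate["tokens"]) >
      coverage_per_token(baseline["coverage"], baseline["tokens"]))  # True
```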

Behavioral Safety Across Turns: Under adversarial and multi-turn conditions, we assessed how consistently GPT-5.5 maintained appropriate refusal boundaries. In these settings, model behavior can be sensitive to shifts in phrasing or context, particularly when follow-up prompts are indirect or introduce ambiguity. Compared to earlier iterations, GPT-5.5 showed improved calibration in both directions: more consistently declining out-of-scope or adversarial requests, while appropriately responding to legitimate clinical and documentation questions. These improvements are especially apparent in multi-turn interactions, where GPT-5.5 better preserves safety boundaries across turns and remains robust to adversarial context shifts.
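One way to quantify that two-sided calibration is to track refusal rates separately on adversarial and legitimate turns: the first should be high, the second low. The labels and data in this sketch are invented for illustration.

```python
# Illustrative two-sided safety calibration metric: a well-calibrated model
# refuses adversarial turns and answers legitimate ones. Data is hypothetical.

def refusal_rates(results: list[dict]) -> tuple[float, float]:
    """Given per-turn records {'adversarial': bool, 'refused': bool},
    return (refusal rate on adversarial turns, refusal rate on legitimate turns)."""
    adversarial = [r for r in results if r["adversarial"]]
    legitimate = [r for r in results if not r["adversarial"]]
    rate = lambda rs: sum(r["refused"] for r in rs) / len(rs)
    return rate(adversarial), rate(legitimate)


results = [
    {"adversarial": True, "refused": True},
    {"adversarial": True, "refused": True},
    {"adversarial": True, "refused": False},   # a missed refusal
    {"adversarial": False, "refused": False},  # legitimate question, answered
    {"adversarial": False, "refused": False},
]
adv_rate, legit_rate = refusal_rates(results)
print(round(adv_rate, 2), legit_rate)  # high adversarial refusal, no over-refusal
```

Tracking both rates guards against the failure mode where a model becomes "safer" only by over-refusing legitimate clinical questions.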

“As a GPT-5.5 launch partner, Abridge set the standard for evaluating model performance in healthcare. Their results, including a 25% lift in clinical quality coverage, 30% more concise responses, and stronger safety under adversarial stress, show what’s possible when advanced models are paired with deep domain expertise and high standards.”
Barret Zoph
GM of B2B, OpenAI

That matters because Abridge is continuously expanding what clinical decision support can do. A model that degrades under added complexity forces trade-offs; a model that doesn't is a platform unlock.

Looking Ahead

The gains we're seeing across documentation and decision support reinforce a core thesis at Abridge: that the patient-clinician conversation contains some of healthcare’s most meaningful insights, and better foundation models unlock more of that conversation’s value for clinicians. GPT-5.5 doesn't just improve individual outputs—it expands the surface area of what we can build.

As we continue expanding Abridge's capabilities across the arc of clinical care, from preparation to documentation to decision support, models like GPT-5.5 are what make that ambition practical.

We're grateful to OpenAI for early access and for the partnership that made this evaluation possible. And we remain committed to what we think is the right approach for AI in healthcare: rigorous, independent evaluation of every foundation model, measured against the standards that matter most to the clinicians and health systems we serve.

If building AI for one of the hardest and most consequential domains sounds like the kind of problem you want to work on, we're hiring.
