Jan 12, 2021
Elisa Ferracane
Machine Learning

Looking for disparities in our Machine Learning models

Our intention at Abridge has always been to create machine learning algorithms that support better understanding and follow-through. However, we recognize that disparities and biases exist in the U.S. healthcare system, as evidenced by a wealth of studies. For example, women experience longer delays in diagnosis and are prescribed less medication for pain, which is more often attributed to psychogenic causes.¹, ²

When applying machine learning solutions to healthcare, biases may inadvertently surface and even be amplified. A recent study found Black patients were being excluded from a beneficial health program because health costs were being used as a discriminatory proxy for health needs.³ Historically, less money is spent on Black patients, so the algorithm falsely concluded this population had less need when in fact they were as sick or sicker than White patients in the program. The potential for bias exists in other high-stakes fields such as hiring, where historical bias against women was exacerbated by an algorithm that disfavored graduates of all-women’s colleges.

So how does this impact Abridge? We wanted to make sure we weren’t inadvertently perpetuating or amplifying disparities. That’s why we decided to interrogate our model for differences in performance across various groups, which could lead to an unequal distribution of benefits to our users.

A little background on our models: over the last couple of years, we’ve developed a classification model that identifies parts of the doctor-patient conversation that are relevant to the care plan for patients. These parts include takeaways, such as discussions of next steps or medication instructions. So far, we have delivered more than 90,000 takeaways to our users. Surfacing these utterances to patients can improve recall and understanding of their care plan.

We wanted to make sure that this classifier worked equally well across various groups. In our paper, “Towards Fairness in Classifying Medical Conversations into SOAP Sections,” we measured differences in how well the model performs, depending on which population group was selected. The doctor-patient conversations in our model were de-identified, collected by a third party with full consent, and did not include any Abridge user data.

Measurements of disparities

We studied disparities between the following groups, organized by their protected attributes (listed in the table below). We focused on these groups because prior literature documents a history of bias against them in healthcare.

The table lists the 7 protected attributes studied in this paper, with their corresponding disadvantaged and advantaged groups:

- Race/Ethnicity — disadvantaged: Black, Hispanic, Asian; advantaged: White.
- Gender (both patient and physician) — disadvantaged: Female; advantaged: Male.
- Race/Ethnicity and Gender (intersectional) — disadvantaged: Black female, Hispanic female; advantaged: White male.
- Socioeconomic status — disadvantaged: Unemployed, Retired, Living in a nursing home, Incarcerated, Medicaid insurance, Uninsured; advantaged: Full-time job, Living at home, Private insurance.
- Obesity — disadvantaged: weight greater than or equal to 250 pounds; advantaged: weight less than 250 pounds.
- Mental health — disadvantaged: visits to a Psychiatrist; advantaged: visits to Other specialties.
- Location — disadvantaged: visits in Other U.S. states; advantaged: visits in Florida, California, New York.
Figure 1: Protected attributes and their respective disadvantaged and advantaged groups.

Because our classification task is assistive (helping the patient recall and understand their care plan), we chose metrics that focus on true positives and false negatives. A true positive occurs when the model predicts correctly that an utterance is part of the patient’s care plan. A false negative occurs when the model predicts erroneously that an utterance is not part of the plan, when in fact it is. The three metrics we focus on are average odds difference (AOD), equal opportunity ratio (EOR), and false omission rate ratio (FORR). Measurements between 0.8 and 1.25 are considered within the fair threshold, while <0.8 implies higher benefit for the advantaged group and >1.25 a higher benefit for the disadvantaged group (illustrated in Figure 2).

Thresholds for the metrics of group disparities: values less than 0.8 indicate higher benefit for the advantaged group, values greater than 1.25 indicate higher benefit for the disadvantaged group, and values between 0.8 and 1.25 are within the fair thresholds.
Figure 2: Thresholds for metrics of group disparities, illustrated with blue shading.
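To make the ratio metrics concrete, here is a minimal sketch of how two of them (EOR and FORR) can be computed from per-group confusion-matrix counts, oriented so that values below 0.8 favor the advantaged group. The exact formulas, including the average odds difference, are defined in the paper; the counts and the orientation of the ratios below are illustrative assumptions, not Abridge's implementation.

```python
from collections import namedtuple

# Confusion-matrix counts for one group
# (tp = utterance correctly flagged as part of the care plan)
Counts = namedtuple("Counts", "tp fp fn tn")

def tpr(c):
    # True positive rate: of actual plan utterances, fraction the model surfaced
    return c.tp / (c.tp + c.fn)

def false_omission_rate(c):
    # Of utterances predicted "not plan", fraction that actually were plan
    return c.fn / (c.fn + c.tn)

def disparity_ratios(disadv, adv):
    """Ratio-style disparity metrics, oriented so values < 0.8 indicate
    higher benefit for the advantaged group (an assumed convention)."""
    eor = tpr(disadv) / tpr(adv)                                    # equal opportunity ratio
    forr = false_omission_rate(adv) / false_omission_rate(disadv)   # false omission rate ratio
    return {"EOR": eor, "FORR": forr}

def within_fair_band(value, lo=0.8, hi=1.25):
    return lo <= value <= hi

# Toy numbers: the disadvantaged group has twice the false negatives
disadv = Counts(tp=80, fp=10, fn=20, tn=90)
adv = Counts(tp=90, fp=10, fn=10, tn=90)
ratios = disparity_ratios(disadv, adv)
```

With these toy counts, the EOR (about 0.89) falls inside the fair band, while the FORR (0.55) falls below 0.8, flagging a higher false-omission risk for the disadvantaged group.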

Analysis of disparities

To gain a deeper understanding of differences between the advantaged and disadvantaged groups, we conducted two analyses. First, we wanted to understand differences in language usage between the groups. We carried out a local mutual information (LMI) analysis to measure the association between common phrases and the plan label. Phrases with a high association measure were strong lexical cues for the plan label. If the cues were different between the combined groups and the disadvantaged group, this suggested that the plan conversations in the disadvantaged group were considerably different.
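Local mutual information scores a phrase-label pair by weighting its pointwise mutual information by the pair's joint probability, so frequent, strongly associated cues rank highest. A minimal sketch over unigrams (the paper's analysis covers longer phrases; the corpus below is invented for illustration):

```python
import math
from collections import Counter

def local_mutual_information(utterances):
    """Score how strongly each token is associated with each label.

    utterances: list of (tokens, label) pairs, e.g. (["blood", "work"], "plan").
    Returns {(token, label): LMI score}.
    """
    joint, token_counts, label_counts = Counter(), Counter(), Counter()
    total = 0
    for tokens, label in utterances:
        for tok in tokens:
            joint[(tok, label)] += 1
            token_counts[tok] += 1
            label_counts[label] += 1
            total += 1
    lmi = {}
    for (tok, label), n in joint.items():
        p_joint = n / total
        p_independent = (token_counts[tok] / total) * (label_counts[label] / total)
        # LMI = p(x, y) * log(p(x, y) / (p(x) * p(y))): PMI weighted by frequency
        lmi[(tok, label)] = p_joint * math.log(p_joint / p_independent)
    return lmi

# Toy corpus: "blood" co-occurs mostly with plan utterances
corpus = [
    (["blood", "work"], "plan"),
    (["blood", "test"], "plan"),
    (["hello", "there"], "other"),
    (["blood", "hello"], "other"),
]
scores = local_mutual_information(corpus)
```

In this toy corpus, ("blood", "plan") scores positive while ("blood", "other") scores negative, mirroring how strong lexical cues for the plan label surface in the analysis.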

Second, we expected differences in how much of the conversation revolved around the care plan depending on the type of medical visit (for example, visiting a primary care physician may often result in a simple annual follow-up vs. a cardiologist may involve additional tests, new medications or specialist follow-ups). We thus probed whether omitting conversations from certain types of medical visits had any effect on the classifier outcomes.
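One way to run such a probe is to recompute a metric after dropping each visit type in turn and compare against the baseline. The sketch below illustrates the idea; the function and field names are hypothetical, not from the paper.

```python
def ablate_visit_types(examples, metric_fn, visit_types):
    """Recompute a metric after dropping each visit type in turn.

    examples: list of dicts carrying a 'visit_type' field plus whatever
    metric_fn needs; metric_fn maps a list of examples to a number.
    """
    results = {"all": metric_fn(examples)}
    for vt in visit_types:
        kept = [e for e in examples if e["visit_type"] != vt]
        results["without " + vt] = metric_fn(kept)
    return results

# Toy usage: the "metric" here is just the fraction of plan utterances
toy = [
    {"visit_type": "cardiology", "is_plan": 1},
    {"visit_type": "primary care", "is_plan": 0},
    {"visit_type": "primary care", "is_plan": 0},
]
frac_plan = lambda ex: sum(e["is_plan"] for e in ex) / len(ex)
report = ablate_visit_types(toy, frac_plan, ["cardiology"])
```

In practice the metric function would be a disparity ratio computed per group pair, and a visit type is implicated when dropping it moves the ratio back inside the fair thresholds.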

Our findings

Of the 18 group pairings evaluated, we found 3 cases where the differences exceeded the established thresholds in at least 1 of the 3 metrics (see Figure 3). For each of these cases, only the false omission rate ratio (FORR) was outside the thresholds. The ratio between the two groups was less than 0.8, indicating the disadvantaged group experienced a higher rate of false omissions. That is, when the classifier predicted that an utterance in the conversation was not part of the care plan, the risk was higher for the disadvantaged group (compared to the advantaged group) that the utterance actually was part of the plan.

In all 3 cases, our LMI analysis found differences in lexical cues for the plan label. Phrases such as "blood" and "blood work" that were usually highly indicative of plan utterances were less prominent in the conversations of these disadvantaged groups. In each of the 3 cases, we found that omitting one certain type of medical visit eliminated the observed disparities (see the green hatched bars in Figure 3).

3 cases where the False Omission Rate Ratio exceeds the established thresholds when classifying portions of the conversation as part of a patient's care plan. The disadvantaged and advantaged groups in the 3 cases are Asian and White, Hispanic female and White male, Incarcerated and Living at home. In all 3 cases, the disparities no longer exceed the thresholds when certain medical visits are omitted.
Figure 3: Metrics of group disparities for the 3 cases where at least one metric exceeds the thresholds. Blue bars indicate group disparities are within the thresholds, pink bars indicate disparities exceed the thresholds. The green hatched bars indicate the disparities which no longer exceed the thresholds after dropping the problematic medical provider type.


We measured the performance of our classifier across different socio-demographic groups to understand whether any disparities exist. We found small differences for a small number of groups, which may be explained by the type of medical visit.

This study is just one part of a much broader picture, and ensuring fairness in our algorithms requires ongoing diligence. We plan to incorporate an audit process to revise this and other aspects of our machine learning solutions, with the hope that we can support everyone in better understanding the next steps in their care.

Read the paper:

Towards Fairness in Classifying Medical Conversations into SOAP Sections


¹Gender inequalities in the promptness of diagnosis of bladder and renal cancer after symptomatic presentation: evidence from secondary analysis of an English primary care audit survey

²“Brave Men” and “Emotional Women”: A Theory-Guided Literature Review on Gender Bias in Health Care and Gendered Norms towards Patients with Chronic Pain

³Dissecting racial bias in an algorithm used to manage the health of populations

Amazon scraps secret AI recruiting tool that showed bias against women

Towards an Automated SOAP Note: Classifying Utterances from Medical Conversations

Unequal treatment: confronting racial and ethnic disparities in health care

About the Author

Elisa Ferracane

NLP Research Scientist

Elisa is an NLP researcher on the Machine Learning team at Abridge. She has a PhD in Linguistics from the University of Texas at Austin, where she focused on how to capture the ambiguity and subjectivity of discourse in dialogue. Before her graduate studies, she was a lead software developer at IBM.