Looking for disparities in our machine learning models
Our intention at Abridge has always been to create machine learning algorithms that support better understanding and follow-through. However, we recognize that disparities and biases exist in the U.S. healthcare system, as evidenced by a wealth of studies. For example, women experience longer delays in diagnosis and are prescribed less medication for pain, which is more often attributed to psychogenic causes.¹, ²
When applying machine learning solutions to healthcare, biases may inadvertently surface and even be amplified. A recent study found Black patients were being excluded from a beneficial health program because health costs were being used as a discriminatory proxy for health needs.³ Historically, less money is spent on Black patients, so the algorithm falsely concluded this population had less need when in fact they were as sick or sicker than White patients in the program. The potential for bias exists in other high-stakes fields such as hiring, where historical bias against women was exacerbated by an algorithm that disfavored graduates of all-women’s colleges.⁴
So how does this impact Abridge? We wanted to make sure we weren’t inadvertently perpetuating or amplifying disparities. That’s why we decided to interrogate our model for differences in performance across various groups, which could lead to an unequal distribution of benefits to our users.
A little background on our models: over the last couple of years, we’ve developed a classification model that identifies parts of the doctor-patient conversation that are relevant to the care plan for patients. These parts include takeaways, such as discussions of next steps or medication instructions. So far, we have delivered more than 90,000 takeaways to our users. Surfacing these utterances to patients can improve recall and understanding of their care plan.⁵
We wanted to make sure that this classifier worked equally well across various groups. In our paper, “Towards Fairness in Classifying Medical Conversations into SOAP Sections,” we measured how the model’s performance differs from one population group to another. The doctor-patient conversations used in the study were de-identified, collected by a third party with full consent, and did not include any Abridge user data.
Measurements of disparities
We studied disparities between the following groups, organized by their protected attributes (listed in the table below). We chose these groups because prior literature documents a history of bias against them in healthcare.⁶
Because our classification task is assistive (helping the patient recall and understand their care plan), we chose metrics that focus on true positives and false negatives. A true positive occurs when the model predicts correctly that an utterance is part of the patient’s care plan. A false negative occurs when the model predicts erroneously that an utterance is not part of the plan, when in fact it is. The three metrics we focus on are average odds difference (AOD), equal opportunity ratio (EOR), and false omission rate ratio (FORR). For the ratio metrics, values between 0.8 and 1.25 are considered within the fair threshold; a value below 0.8 implies a higher benefit for the advantaged group, and a value above 1.25 a higher benefit for the disadvantaged group (illustrated in Figure 2).
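To make the three metrics concrete, here is a minimal Python sketch that computes them from per-group confusion-matrix counts. It is an illustration, not the code used in the paper: the Counts class, the function names, and the example numbers are hypothetical. The formulas follow the commonly used definitions of these metrics, and the direction of the FORR ratio (advantaged over disadvantaged) is an assumption chosen so that a value below 0.8 means more false omissions for the disadvantaged group, as described in this post. The 0.8 to 1.25 check is applied to the two ratio metrics only.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Confusion-matrix counts for one group (hypothetical container)."""
    tp: int  # plan utterances predicted as plan
    fp: int  # non-plan utterances predicted as plan
    tn: int  # non-plan utterances predicted as non-plan
    fn: int  # plan utterances predicted as non-plan


# The helpers below assume every denominator is non-zero.
def true_positive_rate(c: Counts) -> float:
    return c.tp / (c.tp + c.fn)


def false_positive_rate(c: Counts) -> float:
    return c.fp / (c.fp + c.tn)


def false_omission_rate(c: Counts) -> float:
    # Of everything predicted "not plan", the share that really was plan.
    return c.fn / (c.fn + c.tn)


def average_odds_difference(dis: Counts, adv: Counts) -> float:
    # Mean of the FPR gap and the TPR gap (disadvantaged minus advantaged).
    fpr_gap = false_positive_rate(dis) - false_positive_rate(adv)
    tpr_gap = true_positive_rate(dis) - true_positive_rate(adv)
    return 0.5 * (fpr_gap + tpr_gap)


def equal_opportunity_ratio(dis: Counts, adv: Counts) -> float:
    return true_positive_rate(dis) / true_positive_rate(adv)


def false_omission_rate_ratio(dis: Counts, adv: Counts) -> float:
    # Assumed direction: below 1 means the disadvantaged group has the
    # higher false omission rate.
    return false_omission_rate(adv) / false_omission_rate(dis)


def within_ratio_threshold(value: float, low: float = 0.8, high: float = 1.25) -> bool:
    return low <= value <= high


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    advantaged = Counts(tp=80, fp=20, tn=180, fn=20)
    disadvantaged = Counts(tp=70, fp=20, tn=180, fn=30)
    eor = equal_opportunity_ratio(disadvantaged, advantaged)
    forr = false_omission_rate_ratio(disadvantaged, advantaged)
    print("AOD:", average_odds_difference(disadvantaged, advantaged))
    print("EOR:", eor, within_ratio_threshold(eor))
    print("FORR:", forr, within_ratio_threshold(forr))
```

With these made-up counts, EOR stays inside the 0.8 to 1.25 band while FORR falls below it.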
Analysis of disparities
To gain a deeper understanding of differences between the advantaged and disadvantaged groups, we conducted two analyses. First, we wanted to understand differences in language usage between the groups. We carried out a local mutual information (LMI) analysis to measure the association between common phrases and the plan label. Phrases with a high association measure were strong lexical cues for the plan label. If the cues were different between the combined groups and the disadvantaged group, this suggested that the plan conversations in the disadvantaged group were considerably different.
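As a rough illustration of the LMI analysis, the sketch below scores phrases by their association with a label. It assumes the commonly used definition LMI(w, y) = p(w, y) · log(p(y | w) / p(y)), with probabilities estimated from phrase counts. The whitespace tokenization, the restriction to unigrams and bigrams, and the function names are simplifying assumptions for the example, not details taken from the paper.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Yield all unigrams and bigrams (up to n_max) as space-joined strings."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def lmi_scores(utterances, target_label):
    """Return {phrase: LMI score} for phrases seen with target_label.

    utterances is an iterable of (text, label) pairs.
    """
    phrase_counts = Counter()  # count(w) over all labels
    joint_counts = Counter()   # count(w, y) for y == target_label
    total = 0                  # total phrase occurrences
    label_total = 0            # phrase occurrences under target_label

    for text, label in utterances:
        for phrase in ngrams(text.lower().split()):
            phrase_counts[phrase] += 1
            total += 1
            if label == target_label:
                joint_counts[phrase] += 1
                label_total += 1

    if total == 0 or label_total == 0:
        return {}

    p_label = label_total / total
    scores = {}
    for phrase, joint in joint_counts.items():
        p_joint = joint / total
        p_label_given_phrase = joint / phrase_counts[phrase]
        scores[phrase] = p_joint * math.log(p_label_given_phrase / p_label)
    return scores


if __name__ == "__main__":
    # Toy utterances, invented for illustration.
    toy = [
        ("we will order blood work", "plan"),
        ("get blood work next week", "plan"),
        ("how are you feeling", "other"),
        ("the weather is nice", "other"),
    ]
    ranked = sorted(lmi_scores(toy, "plan").items(), key=lambda kv: -kv[1])
    for phrase, score in ranked[:5]:
        print(f"{phrase!r}: {score:.4f}")
```

Running the same function on the combined data and on one group's data, then comparing the top-ranked phrases, is one simple way to see whether the lexical cues differ.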
Second, we expected the share of the conversation devoted to the care plan to vary with the type of medical visit. For example, a visit to a primary care physician may often result in a simple annual follow-up, whereas a visit to a cardiologist may involve additional tests, new medications, or specialist follow-ups. We thus probed whether omitting conversations from certain types of medical visits had any effect on the classifier outcomes.
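The visit-type probe can be pictured as a leave-one-out loop: drop all conversations of one visit type, recompute the metric, and see whether the disparity remains. The sketch below does this for the false omission rate ratio. The record layout (dictionaries with "visit_type", "group", "true", and "pred" keys), the function names, and the ratio direction (advantaged over disadvantaged) are assumptions made for this example, not the paper's implementation.

```python
def false_omission_rate(records):
    """Share of predicted-negative utterances that were actually plan.

    Returns None when nothing was predicted negative.
    """
    predicted_negative = [r for r in records if not r["pred"]]
    if not predicted_negative:
        return None
    missed = sum(1 for r in predicted_negative if r["true"])
    return missed / len(predicted_negative)


def false_omission_rate_ratio(records, disadvantaged, advantaged):
    """Assumed direction: below 1 means more false omissions for the
    disadvantaged group. Returns None when the ratio is undefined."""
    for_dis = false_omission_rate([r for r in records if r["group"] == disadvantaged])
    for_adv = false_omission_rate([r for r in records if r["group"] == advantaged])
    if for_dis is None or for_adv is None or for_dis == 0:
        return None
    return for_adv / for_dis


def forr_without_each_visit_type(records, disadvantaged, advantaged):
    """Recompute the ratio once per visit type, with that type left out."""
    visit_types = sorted({r["visit_type"] for r in records})
    return {
        omitted: false_omission_rate_ratio(
            [r for r in records if r["visit_type"] != omitted],
            disadvantaged,
            advantaged,
        )
        for omitted in visit_types
    }
```

A ratio that moves back inside the 0.8 to 1.25 band when one visit type is left out suggests that the disparity is concentrated in that type of visit.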
Of the 18 group pairings evaluated, we found 3 cases where the differences exceeded the established thresholds in at least 1 of the 3 metrics (see Figure 3). For each of these cases, only the false omission rate ratio (FORR) was outside the thresholds. The ratio between the two groups was less than 0.8, indicating the disadvantaged group experienced a higher rate of false omissions. That is, when the classifier predicted that an utterance in the conversation was not part of the care plan, the risk was higher for the disadvantaged group (compared to the advantaged group) that the utterance actually was part of the plan.
In all 3 cases, our LMI analysis found differences in lexical cues for the plan label. Phrases such as "blood" and "blood work" that were usually highly indicative of plan utterances were less prominent in the conversations of these disadvantaged groups. In each of the 3 cases, we also found that omitting one particular type of medical visit eliminated the observed disparities (see the green hatched bars in Figure 3).
We measured the performance of our classifier across different socio-demographic groups to understand whether any disparities exist. We found small differences for a small number of groups, which may be explained by the type of medical visit.
This study is just one part of a much broader picture, and ensuring fairness in our algorithms requires ongoing diligence. We plan to incorporate an audit process to revise this and other aspects of our machine learning solutions, with the hope that we can support everyone in better understanding the next steps in their care.
Read the paper: