Revisiting Gender Bias in Learner Evaluations
If you’ve ever served as a program director, clerkship coordinator, or faculty evaluator in graduate medical education, you’ve likely wrestled with one of the most uncomfortable truths in our field: evaluation is never entirely objective. As much as we strive to be fair and evidence-based, our feedback—both quantitative and narrative—is filtered through a lens of human perception, shaped by culture, context, and yes, bias.
Two studies, published 21 years apart, can help us see just how persistent and nuanced those biases can be—especially around gender.
In 2004, my colleagues and I published a study in Medical Education titled “Evaluation of interns by senior residents and faculty: is there any difference?” (1) We were curious about how interns were assessed by two very different evaluator groups—senior residents and attending physicians. We found something interesting: the ratings given by residents were consistently higher than those from faculty. Senior faculty were, surprisingly, significantly less likely to make negative comments. And more than that, the comments from senior residents were often more positive and personal. We speculated about why—perhaps residents were more empathic, closer to the intern experience, or more generous in peer evaluation.
But what we didn’t find in that study—and what medical educators are still working to unpack—was how factors like gender influence evaluations. We did not find any differences in the comments based on the gender of the evaluators, but the numbers were small enough that it was hard to know whether that finding was meaningful.
That’s where a new article by Jessica Hane and colleagues in the Journal of Graduate Medical Education (2) makes a significant contribution.
The Gender Lens: a distortion or an added quality?
Hane et al. examined nearly 10,000 faculty evaluations of residents and fellows at a large academic medical center over five years. They looked at both numerical ratings and narrative comments, and they did something smart: they considered the gender of both the evaluator and the evaluated. The findings? Gender disparities persist—but in subtle, revealing ways.
On average, women trainees received slightly lower numerical scores than their male counterparts, despite no evidence of performance differences. More strikingly, the language used in narrative comments showed clear patterns: male trainees were more likely to be described with competence-oriented language (“knowledgeable,” “confident,” “leader”), while women were more often praised for warmth and diligence (“caring,” “hard-working,” “team player”).
These aren’t new stereotypes, but their persistence in our evaluation systems is troubling. When we unconsciously associate men with competence and women with effort or empathy, we risk reinforcing old hierarchies. Even well-intentioned praise can pigeonhole trainees in ways that affect advancement, self-perception, and professional identity.
When bias feels like familiarity…
What’s particularly interesting is how these newer findings echo—and contrast with—what we saw back in 2004. Our study didn’t find any gender-specific differences, but we did notice that evaluators closer to the front lines (senior residents) tended to focus more on relationships, encouragement, and potential. Faculty, on the other hand, and particularly senior faculty, leaned toward more critical or objective assessments.
What happens, then, when those lenses intersect with gender? Are residents more likely to relate to and uplift women colleagues in the same way they uplift peers generally? Or does bias show up even in peer feedback, especially in high-stakes environments like residency? Hane’s study doesn’t fully answer that, but it opens the door for future research—and introspection.
The Myth of the “Objective” Evaluation
One of the biggest myths in medical education is that our evaluations are merit-based and free from bias. We put a lot of stock in numerical ratings, milestone checkboxes, and structured forms. But as both of these studies remind us, the numbers are only part of the story—and even they are shaped by deeper cultural narratives.
If you’ve ever read a stack of end-of-rotation evaluations, you know how much weight narrative comments can carry. One well-written paragraph can influence a Clinical Competency Committee discussion more than a dozen Likert-scale boxes. So when those comments are subtly gendered—when one resident is “sharp and assertive” and another is “kind and dependable”—we’re not just describing; we’re defining their potential. And that’s a problem.
What Can We Do About It?
Fortunately, awareness is the first step to addressing bias, and there are concrete steps we can take. Here are a few that I think are worth highlighting:
1. Train faculty and residents on implicit bias in evaluations. The research is clear: we all carry unconscious biases. But bias awareness training—when done well—can reduce the influence of those biases, especially in high-stakes assessments.
2. Structure narrative feedback to reduce ambiguity. Ask evaluators to comment on specific competencies (e.g., clinical reasoning, professionalism, communication) rather than open-ended impressions. This can shift focus from personal attributes to observable behaviors.
3. Use language analysis tools to monitor patterns. Some residency programs are now using AI tools to scan applications for gendered language (3) and to look at letters of recommendation for concerning language (4); a minimal sketch of what such a scan might look like appears after this list. It’s not about punishing faculty—it’s about reflection and improvement.
4. Encourage multiple perspectives. A single evaluation can reflect a single bias. Triangulating feedback from residents, peers, patients, and faculty can provide a fuller, fairer picture of a learner’s strengths and areas for growth.
5. Revisit how we use evaluations in decisions. Promotion and remediation decisions should weigh context. A low rating from one evaluator might reflect bias more than performance. Committees need to be trained to interpret evaluations with a critical eye.
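To make the third suggestion a little more concrete, here is a minimal sketch of what a language-pattern scan over narrative comments could look like. It is only an illustration, not the method used in either study: the AGENTIC and COMMUNAL word lists, the record layout, and the summarize function are all hypothetical, and a real program would rely on validated lexicons or the kinds of AI tools cited in references (3) and (4) rather than a hand-rolled script.

```python
import re
from collections import Counter

# Illustrative, hand-picked word lists (not validated instruments), echoing the
# kinds of terms described above: "agentic" competence words versus
# "communal" warmth-and-effort words.
AGENTIC = {"knowledgeable", "confident", "leader", "decisive", "assertive", "sharp"}
COMMUNAL = {"caring", "hard-working", "hardworking", "kind", "dependable", "team player"}


def count_terms(comment, terms):
    """Count whole-word occurrences of each term from a word list in one comment."""
    text = comment.lower()
    return sum(len(re.findall(r"\b" + re.escape(term) + r"\b", text)) for term in terms)


def summarize(evaluations):
    """Tally agentic vs. communal language by trainee gender.

    Each evaluation is a dict like {"trainee_gender": "F", "comment": "..."}
    (a hypothetical record layout for this sketch).
    """
    tally = Counter()
    for ev in evaluations:
        gender = ev["trainee_gender"]
        tally[(gender, "agentic")] += count_terms(ev["comment"], AGENTIC)
        tally[(gender, "communal")] += count_terms(ev["comment"], COMMUNAL)
    return dict(tally)


if __name__ == "__main__":
    # Hypothetical narrative comments, used only to show the output format.
    sample = [
        {"trainee_gender": "M", "comment": "Confident and knowledgeable; a natural leader on rounds."},
        {"trainee_gender": "F", "comment": "Caring and hard-working; a real team player."},
    ]
    print(summarize(sample))
    # {('M', 'agentic'): 3, ('M', 'communal'): 0, ('F', 'agentic'): 0, ('F', 'communal'): 3}
```

Even a crude tally like this, run over a year of narrative comments, can surface patterns worth bringing to a faculty development session or a Clinical Competency Committee discussion; the goal, as above, is reflection and improvement rather than surveillance.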
We’re All Still Learning
As someone who’s worked in medical education for decades, I can say with humility that I’ve probably written my fair share of biased evaluations. Not intentionally, but unavoidably. Like most educators, I want to be fair, supportive, and accurate—but we’re all products of our environments. Recognizing that is not an indictment. It’s an invitation.
The Hane study reminds us that even as our systems evolve, old habits linger. The Ringdahl, Delzell & Kruse study showed that who does the evaluating matters. Put those together, and the message is clear: we need to continuously examine how—and by whom—assessments are being made.
Because in the end, evaluations are not just about feedback. They’re about opportunity, identity, and trust. If we want our learning environments to be truly inclusive and equitable, then we have to be willing to see where our blind spots are—and do the hard work of correcting them.
References
(1) Ringdahl EN, Delzell JE, Kruse RL. Evaluation of interns by senior residents and faculty: is there any difference? Med Educ. 2004; 38: 646-651. https://doi.org/10.1111/j.1365-2929.2004.01832.x
(2) Hane J, Lee V, Zhou Y, Mustapha T, et al. Examining Gender-Based Differences in Quantitative Ratings and Narrative Comments in Faculty Assessments by Residents and Fellows. J Grad Med Educ 2025; 17 (3): 338–346. doi: https://doi.org/10.4300/JGME-D-24-00627.1.
(3) Sumner MD, Howell TC, Soto AL, Kaplan S, et al. The Use of Artificial Intelligence in Residency Application Evaluation-A Scoping Review. J Grad Med Educ. 2025; 17 (3): 308-319. doi: 10.4300/JGME-D-24-00604.1. Epub 2025 Jun 16. PMID: 40529251; PMCID: PMC12169010.
(4) Sarraf D, Vasiliu V, Imberman B, Lindeman B. Use of artificial intelligence for gender bias analysis in letters of recommendation for general surgery residency candidates. Am J Surg. 2021; 222 (6): 1051-1059. doi: 10.1016/j.amjsurg.2021.09.034. Epub 2021 Oct 2. PMID: 34674847.