Introduction

This blog is about medical education in the US and around the world. My interest is in education research and the process of medical education.



The lawyers have asked that I add a disclaimer making it clear that these are my personal opinions and do not represent any position of any university with which I am affiliated, including the American University of the Caribbean, the University of Kansas, the KU School of Medicine, Florida International University, or the FIU School of Medicine. Nor does any of this represent any position of the Northeast Georgia Medical Center or Northeast Georgia Health System.



Friday, July 11, 2025

Revisiting Gender Bias in Learner Evaluations


If you’ve ever served as a program director, clerkship coordinator, or faculty evaluator in graduate medical education, you’ve likely wrestled with one of the most uncomfortable truths in our field: evaluation is never entirely objective. As much as we strive to be fair and evidence-based, our feedback—both quantitative and narrative—is filtered through a lens of human perception, shaped by culture, context, and yes, bias.

Two studies, published 21 years apart, can help us see just how persistent and nuanced those biases can be—especially around gender.

In 2004, my colleagues and I published a study in Medical Education titled “Evaluation of interns by senior residents and faculty: is there any difference?” (1) We were curious about how interns were assessed by two very different evaluator groups—senior residents and attending physicians. We found something interesting: the ratings given by residents were consistently higher than those from faculty. Surprisingly, senior faculty were significantly less likely to make negative comments. And more than that, the comments from senior residents were often more positive and personal. We speculated about why—perhaps residents were more empathic, closer to the intern experience, or more generous in peer evaluation.

But what we didn’t find in that study—and what medical educators are still working to unpack—was how factors like gender influence evaluations. We did not find any differences in the comments based on the gender of the evaluators, but the numbers were small enough that it was not clear whether that finding was meaningful.

That’s where a new article by Jessica Hane and colleagues in the Journal of Graduate Medical Education (2) makes a significant contribution.

The Gender Lens: a distortion or an added quality?

Hane et al. examined nearly 10,000 faculty evaluations of residents and fellows at a large academic medical center over five years. They looked at both numerical ratings and narrative comments, and they did something smart: they parsed out gender differences in both the evaluator and the evaluated. The findings? Gender disparities persist—but in subtle, revealing ways.

On average, women trainees received slightly lower numerical scores than their male counterparts, despite no evidence of performance differences. More strikingly, the language used in narrative comments showed clear patterns: male trainees were more likely to be described with competence-oriented language (“knowledgeable,” “confident,” “leader”), while women were more often praised for warmth and diligence (“caring,” “hard-working,” “team player”).

These aren’t new stereotypes, but their persistence in our evaluation systems is troubling. When we unconsciously associate men with competence and women with effort or empathy, we risk reinforcing old hierarchies. Even well-intentioned praise can pigeonhole trainees in ways that affect advancement, self-perception, and professional identity.

When bias feels like familiarity…

What’s particularly interesting is how these newer findings echo—and contrast with—what we saw back in 2004. Our study didn’t find any gender differences specifically, but we did notice that evaluators closer to the front lines (senior residents) tended to focus more on relationships, encouragement, and potential. Faculty, on the other hand, and particularly senior faculty, leaned toward more critical or objective assessments.

What happens, then, when those lenses intersect with gender? Are residents more likely to relate to and uplift women colleagues in the same way they uplift peers generally? Or does bias show up even in peer feedback, especially in high-stakes environments like residency? Hane’s study doesn’t fully answer that, but it opens the door for future research—and introspection.

The Myth of the “Objective” Evaluation

One of the biggest myths in medical education is that our evaluations are merit-based and free from bias. We put a lot of stock in numerical ratings, milestone checkboxes, and structured forms. But as both of these studies remind us, the numbers are only part of the story—and even they are shaped by deeper cultural narratives.

If you’ve ever read a stack of end-of-rotation evaluations, you know how much weight narrative comments can carry. One well-written paragraph can influence a Clinical Competency Committee discussion more than a dozen Likert-scale boxes. So when those comments are subtly gendered—when one resident is “sharp and assertive” and another is “kind and dependable”—we’re not just describing; we’re defining their potential.  And that’s a problem.

What Can We Do About It?

Fortunately, awareness is the first step to addressing bias, and there are concrete steps we can take. Here are a few that I think are worth highlighting:

1. Train faculty and residents on implicit bias in evaluations. The research is clear: we all carry unconscious biases. But bias awareness training—when done well—can reduce the influence of those biases, especially in high-stakes assessments.

2. Structure narrative feedback to reduce ambiguity. Ask evaluators to comment on specific competencies (e.g., clinical reasoning, professionalism, communication) rather than open-ended impressions. This can shift focus from personal attributes to observable behaviors.

3. Use language analysis tools to monitor patterns. Some residency programs are now using AI tools to scan applications for gendered language (3) and to review letters of recommendation for concerning language (4). It’s not about punishing faculty—it’s about reflection and improvement. (A minimal sketch of what this kind of scan can look like follows this list.)

4. Encourage multiple perspectives. A single evaluation can reflect a single bias. Triangulating feedback from residents, peers, patients, and faculty can provide a fuller, fairer picture of a learner’s strengths and areas for growth.

5. Revisit how we use evaluations in decisions. Promotion and remediation decisions should weigh context. A low rating from one evaluator might reflect bias more than performance. Committees need to be trained to interpret evaluations with a critical eye.
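
For programs curious about what that kind of language scan actually involves, it does not take a commercial product to see the basic idea. Below is a minimal sketch in Python. It is not the method used by Hane and colleagues or by the tools cited above: the word lists, the data format, and the field names (trainee_gender, comment) are illustrative assumptions only.

```python
# Minimal sketch: count stereotypically "agentic" (competence) vs. "communal"
# (warmth) descriptors in narrative comments, grouped by trainee gender.
# The word lists, data format, and field names are illustrative assumptions,
# not the approach used in the studies cited in this post.
import re
from collections import Counter

AGENTIC = {"knowledgeable", "confident", "leader", "decisive", "assertive", "sharp"}
COMMUNAL = {"caring", "hard-working", "team player", "kind", "dependable", "warm"}

def descriptor_counts(comment: str) -> Counter:
    """Count agentic and communal descriptors appearing in one comment."""
    text = comment.lower()
    counts = Counter()
    for term in AGENTIC | COMMUNAL:
        if re.search(r"\b" + re.escape(term) + r"\b", text):
            counts["agentic" if term in AGENTIC else "communal"] += 1
    return counts

def summarize(evaluations: list[dict]) -> dict:
    """Aggregate descriptor counts by trainee gender."""
    totals: dict[str, Counter] = {}
    for ev in evaluations:
        totals.setdefault(ev["trainee_gender"], Counter()).update(
            descriptor_counts(ev["comment"])
        )
    return totals

if __name__ == "__main__":
    sample = [
        {"trainee_gender": "M", "comment": "Confident, knowledgeable, a natural leader."},
        {"trainee_gender": "W", "comment": "Caring and hard-working; a real team player."},
    ]
    print(summarize(sample))
```

Real tools rely on much richer lexicons and statistical controls, but even a crude count like this can start a useful conversation at a program evaluation meeting about whose comments sound like competence and whose sound like effort.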

We’re All Still Learning

As someone who’s worked in medical education for decades, I can say with humility that I’ve probably written my fair share of biased evaluations. Not intentionally, but unavoidably. Like most educators, I want to be fair, supportive, and accurate—but we’re all products of our environments. Recognizing that is not an indictment. It’s an invitation.

The Hane study reminds us that even as our systems evolve, old habits linger. The Ringdahl, Delzell & Kruse study showed that who does the evaluating matters. Put those together, and the message is clear: we need to continuously examine how—and by whom—assessments are being made.

Because in the end, evaluations are not just about feedback. They’re about opportunity, identity, and trust. If we want our learning environments to be truly inclusive and equitable, then we have to be willing to see where our blind spots are—and do the hard work of correcting them.

References

(1) Ringdahl EN, Delzell JE, Kruse RL. Evaluation of interns by senior residents and faculty: is there any difference? Med Educ. 2004; 38: 646-651. doi: 10.1111/j.1365-2929.2004.01832.x

(2) Hane J, Lee V, Zhou Y, Mustapha T, et al. Examining Gender-Based Differences in Quantitative Ratings and Narrative Comments in Faculty Assessments by Residents and Fellows. J Grad Med Educ. 2025; 17(3): 338-346. doi: 10.4300/JGME-D-24-00627.1

(3) Sumner MD, Howell TC, Soto AL, Kaplan S, et al. The Use of Artificial Intelligence in Residency Application Evaluation-A Scoping Review. J Grad Med Educ. 2025; 17(3): 308-319. doi: 10.4300/JGME-D-24-00604.1. Epub 2025 Jun 16. PMID: 40529251; PMCID: PMC12169010.

(4) Sarraf D, Vasiliu V, Imberman B, Lindeman B. Use of artificial intelligence for gender bias analysis in letters of recommendation for general surgery residency candidates. Am J Surg. 2021; 222(6): 1051-1059. doi: 10.1016/j.amjsurg.2021.09.034. Epub 2021 Oct 2. PMID: 34674847.


Tuesday, July 8, 2025

Medical Student Reflection Exercises Created Using AI: Can We Tell the Difference, and Does It Matter?

 


If you’ve spent any time in medical education recently—whether in lectures, clinical supervision, or curriculum design—you’ve likely been a part of the growing conversation around student (and resident/fellow) use of generative AI. From drafting SOAP notes to summarizing journal articles, AI tools like ChatGPT are rapidly becoming ubiquitous. But now we’re seeing them show up in more personal activities such as reflective assignments. A new question has emerged: can educators really tell the difference between a student’s genuine reflection and something written by AI?

The recent article in Medical Education by Wraith et al. (1) took a shot at this question. They conducted an elegant, slightly disconcerting study: faculty reviewers were asked to distinguish between reflections submitted by actual medical students and reflections generated by AI. The results? Better than flipping a coin, but not by as much as you might hope: accuracy ranged from 64% to 75%, regardless of the faculty member’s experience or confidence. Reviewers did seem to improve as they read more reflections.

I’ll admit, when I first read this, I had a visceral reaction. Something about the idea that we can’t tell what’s “real” from what’s machine-generated in a genre that is supposed to be deeply personal—reflective writing—felt jarring. Aren’t we trained to pick up on nuance, empathy, sincerity? But as I sat with it, I realized the issue goes much deeper than just our ability to “spot the fake.” It forces us to confront how we define authenticity, the purpose of reflection in medical education, and how we want to relate to the tools that are now part of our students’ daily workflows.

What Makes a Reflection Authentic?

We often emphasize reflection as a professional habit: a way to develop clinical insight, emotional intelligence, and lifelong learning. But much of that hinges on the assumption that the act of writing the reflection is what promotes growth. If a student bypasses that internal process and asks an AI to “write a reflection on breaking bad news to a patient,” I worry that the learning opportunity is lost.

But here’s the rub: the Wraith study didn’t test whether students were using AI to replace reflection or to aid it. It simply asked whether educators could tell the difference. And they could not do that reliably. This suggests that AI can replicate the tone, structure, and emotional cadence that we expect a medical student to provide in a reflective essay. That is both fascinating and problematic.

If AI can mimic reflective writing well enough to fool seasoned educators, then maybe it is time to reevaluate how we assess reflection in the first place. Are we grading sincerity? Emotional language? The presence of keywords like “empathy,” “growth,” or “uncertainty”? If we do not have a robust framework for evaluating whether reflection is actually happening—as an internal, cognitive-emotional process—then it shouldn’t surprise us that AI can fake it by simply checking the boxes.

Faculty Attitudes: Cautious Curiosity

Another recent study, this one in the Journal of Investigative Medicine by Cervantes et al (2), explored how medical educators are thinking about generative AI more broadly. They surveyed 250 allopathic and osteopathic medical school faculty at Nova Southeastern University. Their results revealed a mix of excitement and unease. Most saw potential for improving education—particularly through more efficient research, tutoring, task automation, and increased content accessibility—but they were also deeply concerned about professionalism, academic integrity, the removal of human interaction from important feedback, and overreliance on AI-generated content.

Interestingly, one of the biggest predictors of positive attitudes toward AI was prior use. Faculty who had experimented with ChatGPT or similar tools were more likely to see educational value and less likely to view it as a threat. That tracks with my own anecdotal experience: once people see what AI can do—and just as importantly, what it can’t do—they develop a more nuanced, measured perspective.

Still, the discomfort lingers. If students can generate polished reflections without deep thought, is the assignment still worth doing? Should we redesign reflective writing tasks to include oral defense or peer feedback? Or should we simply accept that AI will be part of the process and shift our focus toward cultivating meaningful inputs rather than fixating on outputs?

What about using AI-augmented reflection?

Let me propose a middle path. What if we reframe AI not as a threat to reflective writing, but as a catalyst? Imagine a student who types out some thoughts after a tough patient encounter, then asks an AI to help clarify or expand them. They read what the AI produces, agree with some parts, reject others, revise accordingly. The final product is stronger—not because AI did the work, but because it facilitated a richer internal dialogue.

That’s not cheating. That’s collaboration. And it’s arguably closer to how most of us write in real life—drafting, editing, bouncing ideas off others (human or machine). Of course, this assumes we teach students to use AI ethically and reflectively, which means we need to model that ourselves. Faculty development around AI literacy is no longer optional. We must move beyond fear-based policies and invest in practical training, guidelines, and conversations that encourage responsible use.
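
To make that concrete, here is a minimal sketch of the "AI as coach, not ghostwriter" idea, assuming the OpenAI Python client (openai>=1.0) and an API key in the OPENAI_API_KEY environment variable; the model name and prompt wording are illustrative choices, not recommendations from the studies discussed here.

```python
# Minimal sketch of "AI as coach, not ghostwriter": the system prompt asks the
# model to pose questions rather than rewrite the student's reflection.
# Assumes the OpenAI Python client (openai>=1.0) and OPENAI_API_KEY set.
from openai import OpenAI

COACH_PROMPT = (
    "You are a reflective-writing coach for a medical student. "
    "Do not rewrite or expand the reflection yourself. Instead, ask two or "
    "three specific questions that push the student to examine their own "
    "reactions, assumptions, and what they might do differently next time."
)

def coach_reflection(draft: str, model: str = "gpt-4o") -> str:
    """Return coaching questions for a student's draft reflection."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": COACH_PROMPT},
            {"role": "user", "content": draft},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    draft = (
        "Today I watched my attending deliver a new cancer diagnosis. "
        "I felt useless standing in the corner, and I am not sure what "
        "I should have done differently."
    )
    print(coach_reflection(draft))
```

The design choice is the point: because the prompt withholds rewriting, the student still does the reflective work, and the AI's role is limited to deepening it.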

So, where do we go from here?

A few concrete steps seem worth considering:

1. Redesign reflective assignments. Move beyond short essays. Try audio reflections, peer feedback, or structured prompts that emphasize personal growth over polished prose.

2. Focus on process, not just product. Ask students to document how they engaged with the reflection—did they use AI? Did they discuss it with a peer or preceptor? Did it change their thinking?

3. Embrace transparency. Normalize the use of AI in education and ask students to disclose when and how they used it. Make that part of the learning conversation from the beginning.

4. Invest in AI literacy. Faculty need space and time to learn what these tools can and can’t do. The more familiar we are as faculty, the better we can guide our students.

5. Stay curious. The technology isn’t going away. The sooner we stop wringing our hands and start asking deeper pedagogical questions, the better positioned we’ll be to adapt with purpose.

In the end, the real question isn’t “Can we tell if a reflection is AI-generated?” It’s “Are we creating learning environments where authentic reflection is valued, supported, and developed—whether or not AI is in the room?” 

If we can answer yes to that, then maybe it doesn’t matter so much who—or what—wrote the first draft.

References

(1) Wraith C, Carnegy A, Brown C, Baptista A, Sam AH. Can educators distinguish between medical student and generative AI-authored reflections? Med Educ. 2025; 1-8. doi: 10.1111/medu.15750

(2) Cervantes J, Smith B, Ramadoss T, D'Amario V, Shoja MM, Rajput V. Decoding medical educators' perceptions on generative artificial intelligence in medical education. J Invest Med. 2024; 72(7): 633-639. doi: 10.1177/10815589241257215