Introduction

This blog is about medical education in the US and around the world. My interest is in education research and the process of medical education.



The lawyers have asked that I add a disclaimer that makes it clear that these are my personal opinions and do not represent any position of any University that I am affiliated with including the American University of the Caribbean, the University of Kansas, the KU School of Medicine, Florida International University, or the FIU School of Medicine. Nor does any of this represent any position of the Northeast Georgia Medical Center or Northeast Georgia Health System.



Wednesday, September 3, 2025

Residency Selection in the Age of Artificial Intelligence: Promise and Peril


Residency selection has always been a high-stakes, high-stress process. For applicants, it can feel like condensing years of study and service into a few fleeting data points. For programs, it is like drinking from a firehose—thousands of applications to sift through with limited time and resources, and an abiding fear of missing the “right fit.” In recent years, the pressures have only grown: more applications, Step 1 shifting to pass/fail, and increased calls for holistic review in the name of equity and mission alignment. 

Into this crucible comes artificial intelligence (AI). Advocates promise that AI can tame the flood of applications, find overlooked gems, and help restore a measure of balance to an overloaded system. Critics worry that it will encode and amplify existing biases, creating new blind spots behind the sheen of algorithmic authority. A set of recent papers provides a window into this crossroads, with one central question: will AI be a tool for fairness, or just a faster way of making the same mistakes?

What We Know About Interviews   

Before even considering AI, it helps to step back and look at the traditional residency interview process. Lin and colleagues (1) recently published a systematic review of evidence-based practices for interviewing residency applicants in Journal of Graduate Medical Education. Their review of nearly four decades of research is sobering: most studies are low to moderate quality, and many of our cherished traditions—long unstructured interviews, interviewer “gut feelings”—have little evidence behind them. What does work? Structure helps. The multiple mini interview (MMI) shows validity and reliability. Interviewer training improves consistency. Applicants prefer shorter, one-on-one conversations, and they value time with current residents. Even virtual interviews, despite mixed reviews, save money and broaden access. 

In other words, structure beats vibe. If interviews are going to continue as a central part of residency selection, they need to be thoughtfully designed and consistently delivered.

The Scoping Review: AI Arrives 

The most important new contribution to this debate is Sumner and colleagues’ scoping review in JGME (2). They examined the small but growing literature on AI in residency application review. Of the twelve studies they found, three-quarters focused on predicting interview offers or rank list positions using machine learning.  Three articles used natural language processing (NLP) to review and analyze letters of recommendation. 

The results are promising but fragmented. Some models could replicate or even predict program decisions with decent accuracy. Others showed how NLP might highlight subtle patterns in narrative data, such as differences in the language of recommendation letters. But strikingly, only a quarter of the studies explicitly modeled bias. Most acknowledged it as a limitation but stopped short of systematically addressing it. The authors conclude that AI in residency recruitment is here, but it is underdeveloped, under-regulated, and under-evaluated. Without common standards for reporting accuracy, fairness, and transparency, we risk building shiny black boxes that give an illusion of precision while quietly perpetuating inequity.
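None of the reviewed studies publish their code here, but a toy sketch can make concrete what "differences in the language of recommendation letters" looks like as an analysis. The Python snippet below uses made-up letter excerpts and an arbitrary grouping, and simply compares per-1,000-word frequencies of a few descriptor terms; real NLP pipelines are far more sophisticated, but the underlying idea is the same.

```python
from collections import Counter
import re

def word_rates(letters):
    """Per-1,000-word frequency of each word across a set of letter texts."""
    counts, total = Counter(), 0
    for text in letters:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(words)
        total += len(words)
    return {w: 1000 * c / total for w, c in counts.items()}

# Hypothetical letter excerpts, split by any grouping of interest
group_a = ["She is a brilliant and independent researcher with a sharp mind."]
group_b = ["He is a hardworking and caring team player who is well liked."]

rates_a, rates_b = word_rates(group_a), word_rates(group_b)
for term in ["brilliant", "caring", "hardworking", "independent"]:
    print(f"{term:12s} group_a={rates_a.get(term, 0.0):6.1f}  group_b={rates_b.get(term, 0.0):6.1f}")
```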

Early Prototypes in Action

Several studies give us a glimpse of what AI might look like in practice. Burk-Rafel and colleagues at NYU (3) developed a machine learning–based decision support tool, trained on over 8,000 applications across three years of internal medicine interview cycles. The training data comprised 61 features, including demographics, time since graduation, medical school location, USMLE scores or status, awards (AOA), and publications, among many others. Their model achieved an area under the receiver operating characteristic curve (AUROC) of 0.95 and performed nearly as well (0.94) without USMLE scores. Interestingly, when deployed prospectively, it identified twenty applicants for interview who had been overlooked by human reviewers, many of whom later proved to be strong candidates. Here, AI wasn’t replacing judgment but augmenting it, catching “diamonds in the rough” that busy faculty reviewers had missed.
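For readers curious what such a screening model looks like mechanically, here is a minimal, hedged sketch in Python with scikit-learn. It is not NYU's model: the features are random stand-ins for the 61 application variables, the labels are synthetic past invite decisions, and the 0.8 score cutoff for flagging overlooked applicants is an arbitrary illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(8000, 61))                        # stand-in for 61 application features
y = (X[:, 0] + rng.normal(size=8000) > 0).astype(int)  # synthetic past invite decisions

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]
print("AUROC:", round(roc_auc_score(y_te, scores), 3))

# "Diamonds in the rough": applicants the model rates highly but humans did not invite
overlooked = np.where((scores > 0.8) & (y_te == 0))[0]
print("Held-out applicants worth a second look:", overlooked[:20])
```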

Rees and Ryder’s work (4), published in Teaching and Learning in Medicine, took a different angle, building random forest machine learning models to predict ranked applicants and matriculants in internal medicine. Their models could predict with high accuracy (AUROC 0.925) who would be ranked, but struggled to predict who would ultimately matriculate (AUROC 0.597). The lesson: AI may be able to mimic program decisions, but it is far less certain whether those decisions correlate with outcomes that matter—like performance, retention, or alignment with mission.
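A small synthetic illustration of the same idea: train the same random forest on two different labels, one that closely tracks the application features (being ranked) and one the features cannot see (matriculating), and watch the AUROC diverge. The data and effect sizes below are invented, not Rees and Ryder's.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 20))
ranked = (X[:, 0] + 0.2 * rng.normal(size=2000) > 0).astype(int)  # label driven by the features
matriculated = rng.integers(0, 2, size=2000)                      # label unrelated to the features

for name, y in [("ranked", ranked), ("matriculated", matriculated)]:
    model = RandomForestClassifier(n_estimators=200, random_state=1)
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name:13s} AUROC = {auc:.3f}")
```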

Finally, Hassan and colleagues in the Journal of Surgical Education (5) directly compared AI with manual selection of surgical residency applicants. Their findings were provocative: the two applicant lists (AI-selected vs program director-selected) overlapped by only 7.4%. AI was able to identify high-performing applicants with efficiency comparable to traditional manual selection, but there were significant differences. The AI-selected applicants were more frequently white or Hispanic (p<0.001), more often US medical graduates (p=0.027), younger (p=0.024), and had more publications (p<0.001). This raises questions about both list-generation processes, as well as about transparency and acceptance by faculty. Program faculty trust their own collective wisdom, but will they trust a machine learning process that highlights candidates they initially passed over?
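The basic comparisons in a study like this are easy to sketch. The snippet below, with made-up applicant IDs and invented counts, shows one way to quantify overlap between two selection lists and to test a demographic difference with a chi-square test; the paper's actual definitions and variables may differ.

```python
from scipy.stats import chi2_contingency

# Made-up applicant IDs on each list
ai_list = {"A03", "A11", "A17", "A22", "A31"}
pd_list = {"A05", "A11", "A19", "A28", "A40"}
overlap = len(ai_list & pd_list) / len(ai_list | pd_list)
print(f"List overlap: {overlap:.1%}")

# 2x2 table with invented counts: rows = selection method, columns = US grad vs IMG
table = [[70, 30],   # AI-selected
         [55, 45]]   # program director-selected
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, p = {p:.3f}")
```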

Where AI Could Help

Taken together, these studies suggest that AI could help in several ways:

- Managing volume: AI tools can quickly sort thousands of applications, highlighting candidates who meet baseline thresholds or who might otherwise be filtered out by crude metrics.
- Surfacing hidden talent: By integrating many data points, AI may identify applicants overlooked because of a single weak metric, such as a lower Step score or an atypical background.
- Standardizing review: Algorithms can enforce consistency, reducing the idiosyncrasies of individual reviewers.
- Exposing bias: When designed well, AI can make explicit the patterns of selection, shining light on where programs may unintentionally disadvantage certain groups (see the sketch below).
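As a concrete example of the "exposing bias" point above, a first-pass audit can be as simple as comparing interview-offer rates across applicant groups. The sketch below uses hypothetical records and computes an adverse-impact ratio, one common screening heuristic; a real audit would need to be far more careful about confounders, definitions of groups, and sample size.

```python
from collections import defaultdict

# Hypothetical records: (applicant group, received an interview offer?)
applicants = [("group_1", True), ("group_1", False), ("group_1", True),
              ("group_2", False), ("group_2", False), ("group_2", True)]

offers, totals = defaultdict(int), defaultdict(int)
for group, offered in applicants:
    totals[group] += 1
    offers[group] += int(offered)

rates = {g: offers[g] / totals[g] for g in totals}
print("Offer rates by group:", rates)
print("Adverse-impact ratio:", round(min(rates.values()) / max(rates.values()), 2))
```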

Where AI Could Harm

But the risks are equally real:

- Amplifying bias: Models trained on past decisions will replicate the biases of those decisions. If a program historically favored certain schools or demographics, the algorithm will “learn” to do the same.
- False precision: High AUROC scores may mask the reality that models are only as good as their training data. Predicting interviews is not the same as predicting good residents.
- Transparency and trust: Faculty may resist adopting tools they don’t understand, and applicants may lose faith in a process that feels automated and impersonal.
- Gaming the system: When applicants learn which features are weighted, they may tailor applications to exploit those cues—turning AI from a tool for fairness into just another hoop to jump through.

Broad Reflections: The Future of Recruitment

What emerges from these studies is less a roadmap and more a set of crossroads. Residency recruitment is under enormous pressure. AI offers tantalizing relief, but also real danger.

For programs, the key is humility and intentionality. AI should never completely replace human judgment, but it can augment it. Program directors can use AI to help manage scale, to catch outliers, and to audit their own biases. But the human values—commitment to service, the value of diversity, and the mission of training compassionate physicians—cannot be delegated to an algorithm.

For applicants, transparency matters most. A process already viewed as opaque will only grow more fraught if decisions are seen as coming from a black box. Clear communication about how AI is being used, and ongoing study of its impact on residency selection, are essential.

For the medical education community, the moment calls for leadership. We need reporting standards for AI models, fairness audits, and shared best practices. Otherwise, each program will reinvent the wheel—and the mistakes.

Residency recruitment has always been an imperfect science, equal parts art and data. AI does not change that. What it does offer is a new lens—a powerful, potentially distorting one. Our task is not to embrace it blindly nor to reject it out of fear, but to use it wisely, always remembering that behind every application is a human being hoping for a chance to serve.

References

(1) Lin JC, Hu DJ, Scott IU, Greenberg PB. Evidence-based practices for interviewing graduate medical education applicants: A systematic review. J Grad Med Educ. 2024; 16 (2): 151-165.

(2) Sumner MD, Howell TC, Soto AL, et al. The use of artificial intelligence in residency application evaluation: A scoping review. J Grad Med Educ. 2025; 17 (3): 308-319.

(3) Burk-Rafel J, Reinstein I, Feng J, et al. Development and validation of a machine learning–based decision support tool for residency applicant screening and review. Acad Med. 2021; 96 (11S): S54-S61.

(4) Rees CA, Ryder HF. Machine learning for the prediction of ranked applicants and matriculants to an internal medicine residency program. Teach Learn Med. 2022; 35 (3): 277-286.

(5) Hassan S, et al. Artificial intelligence compared to manual selection of prospective surgical residents. J Surg Educ. 2025; 82 (1): 103308.
