Residency Selection in the Age of Artificial Intelligence: Promise and Peril
Residency selection has always been a high-stakes,
high-stress process. For applicants, it can feel like condensing years of study
and service into a few fleeting data points. For programs, it is like drinking
from a firehose—thousands of applications to sift through with limited time and
resources, and an abiding fear of missing the “right fit.” In recent years, the
pressures have only grown: more applications, Step 1 shifting to pass/fail, and
increased calls for holistic review in the name of equity and mission
alignment.
Into this crucible comes artificial intelligence (AI).
Advocates promise that AI can tame the flood of applications, find overlooked
gems, and help restore a measure of balance to an overloaded system. Critics
worry that it will encode and amplify existing biases, creating new blind spots
behind the sheen of algorithmic authority. A set of recent papers provides a
window into this crossroads, with one central question: will AI be a tool for
fairness, or just a faster way of making the same mistakes?
What We Know About Interviews
Before even considering AI, it helps to step back and
look at the traditional residency interview process. Lin and colleagues (1)
recently published a systematic review of evidence-based practices for
interviewing residency applicants in the Journal of Graduate Medical Education.
Their review of nearly four decades of research is sobering: most studies are
low to moderate quality, and many of our cherished traditions—long unstructured
interviews, interviewer “gut feelings”—have little evidence behind them. What
does work? Structure helps. The multiple mini interview (MMI) shows validity
and reliability. Interviewer training improves consistency. Applicants prefer
shorter, one-on-one conversations, and they value time with current residents.
Even virtual interviews, despite mixed reviews, save money and broaden access.
In other words, structure beats vibe. If interviews
are going to continue as a central part of residency selection, they need to be
thoughtfully designed and consistently delivered.
The Scoping Review: AI Arrives
The most important new contribution to this debate is
Sumner and colleagues’ scoping review in JGME (2). They examined the small but
growing literature on AI in residency application review. Of the twelve studies
they found, three-quarters focused on predicting interview offers or rank list
positions using machine learning. Three
articles used natural language processing (NLP) to review and analyze letters
of recommendation.
The results are promising but fragmented. Some models
could replicate or even predict program decisions with decent accuracy. Others
showed how NLP might highlight subtle patterns in narrative data, such as
differences in the language of recommendation letters. But strikingly, only a
quarter of the studies explicitly modeled bias. Most acknowledged it as a
limitation but stopped short of systematically addressing it. The authors
conclude that AI in residency recruitment is here, but it is underdeveloped,
under-regulated, and under-evaluated. Without common standards for reporting
accuracy, fairness, and transparency, we risk building shiny black boxes that
give an illusion of precision while quietly perpetuating inequity.
Early Prototypes in Action
Several studies give us a glimpse of what AI might
look like in practice. Burk-Rafel and colleagues at NYU (3) developed a machine
learning–based decision support tool, trained on over 8,000 applications across
three years of internal medicine interview cycles. The training data comprised 61
features, among them demographics, time since graduation, medical school
location, USMLE scores or score status, awards such as AOA membership, and
publications. Their model achieved an area under the receiver operating
characteristic curve (AUROC) of 0.95 and performed nearly as well (0.94)
without USMLE scores. Interestingly, when deployed prospectively,
it identified twenty applicants for interview who had been overlooked by human reviewers,
many of whom later proved strong candidates. Here, AI wasn’t replacing judgment
but augmenting it, catching “diamonds in the rough” that busy faculty reviewers
had missed.
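To make the idea concrete, here is a minimal sketch of how such a screening model might be assembled. It is not the NYU team's pipeline; the file, feature names, and label column are hypothetical stand-ins.

```python
# Minimal sketch of an applicant-screening model in the spirit of
# Burk-Rafel et al. -- not their actual pipeline. The CSV file, feature
# names, and label column are hypothetical stand-ins.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

df = pd.read_csv("applications.csv")  # one row per application
features = ["years_since_graduation", "num_publications", "step2_score",
            "aoa_member", "us_grad"]  # stand-ins for the ~61 real features
X, y = df[features], df["interview_offered"]  # label from past cycles

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

model = GradientBoostingClassifier().fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]
print("AUROC:", roc_auc_score(y_test, scores))

# Surface "diamonds in the rough": applications the model scores highly
# but that were not invited, flagged for a second human review.
mask = (scores > 0.9) & (y_test.to_numpy() == 0)
overlooked = X_test[mask]
```

The last step mirrors the augmentation idea: high-scoring but uninvited applications are flagged for a second human look rather than acted on automatically.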
Rees and Ryder’s work (4), published in Teaching and
Learning in Medicine, took a different angle, building random forest models to
predict ranked applicants and matriculants in internal medicine. Their models
could predict with high accuracy (AUROC 0.925) who would be ranked, but
struggled to predict who would ultimately matriculate (AUROC 0.597). The
lesson: AI may be able to mimic program decisions,
but it is far less certain whether those decisions correlate with outcomes that
matter—like performance, retention, or alignment with mission.
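The gap between mimicking decisions and predicting outcomes can be illustrated by training the same model on two different labels. The sketch below assumes hypothetical column names and is not the authors' published code.

```python
# Sketch of the Rees and Ryder comparison under assumed column names:
# the same features predict two different labels, and the AUROC gap shows
# how much harder matriculation is to predict than ranking decisions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("applicants.csv")  # hypothetical file
X = df[["step2_score", "num_publications", "us_grad", "research_years"]]

for target in ["ranked", "matriculated"]:  # assumed binary label columns
    model = RandomForestClassifier(n_estimators=500, random_state=0)
    auroc = cross_val_score(model, X, df[target], cv=5, scoring="roc_auc").mean()
    print(f"{target}: mean cross-validated AUROC = {auroc:.3f}")
```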
Finally, Hassan and colleagues in the Journal of
Surgical Education (5) directly compared AI with manual selection of surgical
residency applicants. Their findings were provocative: the two applicant lists
(AI selected versus program director selected) overlapped by only 7.4%. The AI
identified high-performing applicants with efficiency comparable to traditional
manual selection, but there were significant differences: the AI-selected
applicants were more frequently white or Hispanic (p<0.001), more often US
medical graduates (p=0.027), younger (p=0.024), and had more publications
(p<0.001). This raises questions about both list-generation processes, as well
as about transparency and acceptance by faculty. Program faculty trust their
own collective wisdom, but will they trust a machine learning process that
highlights candidates they initially passed over?
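For a sense of the mechanics behind such a comparison, the sketch below uses invented applicant IDs and counts to show how list overlap and a difference in group proportions might be quantified; it is not Hassan and colleagues' analysis.

```python
# Illustrative only -- not data from Hassan et al. Quantify how much two
# selection lists agree and whether a group proportion differs between them.
from scipy.stats import chi2_contingency

ai_list = {"A12", "A47", "A03", "A88", "A19"}  # hypothetical applicant IDs
pd_list = {"A47", "A55", "A61", "A72", "A90"}  # program director's list

# One simple definition of overlap: shared applicants over the combined pool.
overlap = len(ai_list & pd_list) / len(ai_list | pd_list)
print(f"List overlap: {overlap:.1%}")

# 2x2 table of US graduates vs international graduates on each list;
# the counts are invented purely to show the test.
table = [[48, 12],   # AI list: [US grads, IMGs]
         [38, 22]]   # PD list
chi2, p, dof, expected = chi2_contingency(table)
print(f"US-grad proportions differ between lists: chi2={chi2:.2f}, p={p:.3f}")
```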
Where AI Could Help
Taken together, these studies suggest that AI could
help in several ways:
- Managing volume: AI tools can quickly sort thousands
of applications, highlighting candidates who meet baseline thresholds or who
might otherwise be filtered out by crude metrics.
- Surfacing hidden talent: By integrating many data points, AI may identify
applicants overlooked because of a single weak metric, such as a lower Step
score or an atypical background.
- Standardizing review: Algorithms can enforce consistency, reducing the
idiosyncrasies of individual reviewers.
- Exposing bias: When designed well, AI can make explicit the patterns of
selection, shining light on where programs may unintentionally disadvantage
certain groups (a simple audit sketch follows this list).
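That bias-exposure idea can be made concrete with a small audit. The sketch below is illustrative only, assuming a hypothetical file of past screening decisions; it compares invitation rates across applicant groups and reports a disparate-impact ratio.

```python
# Minimal audit sketch under a hypothetical file and column names:
# compare interview-invitation rates across applicant groups and compute
# a disparate-impact ratio (rule of thumb: ratios below 0.8 merit review).
import pandas as pd

df = pd.read_csv("screening_decisions.csv")  # past screening decisions

rates = df.groupby("applicant_group")["invited_to_interview"].mean()
print(rates)

ratio = rates.min() / rates.max()
print(f"Disparate-impact ratio: {ratio:.2f}")
```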
Where AI Could Harm
But the risks are equally real:
- Amplifying bias: Models trained on past decisions
will replicate the biases of those decisions. If a program historically favored
certain schools or demographics, the algorithm will “learn” to do the same.
- False precision: High AUROC scores may mask the reality that models are only
as good as their training data. Predicting interviews is not the same as
predicting good residents.
- Transparency and trust: Faculty may resist adopting tools they don’t
understand, and applicants may lose faith in a process that feels automated and
impersonal.
- Gaming the system: When applicants learn which features are weighted, they
may tailor applications to exploit those cues—turning AI from a tool for
fairness into just another hoop to jump through.
Broad Reflections: The Future of Recruitment
What emerges from these studies is less a roadmap and
more a set of crossroads. Residency recruitment is under enormous pressure. AI
offers tantalizing relief, but also real danger.
For programs, the key is humility and intentionality.
AI should never completely replace human judgment, but it can augment it.
Program directors can use AI to help manage scale, to catch outliers, and to
audit their own biases. But the human values—commitment to service, value in diversity,
and the mission of training compassionate physicians—cannot be delegated to an
algorithm.
For applicants, transparency matters most. A process
already viewed as opaque will only grow more fraught if decisions are seen as
coming from a black box. Clear communication about how AI is being used, and
ongoing study of its impact on residency selection, are essential.
For the medical education community, the moment calls
for leadership. We need reporting standards for AI models, fairness audits, and
shared best practices. Otherwise, each program will reinvent the wheel—and the
mistakes.
Residency recruitment has always been an imperfect
science, equal parts art and data. AI does not change that. What it does offer
is a new lens—a powerful, potentially distorting one. Our task is not to
embrace it blindly nor to reject it out of fear, but to use it wisely, always
remembering that behind every application is a human being hoping for a chance
to serve.
References
(1) Lin JC, Hu DJ, Scott IU, Greenberg PB. Evidence-based practices for interviewing
graduate medical education applicants: A systematic review. J Grad Med Educ.
2024; 16 (2): 151-165.
(2) Sumner MD, Howell TC, Soto AL, et al. The use of
artificial intelligence in residency application evaluation: A scoping review. J
Grad Med Educ. 2025; 17 (3): 308-319.
(3) Burk-Rafel J, Reinstein I, Feng J, et al.
Development and validation of a machine learning–based decision support tool
for residency applicant screening and review. Acad Med. 2021; 96 (11S): S54-S61.
(4) Rees CA, Ryder HF. Machine learning for the
prediction of ranked applicants and matriculants to an internal medicine
residency program. Teach Learn Med. 2022; 35 (3): 277-286.
(5) Hassan S, et al. Artificial intelligence compared
to manual selection of prospective surgical residents. J Surg Educ.
2025; 82 (1): 103308.