What if the most confident answer in the room is also the most misleading?
Large language models can ace medical exams, yet falter when faced with a real person’s messy, incomplete story. In this episode, we explore how that gap plays out in one of medicine’s highest-stakes decisions: triage. Drawing on Laura’s experience in emergency medicine and Vasanth’s background in AI research, we unpack a new study where laypeople role-played both routine and high-risk conditions and turned to leading LLMs for advice. The surprising twist? Tiny shifts in phrasing produced opposite recommendations—“rest at home” versus “go to the ER”—revealing how sensitive these systems are to prompts, and how an agreeable tone can drown out critical clinical signals.
We take you inside the exam room to contrast this with what clinicians actually do. Real diagnosis isn’t a single question and answer; it’s an evolving process. Doctors gather a history that unfolds with each response, test competing hypotheses, and scan for subtle red flags and nonverbal cues that never show up in a chat window. From the ominous “worst headache of my life” to abdominal pain that could signal gallstones, or even a heart attack, Laura explains how risk-first thinking and strategic follow-ups shape safe decisions. Meanwhile, Vasanth breaks down how preference-tuned models are trained to satisfy users rather than challenge them, and why linguistic confidence can increase even as clinical accuracy declines. The study’s findings are sobering: the models struggled to identify key conditions, and their triage decisions were no better than those of basic symptom checkers.
But this isn’t a story of hype or doom—it’s about design. Reliable medical AI must interrogate before it interprets. That means structured red-flag checks, resistance to user-led anchors like “maybe it’s just stress,” and clear, actionable next steps instead of overwhelming option lists. Calibrated uncertainty, transparent reasoning, and human oversight can transform AI from a risky decider into a valuable assistant.
If you care about digital health, safe triage, and the future of human-AI collaboration in medicine, this conversation offers a grounded look at both the limits—and the real promise—of these tools.
If this episode resonated, follow the show, share it with a colleague, and leave a quick review to help more listeners discover Code and Cure.
Reference:
Andrew M. Bean et al., “Reliability of LLMs as medical assistants for the general public: a randomized preregistered study,” Nature Medicine (2026).
Credits:
Theme music: Nowhere Land, Kevin MacLeod (incompetech.com)
Licensed under Creative Commons: By Attribution 4.0
https://creativecommons.org/licenses/by/4.0/