Paper Discussed in this AI Journal Club:
Wienholt, P., Caselitz, S., Siepmann, R. et al. Hallucination filtering in radiology vision-language models using discrete semantic entropy. Eur Radiol (2026). https://doi.org/10.1007/s00330-026-12384-z
Episode Summary: In this deep dive, we strip away the marketing hype surrounding medical AI and confront the "black box" problem of Vision Language Models (VLMs) like GPT-4o. We examine a groundbreaking 2026 study published in European Radiology that tackles a terrifying clinical issue: these AI models are incredibly confident, articulate, and often completely wrong. We explore a clever new mathematical wrapper designed to catch the AI in a lie, forcing us to ask: how do we stop the AI from hallucinating with dangerous authority, and can we actually teach it to say "I don't know"?
In This Episode, We Cover:
• The Confident Liar Problem (The Baseline): Why generalist VLMs are fundamentally different from traditional, narrow medical AI. They are probabilistic engines designed to predict the next word, resulting in a dangerous baseline accuracy of just 51.7% on real-world clinical data—essentially a coin flip.
• The Mathematical Lie Detector (Discrete Semantic Entropy): How turning up the AI's "temperature" to 1.0 and asking the exact same question 15 times forces the model to brainstorm, revealing its hidden uncertainties.
• Semantic Clustering (Cutting through the Noise): If the AI says "pneumonia" and then "lung infection," human clinicians know it means the same thing. We discuss how the DSE algorithm groups these answers by their underlying clinical meaning to calculate whether the AI is confidently consistent (low entropy) or randomly guessing (high entropy). A toy version of this clustering-and-entropy calculation is sketched in the first code example after this list.
• The Coverage Cost vs. Accuracy Trade-Off: The dramatic results of applying a strict DSE filter. GPT-4o's accuracy jumped from roughly 51% to over 76%, but with a massive catch—it remained completely silent on over half the cases, answering only 47.3% of the clinical questions.
• The Danger Zone (Where AI Fails): Breaking down the performance across modalities. While the AI shone at identifying organs and surprisingly excelled at angiography, it completely fell flat on abnormality detection. On complex 3D CT scans, the filter had to reject over 90% of the questions because the model was fundamentally confused.
• The Trap of the "Confident Hallucination": Why DSE measures consistency, not truth. We explore the nightmare scenario where an AI stubbornly hallucinates the exact same lie 15 times in a row, slipping past the safety filter and creating a massive risk for "automation bias" among clinicians.
• Clinical Feasibility: The surprising practicality of running 15 parallel queries in a real hospital workflow. Because they run simultaneously via an API, the safety check takes only 6 seconds and costs roughly $0.72 per question. The second code sketch after this list shows the concurrency pattern behind that number.
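For the technically curious, here is a minimal Python sketch of the filtering idea discussed in the episode. It assumes the 15 sampled answers have already been collected; the SYNONYMS table and the 0.5 entropy threshold are illustrative stand-ins only, since the study clusters answers by clinical meaning with a proper semantic-equivalence check and tunes its own cutoff.

```python
# Minimal sketch of discrete semantic entropy (DSE) filtering over sampled answers.
# SYNONYMS and the threshold are toy assumptions, not the study's actual method.
import math
from collections import Counter

SYNONYMS = {"lung infection": "pneumonia"}  # toy semantic-equivalence table

def cluster_answers(answers):
    """Map each free-text answer onto a semantic cluster label."""
    return [SYNONYMS.get(a.strip().lower(), a.strip().lower()) for a in answers]

def discrete_semantic_entropy(answers):
    """H = sum over clusters of -(p_i * ln p_i), where p_i is the cluster's share."""
    counts = Counter(cluster_answers(answers))
    total = sum(counts.values())
    return sum(-(n / total) * math.log(n / total) for n in counts.values())

def filtered_answer(answers, threshold=0.5):
    """Return the majority-cluster answer if entropy is low, otherwise abstain."""
    if discrete_semantic_entropy(answers) > threshold:
        return None  # high entropy: the model is effectively guessing, stay silent
    return Counter(cluster_answers(answers)).most_common(1)[0][0]

# Consistent model: two phrasings of the same diagnosis collapse into one cluster.
consistent = ["Pneumonia"] * 9 + ["lung infection"] * 6
print(discrete_semantic_entropy(consistent))  # 0.0 -> filter passes "pneumonia"

# Guessing model: three unrelated answers spread across the 15 samples.
guessing = ["pneumonia"] * 5 + ["pleural effusion"] * 5 + ["normal"] * 5
print(discrete_semantic_entropy(guessing))    # ln(3) ~ 1.10 -> filter abstains (None)
```

The second example shows why clustering matters: without the synonym grouping, "pneumonia" versus "lung infection" would look like disagreement and wrongly inflate the entropy.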
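And a rough sketch of the feasibility point: because the 15 samples are independent API calls, they can be fired concurrently, so the wall-clock cost is roughly one round trip rather than fifteen. The query_vlm() function below is a hypothetical stub, not the study's actual client code.

```python
# Rough sketch of firing the 15 temperature-1.0 samples concurrently.
# query_vlm() is a hypothetical stub standing in for a real VLM API client.
from concurrent.futures import ThreadPoolExecutor

def query_vlm(question: str, image_path: str) -> str:
    """Stubbed single VLM call; swap in a real GPT-4o (or other) client here."""
    return "pneumonia"

def sample_answers(question: str, image_path: str, n: int = 15) -> list[str]:
    # All n calls run in parallel threads, so total latency is roughly that of
    # the slowest single call (about 6 seconds in the episode's numbers).
    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(query_vlm, question, image_path) for _ in range(n)]
        return [f.result() for f in futures]

answers = sample_answers("Is there a pneumothorax?", "chest_xray_001.png")
print(len(answers))  # 15 answers, ready for the entropy filter above
```

At roughly $0.72 per question, the quoted figure works out to just under five cents per individual call, which is the arithmetic behind calling the safety wrapper "surprisingly practical."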
Key Takeaway: Building safer AI might paradoxically risk creating riskier doctors. While Discrete Semantic Entropy successfully filters out the AI's digital noise and confusion—transforming a failing model into a somewhat reliable, albeit very quiet, assistant—it leaves us with a critical human factors challenge. If the system flawlessly cherry-picks the easy cases and stays silent on the hard ones, we must ensure our own diagnostic muscles don't atrophy from over-trusting the machine.