Twenty Ways To Not Trust An Agent

Description

Hosts: Lenar Kess, Damra Vol.

One morning's arXiv listing dropped close to twenty agent papers, and almost none of them are about making agents more capable. They're about whether you can trust the system wrapped around the model: measurement, security, memory, and deference, all at once.

Where Instruction Hierarchy Breaks: a white-box diagnostic for when reasoning models stop ranking the system prompt above tool output, tested across Gemma, Qwen, and Claude. If the repair holds, prompt injection becomes something you can fix structurally, not just filter.

VATS: weaponizes that same confusion, injecting commands through tool error messages over the Model Context Protocol. The error path is the door most teams never locked.

Shared Latent Structures for Backdoors: argues that jailbreaks, bias, and planted triggers share an internal signature that sparse autoencoders can catch.

Beyond Goodhart's Law (MAC-Bench), Online Agent-as-a-Judge, and PACE: three attempts to keep evaluation honest when the thing you're testing can learn the test.

The AI Epistemic Deference Index: finally puts a continuous number on sycophancy, alongside a paired reward-bias paper on personalization manufacturing it.

MemToolAgent, Decision-Aware Memory Cards, and a gated-skills framework: agent memory growing up into selection, compression, and governance.

Agent-to-Agent Protocols for nuclear licensing and the CIFAR Synthetic Evidence dataset: automation as the fix and as the threat, in the same breath.

Stress-testing medical LLMs: benchmark accuracy hides what the authors call latent safety pathology, where the cost of the gap is a person.
