When the Harness Carries the Model

Description

Hosts: Lenar Kess, Damra Vol. An open-weights model that fumbles tool calls on its own can go toe to toe with a frontier closed model — once you wrap the right error-handling around it. That gap, between what a model scores and what it does inside your repo, runs through everything we covered today.Ahmad Awais on Latent Space describes "tool confusion" — open models repeating the same invalid tool call roughly fifty-six times per billion tokens — and Command Code's deterministic repair layer that patches malformed output instead of arguing with the model. The claim that reframes the day: the harness, not the weights, decides whether a cheap model is usable.DeepSeek V4 Flash support in llama.cpp (PR #24162) makes the same model runnable locally — but the repair layer that makes it pleasant stays behind Command Code's API. Access to weights isn't access to the experience.Knowledge Activation (Bakal et al.) argues AI skills should be the institutional-knowledge unit for agentic development; Mutation Without Variation warns that repeated LLM edits converge rather than diverge — together a hint that skill files plus a converging model could homogenize a codebase.Agents' Last Exam, SentinelBench, and Stability vs. Manipulability in LLM judges all poke at the same wound: our scores have drifted from the work, especially for long-running and judge-graded evaluation.Anthropic's "When AI builds itself" (via a thin Reddit summary) claims AI is accelerating its own development; a zero-knowledge verification paper offers a cryptographic path to actually check claims like that — and the pause proposals that depend on verification.The Washington Post (Elizabeth Dwoskin), via Techmeme, reports an FDA fast track for digital health tech including AI chatbots — the same model behavior that costs a retry in coding costs a patient in a clinic.

Listen

Description

Want to check another podcast?