In this episode of My Weird Prompts, hosts Herman and Corn dive into the cutting-edge landscape of 2026’s video-based multimodal AI. They explore how the industry moved beyond simple frame sampling to adopt spatial-temporal tokenization, which lets models treat time as a first-class dimension alongside space. The discussion covers the technical hurdles of real-time video-to-video interaction, including the use of Simultaneous Localization and Mapping (SLAM) for floor plan generation and speculative decoding to minimize latency. By examining the integration of Neural Radiance Fields (NeRFs) and native multimodality, Herman and Corn reveal how AI is finally crossing the uncanny valley to create digital avatars that are indistinguishable from reality.