Language models can quote the bicycle manual and still miss a broken chain. *Beyond Language Modeling: An Exploration of Multimodal Pretraining* argues that this is structural, not incidental: text is a lossy compression of reality, and models trained only on text master the description of shadows without seeing the objects casting them. The paper runs controlled, from-scratch pretraining experiments using the Transfusion framework, which combines next-token prediction for language with diffusion for vision, across text, image-text pairs, video, and action-conditioned video. The result is four concrete design insights for multimodal architecture, delivered without the confound of inherited language pretraining.
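
To make "next-token prediction for language plus diffusion for vision" concrete at the objective level, here is a minimal sketch of a Transfusion-style joint loss. The tensor names, shapes, and the weighting factor `lambda_diffusion` are illustrative assumptions for this sketch, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def transfusion_style_loss(text_logits, text_targets,
                           pred_noise, true_noise,
                           lambda_diffusion=1.0):
    """Joint objective in the spirit of Transfusion.

    Text tokens are trained with standard next-token cross-entropy;
    image latents are trained with a denoising (noise-prediction) MSE
    loss, as in typical diffusion training. The balancing coefficient
    `lambda_diffusion` is an assumed placeholder.
    """
    # Language loss: next-token prediction over the vocabulary.
    lm_loss = F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)),
        text_targets.reshape(-1),
    )
    # Vision loss: predict the noise that was added to image latents.
    diffusion_loss = F.mse_loss(pred_noise, true_noise)
    return lm_loss + lambda_diffusion * diffusion_loss
```

Because both terms are computed from one shared model in a single forward pass, the same backbone learns to describe the world in text and to reconstruct how it looks, which is exactly the coupling the paper's controlled experiments probe.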