Ep#53: Semantic World Models

Description

World models, action-conditioned predictive models of the environment, are an exciting area of research for robots, useful both for training and for test-time compute. But video-based world models spend much of their predictive capacity on reconstructing pixels, which raises model and data requirements and limits how far into the future their predictions remain viable.

Instead, what if we learned a purely semantic world model, one that predicts which properties of the world will be true after a sequence of actions, without reconstructing whole images? Jacob Berg tells us more.

Watch Episode #53 of RoboPapers now, with Michael Cho and Chris Paxton!

Abstract:

Planning with world models offers a powerful paradigm for robotic control. Conventional approaches train a model to predict future frames conditioned on current frames and actions, which can then be used for planning. However, the objective of predicting future pixels is often at odds with the actual planning objective; strong pixel reconstruction does not always correlate with good planning decisions. This paper posits that instead of reconstructing future frames as pixels, world models only need to predict task-relevant semantic information about the future. For such prediction the paper poses world modeling as a visual question answering problem about semantic information in future frames. This perspective allows world modeling to be approached with the same tools underlying vision language models. Thus vision language models can be trained as “semantic” world models through a supervised finetuning process on image-action-text data, enabling planning for decision-making while inheriting many of the generalization and robustness properties from the pretrained vision-language models. The paper demonstrates how such a semantic world model can be used for policy improvement on open-ended robotics tasks, leading to significant generalization improvements over typical paradigms of reconstruction-based action-conditional world modeling.
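
To make the idea in the abstract concrete, here is a minimal sketch of planning with a model that answers questions about the future instead of predicting pixels. Everything in it is an assumption for illustration: `toy_answer_prob` is a hand-written stand-in for a finetuned vision-language model, the 1-D "block position" replaces real images, and the random-shooting planner is just one simple choice, not necessarily the planner used in the paper.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

# (question, desired answer) pairs that describe task success in language.
Goal = Tuple[str, str]
# Hypothetical model interface: P(answer | observation, actions, question).
AnswerProb = Callable[[float, Sequence[float], str, str], float]


def toy_answer_prob(obs: float, actions: Sequence[float],
                    question: str, answer: str) -> float:
    """Stand-in for a finetuned VLM; NOT the paper's model.

    The 'observation' is a 1-D block position, actions are displacements,
    and the only question it understands is whether the block ends near x = 1.0.
    """
    final_pos = obs + sum(actions)
    p_yes = math.exp(-4.0 * abs(final_pos - 1.0))  # in (0, 1], peaks at the goal
    return p_yes if answer == "yes" else 1.0 - p_yes


def score(model: AnswerProb, obs: float, actions: Sequence[float],
          goals: Sequence[Goal]) -> float:
    """Sum of log-probabilities of the desired answers about the future."""
    eps = 1e-9  # avoids log(0)
    return sum(math.log(model(obs, actions, q, a) + eps) for q, a in goals)


def plan(model: AnswerProb, obs: float, goals: Sequence[Goal],
         horizon: int = 3, num_samples: int = 256, seed: int = 0) -> List[float]:
    """Random shooting: sample action sequences, keep the best-scoring one."""
    rng = random.Random(seed)
    candidates = [[rng.uniform(-0.5, 0.5) for _ in range(horizon)]
                  for _ in range(num_samples)]
    return max(candidates, key=lambda acts: score(model, obs, acts, goals))


if __name__ == "__main__":
    goals = [("Is the block at the goal?", "yes")]
    best = plan(toy_answer_prob, obs=0.0, goals=goals)
    print("planned actions:", [round(a, 3) for a in best])
    print("predicted final position:", round(sum(best), 3))
```

The point of the sketch is the interface: the planner never sees a predicted image, only the probability that each question about the future gets the desired answer, so the task can be specified entirely in language.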

Project Page: https://weirdlabuw.github.io/swm/

arXiv: https://arxiv.org/abs/2510.19818
