Description

Learning a true world model for the human body means taking high-dimensional actions that represent the full body pose (the locations of the hands and feet, for example) and using them to predict the effects of each action. This would allow an unprecedented level of fidelity in simulating how each action changes the world, but this level of information is usually not available.

But with a new dataset from Meta, Yutong Bai and co-authors were able to train just such a world model, using detailed 3D capture of whole human bodies moving through different apartments and predicting the results of granular actions.

Watch Episode #36 of RoboPapers, co-hosted by Michael Cho and Chris Paxton, to find out more.

Abstract:

We train models to Predict Egocentric Video from human Actions (PEVA), given the past video and an action represented by the relative 3D body pose. By conditioning on kinematic pose trajectories, structured by the joint hierarchy of the body, our model learns to simulate how physical human actions shape the environment from a first-person point of view. We train an auto-regressive conditional diffusion transformer on Nymeria, a large-scale dataset of real-world egocentric video and body pose capture. We further design a hierarchical evaluation protocol with increasingly challenging tasks, enabling a comprehensive analysis of the model’s embodied prediction and control abilities. Our work represents an initial attempt to tackle the challenges of modeling complex real-world environments and embodied agent behaviors with video prediction from the perspective of a human.
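To make the conditioning scheme concrete, here is a minimal sketch of action-conditioned autoregressive rollout in the spirit of the paper: an action is a relative 3D body pose over the joint hierarchy, and each predicted frame is fed back into the context for the next step. The class names, joint count, and frame shape below are illustrative assumptions, not the authors' code or API.

```python
# Minimal sketch of action-conditioned autoregressive video prediction.
# All names, sizes, and shapes here are hypothetical illustrations.
import numpy as np

NUM_JOINTS = 22            # assumed size of the body's joint hierarchy
FRAME_SHAPE = (64, 64, 3)  # assumed low-resolution egocentric frame

class PoseAction:
    """An 'action' is the relative 3D body pose: per-joint deltas."""
    def __init__(self, joint_deltas: np.ndarray):
        assert joint_deltas.shape == (NUM_JOINTS, 3)
        self.joint_deltas = joint_deltas

class WorldModel:
    """Stand-in for a conditional video predictor (e.g. a diffusion transformer)."""
    def predict_next_frame(self, past_frames: list, action: PoseAction) -> np.ndarray:
        # A real model would denoise the next frame conditioned on the
        # past frames and the pose action; here we just return noise.
        return np.random.rand(*FRAME_SHAPE)

def rollout(model: WorldModel, context: list, actions: list) -> list:
    """Autoregressive rollout: each predicted frame re-enters the context."""
    frames = list(context)
    for action in actions:
        next_frame = model.predict_next_frame(frames, action)
        frames.append(next_frame)
    return frames[len(context):]
```

A real system would replace WorldModel with the trained conditional diffusion transformer and feed it pose trajectories of the kind captured in Nymeria.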

Project Site

arXiv

Thread on X


