Description

Vision-language model that adds generative depth prediction during pre-training for physical grounding; achieves state-of-the-art results on embodied benchmarks and transfers directly to real-robot tasks.