Vision-language model that adds generative depth prediction during pre-training for physical grounding; achieves SOTA on embodied benchmarks and transfers directly to real-robot tasks.