Every home is different. That means a useful home robot must be capable of zero-shot manipulation across a wide range of tasks, which is a real challenge for robotics: many cutting-edge approaches require expert fine-tuning on a small set of in-domain data.
Humanoid company 1X has a solution: world models. The internet is filled with human videos, and video models trained on that data have achieved incredible performance. Why not leverage the semantic and spatial knowledge captured by those video models to tell robots like the 1X NEO what to do?
1X Director of Evaluations Daniel Ho joins us on RoboPapers to talk about the new work the company is doing in world models, why this is the future, and how to use video models to control a home robot to perform any task.
Watch Episode #61 of RoboPapers, with Michael Cho and Chris Paxton, now!
In their words, from the official 1X blog post:
Many robot foundation models today are vision-language-action models (VLAs), which take a pretrained VLM and add an output head to predict robot actions (PI0.6, Helix, Groot N1.5). VLMs benefit from internet-scale knowledge, but are trained on objectives that emphasize visual and semantic understanding over prediction of physical dynamics. Tens of thousands of hours of costly robot data are needed to teach a model how to solve tasks considered simple for a human. Additionally, auxiliary objectives are often used to further coax spatial reasoning of physical interactions (MolmoAct, Gemini-Robotics 1.5).
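To make the VLA recipe described above concrete, here is a minimal, purely illustrative PyTorch sketch of that pattern: a frozen pretrained VLM backbone with a small action head added on top. All names (ToyVLA, DummyVLM), dimensions, and the action parameterization are hypothetical; this is not the actual architecture of PI0.6, Helix, or Groot N1.5.

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained VLM backbone. In practice this would be a large
# internet-pretrained vision-language model; here it is a random projection
# purely for illustration.
class DummyVLM(nn.Module):
    def __init__(self, embed_dim: int = 768, vocab_size: int = 32000):
        super().__init__()
        self.vision = nn.Linear(3 * 224 * 224, embed_dim)
        self.text = nn.Embedding(vocab_size, embed_dim)

    def forward(self, image: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        img_feat = self.vision(image.flatten(1))   # (batch, embed_dim)
        txt_feat = self.text(tokens).mean(dim=1)   # (batch, embed_dim)
        return img_feat + txt_feat                 # fused vision-language features


class ToyVLA(nn.Module):
    """A pretrained VLM backbone (frozen) plus a small action head
    trained on robot data to predict continuous actions."""

    def __init__(self, backbone: nn.Module, embed_dim: int = 768, action_dim: int = 7):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():       # keep the internet-scale knowledge frozen
            p.requires_grad = False
        self.action_head = nn.Sequential(          # new output head for robot actions
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),            # e.g. 6-DoF end-effector delta + gripper
        )

    def forward(self, image: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        features = self.backbone(image, tokens)    # (batch, embed_dim)
        return self.action_head(features)          # (batch, action_dim)


# Usage: one camera frame and a tokenized language instruction in,
# one predicted action vector out.
vla = ToyVLA(DummyVLM())
image = torch.randn(1, 3, 224, 224)
tokens = torch.randint(0, 32000, (1, 16))
action = vla(image, tokens)                        # shape: (1, 7)
```

The frozen backbone carries the semantic knowledge; everything about physical interaction has to be learned by the new head from robot data, which is exactly the limitation the 1X post is pointing at.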
Learn more:
Project Page: https://www.1x.tech/discover/world-model-self-learning