Teaching robots from human video is an important part of overcoming the “data gap” in robotics, but many of the details still need to be worked out. Homanga Bharadwaj tells us about two recent research papers, Gen2Act and SPIDER, which tackle different aspects of the problem:
Gen2Act uses a generative video model to create a reference video of how a task should be performed given a language prompt; a single general-purpose policy then “translates” that human video into robot motion.
However, Gen2Act has its limitations, particularly when it comes to dexterous, contact-rich tasks. That’s where SPIDER comes in: it uses human data together with physics simulation to train policies across many different humanoid and dexterous-hand embodiments and datasets.
Also of note is that this is our first episode with our new rotating co-host, Jiafei Duan. To learn more, watch Episode #57 of RoboPapers now, with Chris Paxton and Jiafei Duan!
Abstract for Gen2Act:
How can robot manipulation policies generalize to novel tasks involving unseen object types and new motions? In this paper, we provide a solution in terms of predicting motion information from web data through human video generation and conditioning a robot policy on the generated video. Instead of attempting to scale robot data collection which is expensive, we show how we can leverage video generation models trained on easily available web data, for enabling generalization. Our approach Gen2Act casts language-conditioned manipulation as zero-shot human video generation followed by execution with a single policy conditioned on the generated video. To train the policy, we use an order of magnitude less robot interaction data compared to what the video prediction model was trained on. Gen2Act doesn’t require fine-tuning the video model at all and we directly use a pre-trained model for generating human videos. Our results on diverse real-world scenarios show how Gen2Act enables manipulating unseen object types and performing novel motions for tasks not present in the robot data.
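To make the two-stage idea in the abstract concrete, here is a minimal sketch of a Gen2Act-style pipeline: a pre-trained video model generates a human video zero-shot from the scene and a language prompt, and a video-conditioned policy then produces robot actions. The function names, shapes, and stub implementations below are hypothetical placeholders, not the authors’ code or API.

```python
# Sketch of the pipeline described in the Gen2Act abstract (hypothetical stubs).
import numpy as np

def generate_human_video(scene_image: np.ndarray, prompt: str, num_frames: int = 16) -> np.ndarray:
    """Stand-in for a pre-trained text/image-to-video model, used zero-shot (no fine-tuning)."""
    return np.repeat(scene_image[None], num_frames, axis=0)  # dummy "video" of the scene

def video_conditioned_policy(generated_video: np.ndarray, observation: np.ndarray) -> np.ndarray:
    """Stand-in for the single manipulation policy conditioned on the generated video."""
    return np.zeros(7)  # dummy 7-DoF action

if __name__ == "__main__":
    scene = np.zeros((224, 224, 3), dtype=np.uint8)        # current camera image
    video = generate_human_video(scene, "put the cup on the shelf")
    action = video_conditioned_policy(video, scene)         # queried at each control step
    print(action.shape)
```

The key design point from the abstract is that only the small policy needs robot interaction data; the video model is used off the shelf, which is what lets the system generalize to unseen objects and motions.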
Abstract for SPIDER:
Learning dexterous and agile policies for humanoid and dexterous hand control requires large-scale demonstrations, but collecting robot-specific data is prohibitively expensive. In contrast, abundant human motion data is readily available from motion capture, videos, and virtual reality. Due to the embodiment gap and missing dynamic information like force and torque, these demonstrations cannot be directly executed on robots. We propose Scalable Physics-Informed DExterous Retargeting (SPIDER), a physics-based retargeting framework to transform and augment kinematic-only human demonstrations into dynamically feasible robot trajectories at scale. Our key insight is that human demonstrations should provide global task structure and objective, while large-scale physics-based sampling with curriculum-style virtual contact guidance should refine trajectories to ensure dynamical feasibility and correct contact sequences. SPIDER scales across 9 diverse humanoid/dexterous hand embodiments and 6 datasets, improving success rates by 18% compared to standard sampling while being 10× faster than reinforcement learning (RL) baselines, and enabling the generation of a 2.4M-frame dynamically feasible robot dataset for policy learning. By aligning human motion and robot feasibility at scale, SPIDER offers a general, embodiment-agnostic foundation for humanoid and dexterous hand control. As a universal retargeting method, SPIDER can work with diverse-quality data, including single RGB camera video, and can be applied to real robot deployment and other downstream learning methods like RL to enable efficient closed-loop policy learning.
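As a rough illustration of the retargeting idea in the abstract, the sketch below treats the kinematic human demonstration as a tracking objective and uses simple sampling-based optimization in a stand-in physics rollout, with a contact-guidance weight annealed over iterations in the spirit of the curriculum described. Everything here (the simulator stub, cost terms, schedule, and greedy best-of-N sampler) is a hypothetical simplification, not the SPIDER algorithm itself.

```python
# Hypothetical sketch of physics-based retargeting via sampling (not the SPIDER implementation).
import numpy as np

def simulate(actions: np.ndarray) -> np.ndarray:
    """Stand-in for a physics rollout returning the resulting robot trajectory."""
    return np.cumsum(actions, axis=0)  # dummy dynamics

def tracking_cost(rollout: np.ndarray, human_ref: np.ndarray, contact_weight: float) -> float:
    """Track the human reference; contact_weight stands in for virtual contact guidance."""
    return float(np.mean((rollout - human_ref) ** 2)) * (1.0 + contact_weight)

def retarget(human_ref: np.ndarray, iters: int = 50, samples: int = 256) -> np.ndarray:
    horizon, dof = human_ref.shape
    mean = np.zeros((horizon, dof))
    for it in range(iters):
        contact_weight = max(0.0, 1.0 - it / iters)      # curriculum: relax guidance over time
        noise = np.random.randn(samples, horizon, dof) * 0.1
        candidates = mean[None] + noise                   # sample perturbed action sequences
        costs = [tracking_cost(simulate(a), human_ref, contact_weight) for a in candidates]
        mean = candidates[int(np.argmin(costs))]          # keep the best sample (greedy variant)
    return mean

if __name__ == "__main__":
    reference = np.random.randn(100, 23)   # kinematic-only human demo, e.g. a 23-DoF hand/arm
    robot_traj = retarget(reference)
    print(robot_traj.shape)
```

The division of labor matches the abstract’s key insight: the human motion only supplies the objective, while the physics-based sampler is responsible for producing trajectories that are actually dynamically feasible on the target embodiment.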
Learn more:
Project Page for Gen2Act: https://homangab.github.io/gen2act/
arXiv for Gen2Act: https://arxiv.org/pdf/2409.16283
Project Page for SPIDER: https://jc-bao.github.io/spider-project/