
Ep#5: R+X: Retrieval and Execution from Everyday Human Videos

With Georgios Papagiannis and Norman Di Palo

Human data is much more plentiful than robot data, and humans already know how to perform so many tasks. Teaching robots from human videos, then, has a ton of potential.

We present R+X, a framework which enables robots to learn skills from long, unlabelled, first-person videos of humans performing everyday tasks. Given a language command from a human, R+X first retrieves short video clips containing relevant behaviour, and then executes the skill by conditioning an in-context imitation learning method (KAT) on this behaviour. By leveraging a Vision Language Model (VLM) for retrieval, R+X does not require any manual annotation of the videos, and by leveraging in-context learning for execution, robots can perform commanded skills immediately, without requiring a period of training on the retrieved videos. Experiments studying a range of everyday household tasks show that R+X succeeds at translating unlabelled human videos into robust robot skills, and that R+X outperforms several recent alternative methods. Videos and code are available at this https URL.
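To make the two-stage pipeline described above concrete, here is a minimal Python sketch of the retrieve-then-execute idea. This is not the authors' implementation: the function names (`retrieve_clips`, `execute_with_kat`), the `VideoClip` structure, and the keyword-overlap "retrieval" are hypothetical placeholders standing in for the VLM-based retrieval and the KAT in-context imitation learner used in the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VideoClip:
    """A short segment of a long, unlabelled first-person human video."""
    frames: list       # placeholder for raw frames / extracted hand keypoints
    description: str   # what the (hypothetical) VLM judged the clip to show


def retrieve_clips(command: str,
                   long_videos: List[List[VideoClip]],
                   top_k: int = 5) -> List[VideoClip]:
    """Stage 1 (sketch): find clips whose behaviour is relevant to the
    language command. The real system queries a VLM; here we fake it with
    simple word-overlap scoring so the example runs end to end."""
    candidates = [clip for video in long_videos for clip in video]
    cmd_words = set(command.lower().split())

    def overlap(clip: VideoClip) -> int:
        return len(cmd_words & set(clip.description.lower().split()))

    ranked = sorted(candidates, key=overlap, reverse=True)
    return [c for c in ranked[:top_k] if overlap(c) > 0]


def execute_with_kat(command: str, demos: List[VideoClip]) -> str:
    """Stage 2 (sketch): condition an in-context imitation learner (KAT in
    the paper) on the retrieved clips and act immediately, with no further
    training. Here we just report what would be executed."""
    return (f"Executing '{command}' conditioned on "
            f"{len(demos)} retrieved demonstration clip(s).")


if __name__ == "__main__":
    # Toy 'dataset': one long video, pre-segmented into short clips.
    video = [
        VideoClip(frames=[], description="person opens a drawer"),
        VideoClip(frames=[], description="person picks up a mug"),
        VideoClip(frames=[], description="person wipes the table"),
    ]
    clips = retrieve_clips("pick up a mug", [video])
    print(execute_with_kat("pick up a mug", clips))
```

The key property the sketch tries to capture is that no manual labels and no extra training are needed at command time: retrieval selects the relevant demonstrations, and execution conditions on them in context.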

Find the paper on arXiv.
