HRP: Human Affordances for Robotic Pre-Training

Published in Robotics: Science and Systems, 2024

*Equal advising
Carnegie Mellon University

TL;DR: We present a new human-affordance-based visual pre-training scheme that boosts the performance of any existing pre-trained visual representation!

HRP: Model


HRP fine-tunes a pre-trained encoder to predict three classes of human affordance labels via L2 regression. Specifically, given an input frame from the video stream, the network must predict future contact points, human hand poses, and the target object. These affordance labels are mined autonomously from a human video dataset using off-the-shelf vision detectors. HRP representations are then fine-tuned to solve downstream manipulation tasks via behavior cloning.
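As a concrete illustration, here is a minimal sketch of this affordance-prediction objective in PyTorch. The feature dimension, number of contact points, hand-pose parameterization, and object-box encoding are all assumptions chosen for illustration, not the authors' exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AffordanceModel(nn.Module):
    def __init__(self, encoder, feat_dim=768, n_contact=5, hand_dim=48, obj_dim=4):
        super().__init__()
        self.encoder = encoder                                   # pre-trained visual encoder
        self.contact_head = nn.Linear(feat_dim, n_contact * 2)   # future contact points (x, y)
        self.hand_head = nn.Linear(feat_dim, hand_dim)           # human hand pose parameters
        self.obj_head = nn.Linear(feat_dim, obj_dim)             # target-object box (x1, y1, x2, y2)

    def forward(self, frames):
        z = self.encoder(frames)                                 # (B, feat_dim) features
        return self.contact_head(z), self.hand_head(z), self.obj_head(z)

def hrp_loss(model, frames, contact_gt, hand_gt, obj_gt):
    # L2 regression on all three mined affordance targets.
    contact, hand, obj = model(frames)
    return (
        F.mse_loss(contact, contact_gt)
        + F.mse_loss(hand, hand_gt)
        + F.mse_loss(obj, obj_gt)
    )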


Policy Training


We present our policy training pipeline, which uses Behavior Cloning (BC) to train a policy π from optimal expert demonstrations. The image observation (o_t) is processed by our HRP representation into a latent vector z. The policy uses z to predict end-effector velocity actions (delta ee-pose/gripper), which are executed directly on the robot at test time.
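The sketch below shows this BC setup under assumed shapes (a 7-D action of 6-DoF delta ee-pose plus gripper, 768-D features, and a simple MLP head); it is an illustration of the pipeline described above, not the exact training code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BCPolicy(nn.Module):
    def __init__(self, hrp_encoder, feat_dim=768, action_dim=7):
        super().__init__()
        self.encoder = hrp_encoder                   # HRP-fine-tuned visual encoder
        self.policy = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),              # delta ee-pose + gripper command
        )

    def forward(self, obs_image):
        z = self.encoder(obs_image)                  # latent vector z from observation o_t
        return self.policy(z)

def bc_step(policy, optimizer, obs_batch, expert_actions):
    # Supervised regression of expert actions from image observations.
    pred = policy(obs_batch)
    loss = F.mse_loss(pred, expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()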


Robot Tasks

We benchmark HRP on 5 tasks across 3 robot morphologies.

Quantitative Results


We apply HRP to 6 different baseline representations and plot how it affects average performance across the toasting, pouring, and stacking tasks. We evaluate with two distinct cameras to test whether HRP representations are robust to view shifts. We find that HRP representations consistently and substantially outperform their vanilla baselines, and that this effect holds across both the front (left) and ego (right) cameras. In fact, our strongest representation, ImageNet + HRP, delivers SOTA performance on both views!


Ablations


We ablate HRP by fully fine-tuning the 6 baseline representations and compare their average performance against the standard (LayerNorm-only) HRP representations on the toasting, pouring, and stacking tasks; LayerNorm-only fine-tuning is almost always superior. We also drop each of the 3 losses in HRP and compare each ablated method's average performance against full HRP across the same tasks, using the Ego4D, ImageNet, and VC-1 base models. The object and hand losses are critical for good performance, while the contact loss significantly impacts only the Ego4D base model.
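For reference, a minimal sketch of LayerNorm-only fine-tuning, i.e., freezing every encoder parameter except the LayerNorm affine parameters; this is a hypothetical helper for illustration, not the authors' code.

import torch.nn as nn

def freeze_all_but_layernorm(model: nn.Module):
    # Freeze everything first.
    for param in model.parameters():
        param.requires_grad = False
    # Re-enable gradients only for LayerNorm weights and biases.
    for module in model.modules():
        if isinstance(module, nn.LayerNorm):
            for param in module.parameters():
                param.requires_grad = True
    # Return the trainable parameters to pass to the optimizer.
    return [p for p in model.parameters() if p.requires_grad]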


BibTeX

@inproceedings{srirama2024hrp,
    title     = {HRP: Human Affordances for Robotic Pre-Training},
    author    = {Mohan Kumar Srirama and Sudeep Dasari and Shikhar Bahl and Abhinav Gupta},
    booktitle = {Proceedings of Robotics: Science and Systems},
    address   = {Delft, Netherlands},
    year      = {2024},
}