Video
Abstract
In this work, we find that during task sub-phases in which the robot's motion transitions (motion-transition phases), the vision modality of a vision-proprioception policy plays a limited role. Further analysis reveals that the policy naturally gravitates toward concise proprioceptive signals, which offer faster loss reduction during training, thereby dominating the optimization and suppressing the learning of the visual modality in these phases. To alleviate this, we propose the Gradient Adjustment with Phase-guidance (GAP) algorithm, which yields robust and generalizable vision-proprioception policies. Comprehensive experiments demonstrate that GAP is applicable in both simulated and real-world environments, across single-arm and dual-arm setups, and is compatible with both conventional policies and Vision-Language-Action models. We believe this work offers valuable insights into the development of vision-proprioception policies for robotic manipulation.
Introduction
Figure 1 (right) suggests that the vision modality of the vision-proprioception policy fails to take effect during motion-transition phases. To alleviate this, we propose the Gradient Adjustment with Phase-guidance (GAP) algorithm, which adaptively modulates the optimization of the proprioceptive branch, enabling dynamic collaboration between vision and proprioception. A minimal sketch of this idea is given below.
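To make the idea concrete, here is a minimal PyTorch-style sketch of one plausible instantiation of phase-guided gradient adjustment. The class names, the scalar `transition_alpha`, and the use of per-sample gradient scaling on the proprioceptive features are our illustrative assumptions, not the exact formulation of GAP:

```python
import torch
from torch import nn


class ScaleGrad(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by `alpha` in backward."""

    @staticmethod
    def forward(ctx, x, alpha):
        ctx.save_for_backward(alpha)
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        (alpha,) = ctx.saved_tensors
        return grad_out * alpha, None


class PhaseGuidedPolicy(nn.Module):
    """Hypothetical vision-proprioception policy that damps proprioceptive
    gradients during annotated motion-transition phases."""

    def __init__(self, vision_enc, proprio_enc, head, transition_alpha=0.1):
        super().__init__()
        self.vision_enc = vision_enc      # e.g. an image encoder (ResNet, ViT)
        self.proprio_enc = proprio_enc    # e.g. an MLP over joint states
        self.head = head                  # action-prediction head
        self.transition_alpha = transition_alpha  # gradient scale in transition phases

    def forward(self, image, proprio, in_transition):
        # in_transition: (B,) bool mask marking motion-transition timesteps
        v = self.vision_enc(image)        # (B, Dv)
        p = self.proprio_enc(proprio)     # (B, Dp)
        # Per-sample gradient scale: 1.0 in normal phases, alpha < 1 in
        # transition phases, so proprioception cannot dominate the update there.
        alpha = torch.ones(p.shape[0], 1, dtype=p.dtype, device=p.device)
        alpha[in_transition] = self.transition_alpha
        p = ScaleGrad.apply(p, alpha)
        return self.head(torch.cat([v, p], dim=-1))
```

The forward computation is untouched, so inference behavior is identical; only the relative gradient magnitudes of the two branches change during training, which is what lets the visual modality keep learning through the phases where it would otherwise be suppressed.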
Experiments
We validate the versatility and effectiveness of the proposed GAP algorithm. The evaluations cover a wide range of manipulation tasks, including articulated-object manipulation, rotation-sensitive tasks, and long-horizon, contact-rich tasks.
We compare our algorithm against the following baselines:
- MS-Bot: this method uses state tokens carrying stage information to guide the dynamic collaboration of modalities within a multi-modal policy.
- Auxiliary Loss (Aux): following HumanPlus, we use visual features to predict future frames as an auxiliary loss, which aims to strengthen the vision modality.
- Mask: to prevent overfitting to a specific modality, RDT-1B randomly and independently masks each uni-modal input with a certain probability during encoding. We adapt this algorithm by masking only the proprioception modality (a sketch follows this list).
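Below is a minimal sketch of our adapted masking baseline. The probability `p_mask=0.1`, the learned mask embedding, and per-sample (rather than per-timestep) masking are illustrative assumptions rather than the exact RDT-1B recipe:

```python
import torch
from torch import nn


class ProprioMask(nn.Module):
    """Adapted masking baseline: during training, each sample's proprioceptive
    input is independently replaced by a learned mask embedding with
    probability `p_mask`; the vision input is never masked."""

    def __init__(self, proprio_dim, p_mask=0.1):
        super().__init__()
        self.p_mask = p_mask
        self.mask_token = nn.Parameter(torch.zeros(proprio_dim))

    def forward(self, proprio):                       # proprio: (B, D)
        if not self.training:                         # no masking at test time
            return proprio
        drop = torch.rand(proprio.shape[0], 1, device=proprio.device) < self.p_mask
        return torch.where(drop, self.mask_token.expand_as(proprio), proprio)
```

Unlike GAP, this intervention alters the inputs themselves rather than the gradients, so it discourages reliance on proprioception uniformly instead of adaptively per phase.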
[Real-robot demonstration videos: handover (4x), put thermos into bag (4x), press button, cube, use rag to sweep table (4x)]
Conclusion
In this work, we illustrate that the vision modality of the vision-proprioception policy plays a limited role during motion-transition phases due to suppression. To alleviate this, we propose the Gradient Adjustment with Phase-guidance (GAP) algorithm, enabling dynamic collaboration between vision and proprioception within vision-proprioception policy. We believe this work can offer valuable insights into the development of vision-proprioception policies for robotic manipulation.