When would Vision-Proprioception Policies Fail in Robotic Manipulation?

International Conference on Learning Representations (ICLR) 2026

Jingxian Lu1,2,*, Wenke Xia1,2,*, Yuxuan Wu3, Zhiwu Lu1,2, Di Hu1,2,✉
1Gaoling School of Artificial Intelligence, Renmin University of China
2Beijing Key Laboratory of Research on Large Models and Intelligent Governance
3School of Artificial Intelligence, Beihang University
* Equal contribution, ✉ Corresponding author

Video

Abstract

In this work, we find that the vision modality of the vision-proprioception policy plays a limited role during task sub-phases in which the robot's motion transitions. Further analysis reveals that the policy naturally gravitates toward the concise proprioceptive signals that offer faster loss reduction during training; these signals dominate the optimization and suppress the learning of the visual modality in motion-transition phases. To alleviate this, we propose the Gradient Adjustment with Phase-guidance (GAP) algorithm, which yields robust and generalizable vision-proprioception policies. Comprehensive experiments demonstrate that GAP is applicable in both simulated and real-world environments, across single-arm and dual-arm setups, and compatible with both conventional and Vision-Language-Action models. We believe this work offers valuable insights into the development of vision-proprioception policies for robotic manipulation.

Introduction

Figure 1: Generalization of vision-proprioception policies. (left) Vision-proprioception policies perform 15.8% worse than vision-only policies. (right) We investigate this by intervening in the task execution of the vision-only policy during different phases, switching it to the vision-proprioception policy. Such intervention has minimal impact during motion-consistent phases such as "move forward". However, during motion-transition phases such as "locate base" and "assemble them", switching leads to noticeable degradation, indicating that the vision modality fails to take effect during these phases.
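
For concreteness, the intervention protocol of Figure 1 (right) can be sketched as below. This is a minimal illustration assuming a generic environment/policy interface; the names (`env`, `phase_fn`, `vp_policy`) are hypothetical placeholders, not the actual evaluation code.

```python
def rollout_with_intervention(env, vision_policy, vp_policy, phase_fn,
                              target_phase, max_steps=400):
    """Run the vision-only policy, but hand control to the vision-proprioception
    policy whenever the rollout is inside `target_phase`."""
    obs = env.reset()
    for t in range(max_steps):
        phase = phase_fn(obs, t)  # e.g. "locate base", "move forward", "assemble them"
        if phase == target_phase:
            action = vp_policy(obs)      # intervened steps: vision + proprioception
        else:
            action = vision_policy(obs)  # default steps: vision only
        obs, done, success = env.step(action)
        if done:
            return success
    return False
```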

Figure 1 (right) suggests that the vision modality of the vision-proprioception policy fails to take effect during motion-transition phases. To alleviate this, we propose the Gradient Adjustment with Phase-guidance (GAP) algorithm that adaptively modulates the optimization of proprioception, enabling dynamic collaboration between vision and proprioception.
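
A minimal sketch of this idea is given below. It assumes a two-encoder policy (`vision_encoder`, `proprio_encoder`, `head`), per-sample phase labels in `batch["is_transition"]`, and a simple per-sample scaling rule; it illustrates phase-guided down-weighting of proprioception gradients rather than the exact GAP update.

```python
import torch
import torch.nn.functional as F

def train_step(policy, batch, optimizer, transition_scale=0.1):
    """One behavior-cloning step with phase-guided down-weighting of
    proprioception gradients (illustrative sketch)."""
    optimizer.zero_grad()

    vis_feat = policy.vision_encoder(batch["rgb"])        # (B, Dv)
    prop_feat = policy.proprio_encoder(batch["proprio"])  # (B, Dp)

    # scale[i] < 1 for samples drawn from motion-transition phases
    # (e.g. "locate base", "assemble them"), and 1 elsewhere.
    is_transition = batch["is_transition"].float().unsqueeze(-1)  # (B, 1)
    scale = 1.0 - (1.0 - transition_scale) * is_transition

    # Rescale only the gradient flowing into the proprioception encoder:
    # x * s + x.detach() * (1 - s) keeps the forward value of x but multiplies
    # its gradient by s, so the fast-to-fit proprioceptive signal cannot
    # dominate the update during motion-transition phases.
    prop_feat = prop_feat * scale + prop_feat.detach() * (1.0 - scale)

    pred = policy.head(torch.cat([vis_feat, prop_feat], dim=-1))
    loss = F.mse_loss(pred, batch["action"])
    loss.backward()
    optimizer.step()
    return loss.item()
```

The detach trick rescales only the gradients flowing into the proprioception encoder while leaving the forward prediction unchanged, so the visual branch keeps receiving a full learning signal during motion-transition phases.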

Figure 2: The pipeline of Gradient Adjustment with Phase-guidance (GAP) algorithm.

Experiments

We validate the versatility and effectiveness of our proposed GAP algorithm. The evaluations cover a wide range of manipulation tasks, including articulated-object manipulation, rotation-sensitive tasks, and long-horizon, contact-rich tasks.

We compare our algorithm with the following baselines:

  • MS-Bot: this method uses state tokens carrying stage information to guide the dynamic collaboration of modalities within a multi-modality policy.
  • Auxiliary Loss (Aux): following HumanPlus, we use visual features to predict future frames as an auxiliary loss, aiming to strengthen the vision modality.
  • Mask: to prevent overfitting to a specific modality, RDT-1B randomly and independently masks each uni-modal input with a certain probability during encoding. We adapt this algorithm by masking only the proprioception modality (see the sketch after this list).
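
As a concrete reference for the Mask baseline, a minimal sketch of proprioception-only masking is shown below; the masking probability and the zero-fill choice are illustrative assumptions rather than RDT-1B's exact recipe.

```python
import torch

def mask_proprio(proprio, p_mask=0.3, training=True):
    """Randomly drop the proprioceptive input of each sample during training,
    so the policy cannot overfit to proprioception alone."""
    if not training:
        return proprio
    keep = (torch.rand(proprio.shape[0], 1, device=proprio.device) > p_mask).float()
    return proprio * keep  # masked samples see an all-zero proprioceptive vector
```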
Comparative Results
Table 1: Comparison with other methods in both simulated and real-world environments. The average success rate and standard deviation of the simulation results are computed over 5 seeds. Vision-proprioception policies trained with our gradient adjustment significantly outperform the other methods.

Demonstration videos: handover (4x), put thermos into bag (4x), press button, cube, use rag to sweep table (4x).

Conclusion

In this work, we illustrate that the vision modality of the vision-proprioception policy plays a limited role during motion-transition phases, because its learning is suppressed by the faster-optimized proprioceptive modality. To alleviate this, we propose the Gradient Adjustment with Phase-guidance (GAP) algorithm, which enables dynamic collaboration between vision and proprioception within the vision-proprioception policy. We believe this work can offer valuable insights into the development of vision-proprioception policies for robotic manipulation.