Establishing a reliable and iteratively refined robotic system is essential for real-world deployment. While Vision-Language-Action (VLA) models are widely recognized as the foundation model for such robotic deployment, their dependence on expert demonstrations hinders the crucial capabilities of correction and learning from failures. To mitigate this limitation, we introduce a Human-assisted Action Preference Optimization method named HAPO, designed to correct deployment failures and foster effective adaptation through preference alignment for VLA models. Our method begins with a human-robot collaboration framework that enables reliable failure correction and collects interaction trajectories through human intervention. These human-intervention trajectories are then used in an action preference optimization process, enabling VLA models to reduce the occurrence of failure actions while better adopting corrective actions. Specifically, we propose an adaptive reweighting algorithm to address the issues of irreversible interactions and token probability mismatch that arise when introducing preference optimization into VLA models, allowing the model to learn from binary desirability signals derived from interactions. By combining these modules, our human-assisted action preference optimization method ensures reliable deployment and effective learning from failure for VLA models. Experiments in simulation and real-world scenarios demonstrate the superior generalization and robustness of our framework across a variety of manipulation tasks.
Figure 1: Our pipeline for human-assisted action preference optimization.
Ensuring safe interactions in unconstrained environments while fostering continuous improvement is crucial for developing robust robotic manipulation systems in real-world scenarios. Benefiting from their capacity for generalizable reasoning and scalable learning, Vision-Language-Action (VLA) models have been widely recognized as the foundation model for such robotic deployment systems. However, they still fall short of the field-ready success rates required in unconstrained, unpredictable real-world environments. This discrepancy raises a key challenge: how can these still-maturing Vision-Language-Action models be integrated into practical scenarios?
To enable reliable deployment and stable learning from interactions, we propose a Human-assisted Action Preference Optimization method, named HAPO, for autoregressive VLA models. The method integrates two critical components: a human-robot collaboration framework for reliable deployment, and an action preference optimization process for iterative improvement of VLA models. As shown in Figure 1(a), the human-robot collaboration deployment framework allows real-time human interventions during policy execution, ensuring reliable task completion when the robot encounters challenging situations. To mitigate the proportion imbalance of corrective actions, we propose a balanced sampling method that provides proportional representation of the interaction data for subsequent VLA preference optimization. As shown in Figure 1(b), we introduce an action preference optimization process that fully leverages the sub-optimal interaction trajectories for stable VLA model optimization, helping the policy avoid failure actions and encouraging the adoption of corrective actions. Through iterative rounds of human-robot collaboration deployment and action preference optimization, our method continuously enhances the VLA model's capabilities via environment interaction, ensuring sustained improvements in performance and adaptability to dynamic downstream manipulation tasks.
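As a minimal sketch of how such balanced sampling could be realized, the snippet below draws training batches with fixed proportions of undesirable, policy-executed, and human-corrected steps. The function name `sample_balanced_batch`, the step dictionary format, and the default ratios are illustrative assumptions rather than the paper's exact implementation.

```python
import random
from collections import defaultdict

def sample_balanced_batch(steps, batch_size, ratios=None):
    """Draw a batch with fixed proportions of undesirable (c=0),
    policy-executed (c=1), and human-corrected (c=2) steps.

    `steps` is a list of dicts with keys "obs", "action", "c"; the default
    proportions below are illustrative, not values from the paper.
    """
    if ratios is None:
        ratios = {0: 0.25, 1: 0.5, 2: 0.25}

    by_label = defaultdict(list)
    for step in steps:
        by_label[step["c"]].append(step)

    batch = []
    for label, ratio in ratios.items():
        pool = by_label.get(label, [])
        if not pool:
            continue
        k = max(1, round(batch_size * ratio))
        # Sample with replacement so the rarer corrective steps are not exhausted.
        batch.extend(random.choices(pool, k=k))
    random.shuffle(batch)
    return batch[:batch_size]
```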
We first collect an expert demonstration dataset \(\mathcal{D}_e = \{\tau_e^i\}_{i=1}^{N}\), where each trajectory \(\tau_e^i = \{(o_t^i, a_t^i, c_t^i)\}_{t=1}^{T}\) consists of observation-action pairs with expert annotations. Here, \(c_t^i = 1\) indicates that action \(a_t^i\) is executed by a human expert. We employ behavior cloning to fine-tune the pretrained VLA model on these expert demonstrations, obtaining an initial base policy \(\pi_\theta^0\).
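For concreteness, the following is a sketch of a single behavior-cloning update for an autoregressive VLA that predicts discretized action tokens. The interfaces `vla_model(obs, instruction)` and `tokenize_action` are hypothetical placeholders standing in for the actual VLA architecture and action tokenizer.

```python
import torch
import torch.nn.functional as F

def behavior_cloning_step(vla_model, tokenize_action, batch, optimizer):
    """One behavior-cloning update on expert steps (c=1).

    `vla_model(obs, instruction)` is assumed to return per-step logits over
    the action-token vocabulary with shape (B, n_action_tokens, vocab_size);
    `tokenize_action` maps a continuous action to discrete token ids.
    """
    logits = vla_model(batch["obs"], batch["instruction"])                 # (B, T_a, V)
    targets = torch.stack([tokenize_action(a) for a in batch["action"]])   # (B, T_a)

    loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```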
During policy execution, human operators monitor the process and intervene when the robot encounters challenging scenarios. This allows us to collect interaction trajectories \(\mathcal{D}_h = \{\tau_h^i\}_{i=1}^{M}\), where \(c_t^i = 2\) denotes human-corrected actions and \(c_t^i = 1\) denotes policy-executed actions. We then re-label the trajectories by marking the actions in the \(K\) steps preceding each human intervention as undesirable (\(c_t^i = 0\)). Finally, we combine the expert demonstrations \(\mathcal{D}_e\) and the interaction dataset \(\mathcal{D}_h\) for robotic action preference optimization. The pipeline is illustrated in the Deployment function within Algorithm 1.
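A simplified sketch of this relabeling step is shown below; it marks the \(K\) policy-executed steps immediately preceding each human intervention as undesirable. The step dictionary layout is an assumption, and the full deployment loop is the Deployment function in Algorithm 1.

```python
def relabel_trajectory(traj, K):
    """Mark the K policy steps that immediately precede each human
    intervention as undesirable (c=0), as described above.

    `traj` is a list of step dicts with key "c":
      1 = policy-executed, 2 = human-corrected (0 = undesirable after relabeling).
    """
    intervention_starts = [
        t for t in range(len(traj))
        if traj[t]["c"] == 2 and (t == 0 or traj[t - 1]["c"] != 2)
    ]
    for t0 in intervention_starts:
        for t in range(max(0, t0 - K), t0):
            if traj[t]["c"] == 1:      # only relabel policy-executed steps
                traj[t]["c"] = 0
    return traj
```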
Although previous Reinforcement Learning from Human Feedback (RLHF) methods have proven effective for LLM fine-tuning, preference optimization of VLA models for robotic manipulation poses additional challenges, most notably irreversible interactions and the token probability mismatch between discrete token prediction and continuous action regression.
To address these fundamental challenges, we leverage Kahneman & Tversky's prospect theory as the theoretical foundation for preference alignment optimization with binary desirability signals. Building upon this framework, we propose an adaptive reweighting method specifically designed to bridge the critical gap between discrete token prediction and continuous action regression in robotic manipulation tasks.
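For reference, the following is a sketch of a KTO-style objective with binary desirability signals, adapted to observation-action pairs; the exact loss and hyperparameters used by HAPO are those of Algorithm 1, while \(\beta\), \(\lambda_D\), and \(\lambda_U\) here are generic weights from the prospect-theory-based formulation, with \(\lambda_{c_t}\) denoting \(\lambda_D\) or \(\lambda_U\) according to the desirability label.

\begin{align}
r_\theta(o_t, a_t) &= \log \frac{\pi_\theta(a_t \mid o_t)}{\pi_{\mathrm{ref}}(a_t \mid o_t)}, \qquad
z_{\mathrm{ref}} = \mathbb{E}_{o' \sim \mathcal{D}}\big[\beta\, \mathrm{KL}\big(\pi_\theta(\cdot \mid o') \,\|\, \pi_{\mathrm{ref}}(\cdot \mid o')\big)\big], \\
v(o_t, a_t) &=
\begin{cases}
\lambda_D\, \sigma\big(\beta\, r_\theta(o_t, a_t) - z_{\mathrm{ref}}\big), & c_t \in \{1, 2\} \ (\text{desirable}), \\
\lambda_U\, \sigma\big(z_{\mathrm{ref}} - \beta\, r_\theta(o_t, a_t)\big), & c_t = 0 \ (\text{undesirable}),
\end{cases} \\
\mathcal{L}(\pi_\theta; \pi_{\mathrm{ref}}) &= \mathbb{E}_{(o_t, a_t, c_t) \sim \mathcal{D}_e \cup \mathcal{D}_h}\big[\lambda_{c_t} - v(o_t, a_t)\big].
\end{align}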
Our adaptive reweighting approach guides the model to prioritize training samples that exhibit significant regression errors. The method first estimates the per-sample L1 loss \(l\) between the predicted and ground-truth continuous actions, then dynamically adjusts the sample weights during training to focus computational resources on the most challenging cases.
The mathematical formulation and implementation details of our adaptive reweighting method are comprehensively presented in the Optimization function within Algorithm 1.
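The snippet below is a minimal sketch of how such adaptive reweighting could be implemented for an autoregressive VLA: decode the predicted action tokens, measure the L1 error against the ground-truth continuous action, and use the normalized error as a per-sample weight on the preference loss. The `detokenize` interface and the mean-normalization scheme are assumptions for illustration; the paper's exact procedure is the Optimization function of Algorithm 1.

```python
import torch

def adaptive_sample_weights(pred_logits, target_actions, detokenize, eps=1e-6):
    """Compute per-sample weights from the L1 error of the decoded continuous
    actions, so samples with large regression error receive higher weight.

    `detokenize` maps greedy action-token ids back to continuous actions and
    stands in for the VLA's action de-tokenizer.
    """
    with torch.no_grad():
        token_ids = pred_logits.argmax(dim=-1)                # (B, T_a)
        pred_actions = detokenize(token_ids)                  # (B, action_dim)
        l1 = (pred_actions - target_actions).abs().mean(-1)   # (B,)
        weights = l1 / (l1.mean() + eps)                      # emphasize hard samples
    return weights

def reweighted_preference_loss(per_sample_loss, weights):
    """Scale each sample's preference-optimization loss by its adaptive weight."""
    return (weights * per_sample_loss).mean()
```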
Table 1: Comparison experiment results across 4 manipulation tasks in RoboMimic Simulation.
Compared with methods based on the behavior cloning objective, our method can fully leverage the sub-optimal interaction trajectories. Among the compared preference-learning-based methods, ours utilizes the KL divergence to estimate the mean margin between the updated model and the reference model, which not only enables more stable learning but also better preserves prior knowledge. Further, our method leverages adaptive reweighting to achieve more precise control over the importance weights of both positive and negative samples, delivering more notable performance improvements.
Figure 2: The demonstrations of novel scenarios.
Our objective is to develop a human-assisted action preference optimization method that facilitates continuous improvement, enabling performance gains in novel disruption scenarios while retaining the original task capabilities during model fine-tuning. We therefore evaluate the fine-tuned model on both the disruption scenarios and the original scenarios. The results demonstrate that our approach can effectively adapt to new disruption scenarios through adaptive reweighting.
Figure 3: Lifelong learning results.
Our method achieves superior performance compared to the baseline, demonstrating its ability to effectively leverage sub-optimal human intervention trajectories for iterative model improvement.
Figure 4: The real-world experiments.
For the real-world evaluation, we consider the challenging fine-grained manipulation task "Insert the square into the stick", shown in Figure 4(a), which requires the robot to grasp the square and precisely insert it onto the stick. As shown in Table 4, our method demonstrates robust adaptability to these downstream disruption scenarios. The results empirically validate the method's practical utility for real-world deployment in unstructured environments.
Table 4: The real-world experiments results.
In this work, we introduce the human-assisted action preference optimization method HAPO, which consists of two critical components: a human-robot collaboration framework for reliable deployment and an action preference optimization process with adaptive reweighting for stable VLA model optimization. Our method promotes continuous improvement during the deployment of VLA models. We hope it offers insights for efficient and effective VLA model adaptation on downstream manipulation tasks.