Human-assisted Robotic Policy Refinement via
Action Preference Optimization

Wenke Xia1,3,4,*, Yichu Yang2, Hongtao Wu2, Xiao Ma2, Tao Kong2, Di Hu1,3,4,†,
1Gaoling School of Artificial Intelligence, Renmin University of China 2ByteDance Seed 3Beijing Key Laboratory of Research on Large Models and Intelligent Governance 4Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE
* Work done during an internship at ByteDance Seed. † Corresponding author.

Abstract

Establishing a reliable and iteratively refined robotic system is essential for real-world deployment. While Vision-Language-Action (VLA) models are widely recognized as the foundation models for such robotic deployment, their reliance on offline expert demonstrations critically limits their capacity for post-deployment refinement. To mitigate this limitation, we introduce Action Preference Optimization (APO), a method that refines VLA models via human-assisted preference alignment gathered through interaction with the environment. The method begins with a human-robot collaboration framework that enables reliable failure correction and interaction trajectory collection through human intervention. However, directly leveraging these interaction trajectories for preference optimization is non-trivial due to the challenges of irreversible robotic actions and token distribution mismatch. To solve this, APO proposes an adaptive reweighting algorithm with binary desirability signals derived from interaction, enabling VLA models to effectively suppress failure-prone actions while adopting corrective actions. Ultimately, APO equips VLA models with the crucial capability to learn from failure, paving the way for their iterative refinement and reliable deployment in dynamic environments. Experiments in simulation and real-world scenarios demonstrate the superior generalization and robustness of our human-assisted framework across a variety of manipulation tasks. We believe this work offers insights toward efficient and stable optimization of VLA models through human-robot collaboration.

Introduction


Figure 1: Our pipeline for action preference optimization.

Ensuring safe interaction in unconstrained environments while fostering continuous improvement is crucial for building robust robotic manipulation systems in real-world scenarios. Benefiting from their capacity for generalizable reasoning and scalable learning, Vision-Language-Action (VLA) models have been widely recognized as the foundation models for such robotic deployment systems. However, they still fall short of the field-ready success rates required in unconstrained, unpredictable real-world environments. This gap raises a key question: how can these still-maturing Vision-Language-Action models be integrated into practical scenarios?

To enable reliable deployment and stable learning from interaction, we propose Action Preference Optimization (APO) for autoregressive VLA models. The method integrates two critical components: a human-robot collaboration framework for reliable deployment, and an action preference optimization process for iterative improvement of VLA models. As shown in Figure 1(a), the human-robot collaboration deployment framework allows real-time human intervention during policy execution, ensuring reliable task completion when the robot encounters challenging situations. To mitigate the proportional imbalance of corrective actions, we propose a balanced sampling method that ensures corrective actions are adequately represented in the interaction data used for VLA preference optimization. As shown in Figure 1(b), we introduce the action preference optimization process to fully leverage the sub-optimal interaction trajectories for stable optimization of VLA models, suppressing failure-prone actions and encouraging the adoption of corrective actions. Through iterative cycles of human-robot collaborative deployment and action preference optimization, our method continuously enhances the VLA model's capabilities via environment interaction, ensuring sustained improvements in performance and adaptability to dynamic downstream manipulation tasks.

Method

Human-robot Collaboration Deployment

We first collect an expert demonstration dataset \(\mathcal{D}_e = \{\tau_e^i\}_{i=1}^{N}\), where each trajectory \(\tau_e^i = \{(o_t^i, a_t^i, c_t^i)\}_{t=1}^{T}\) consists of observation-action pairs with expert annotations \(c_t^i\). Here, \(c_t^i = 1\) indicates that action \(a_t^i\) is executed by a human expert. We fine-tune the pretrained VLA model on these expert demonstrations with behavior cloning, obtaining the initial base policy \(\pi_\theta^0\).

During policy execution, human operators monitor the process and intervene when the policy encounters challenging scenarios. This allows us to collect interaction trajectories \(\mathcal{D}_h = \{\tau_h^i\}_{i=1}^{M}\), where \(c_t^i = 2\) represents human-corrected actions and \(c_t^i = 1\) denotes policy-executed actions. We then re-label these trajectories by marking the actions in the \(K\) steps preceding each human intervention as undesirable (\(c_t^i = 0\)). Finally, we combine the expert demonstrations \(\mathcal{D}_e\) and the interaction dataset \(\mathcal{D}_h\) for robotic action preference optimization. The pipeline is illustrated in the Deployment function within Algorithm 1.
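As a concrete illustration of this relabeling step, the minimal sketch below marks the \(K\) policy actions preceding each human intervention as undesirable. The trajectory format, the function name relabel_trajectory, and the label constants are our own illustrative assumptions, not the released implementation.

from typing import Dict, List

# Label convention described above:
# 2 = human-corrected action, 1 = policy-executed action, 0 = undesirable action.
HUMAN, POLICY, UNDESIRABLE = 2, 1, 0

def relabel_trajectory(steps: List[Dict], k: int) -> List[Dict]:
    """Mark the k policy steps preceding each human intervention as undesirable."""
    intervention_starts = [
        t for t in range(len(steps))
        if steps[t]["label"] == HUMAN and (t == 0 or steps[t - 1]["label"] != HUMAN)
    ]
    for start in intervention_starts:
        for t in range(max(0, start - k), start):
            if steps[t]["label"] == POLICY:
                steps[t]["label"] = UNDESIRABLE
    return steps

# Example: a rollout with one human intervention at t = 4, 5.
traj = [{"obs": None, "action": None, "label": POLICY} for _ in range(6)]
traj[4]["label"] = traj[5]["label"] = HUMAN
relabel_trajectory(traj, k=2)
print([s["label"] for s in traj])  # [1, 1, 0, 0, 2, 2]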

Human-Robot Collaboration Demo

Action Preference Optimization

Although Reinforcement Learning from Human Feedback (RLHF) methods have proven effective for LLM fine-tuning, preference optimization for VLA models in robotic manipulation poses additional challenges:

  • The irreversible robotic manipulation process makes it challenging to acquire meaningful paired positive-negative actions under the same observational conditions.
  • Autoregressive VLAs map continuous robot actions to discrete tokens, so token probabilities do not reflect the magnitude of continuous action errors, complicating preference optimization over action token prediction.

To address these fundamental challenges, we leverage Kahneman & Tversky's prospect theory as the theoretical foundation for preference alignment optimization with binary desirability signals. Building upon this framework, we propose an adaptive reweighting method specifically designed to bridge the critical gap between discrete token prediction and continuous action regression in robotic manipulation tasks.
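For reference, the prospect-theory-based objective introduced by KTO, which learns from binary desirability labels rather than paired preferences, can be written as follows. We show it only as background for our formulation; the exact APO objective with adaptive reweighting is the one given in Algorithm 1. Here \(\pi_\theta\) is the policy being refined, \(\pi_{\mathrm{ref}}\) the frozen reference policy, \(\sigma\) the sigmoid, and \(\beta\), \(\lambda_D\), \(\lambda_U\) hyperparameters.

\[
r_\theta(x, y) = \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}, \qquad
z_0 = \mathrm{KL}\big(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big),
\]
\[
v(x, y) =
\begin{cases}
\lambda_D \, \sigma\big(\beta (r_\theta(x, y) - z_0)\big) & \text{if } y \text{ is desirable}, \\
\lambda_U \, \sigma\big(\beta (z_0 - r_\theta(x, y))\big) & \text{if } y \text{ is undesirable},
\end{cases}
\qquad
\mathcal{L}(\pi_\theta; \pi_{\mathrm{ref}}) = \mathbb{E}_{(x, y) \sim \mathcal{D}} \big[\lambda_y - v(x, y)\big].
\]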

Our adaptive reweighting approach guides the model to prioritize training samples with large regression errors. It first estimates the L1 loss \(l\) between the predicted and target continuous actions for each sample, then dynamically adjusts the sample weights during training to concentrate learning on the most challenging cases.

The mathematical formulation and implementation details of our adaptive reweighting method are presented in the Optimization function within Algorithm 1.
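To make the idea concrete, below is a minimal sketch of how such per-sample reweighting could be wired into a desirability-based loss. The function names, the softmax normalization, and the temperature are our own illustrative choices, not the released implementation; the exact scheme is the one in Algorithm 1.

import torch

def adaptive_weights(pred_actions: torch.Tensor,
                     target_actions: torch.Tensor,
                     temperature: float = 1.0) -> torch.Tensor:
    """Weight each sample by its continuous-action L1 error (detached),
    so poorly regressed samples contribute more to the update."""
    l1 = (pred_actions - target_actions).abs().mean(dim=-1).detach()  # per-sample L1 error
    # Softmax keeps the weights positive; rescale so they average to 1 over the batch.
    return torch.softmax(l1 / temperature, dim=0) * l1.numel()

def reweighted_preference_loss(per_sample_loss: torch.Tensor,
                               pred_actions: torch.Tensor,
                               target_actions: torch.Tensor) -> torch.Tensor:
    """per_sample_loss: the desirability-based loss term for each sample
    (e.g., a KTO-style term as sketched above)."""
    weights = adaptive_weights(pred_actions, target_actions)
    return (weights * per_sample_loss).mean()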

Algorithm 1: Our Human-assisted Action Preference Optimization Method

Experiment

Comparison Results


Table 1: Comparison experiment results across 4 manipulation tasks in RoboMimic Simulation.

Compared with methods based on the behavior cloning objective, our method fully leverages the sub-optimal interaction trajectories. Among the compared preference-learning methods, ours uses the KL divergence to estimate the mean margin between the updated model and the reference model, which enables more stable learning and better preserves prior knowledge. Further, the adaptive reweighting provides more precise control over the importance weights of positive and negative samples, yielding more notable performance improvements.

Generalization to Novel Scenarios


Figure 2: The demonstrations of novel scenarios.


Our objective is to develop a human-assisted action preference optimization method that facilitates continuous improvement, enhancing performance in novel disruption scenarios while retaining the original task capabilities during fine-tuning. We therefore evaluate the fine-tuned model on both the disruption scenarios and the original scenarios. The results show that our approach effectively adapts to new disruption scenarios through adaptive reweighting.

Lifelong Learning


Figure 3: Lifelong learning results.

Our method achieves superior performance compared to the baseline, demonstrating its ability to effectively leverage sub-optimal human intervention trajectories for iterative model improvement.

Real-world Experiments


Figure 4: The real-world experiments.

In this work, we conduct the challenging fine-grained manipulation task "Insert the square into the stick", shown in Figure 4(a), which requires the robot to grasp the square and precisely insert it into the stick. As shown in Table 4, our method demonstrates robust adaptability to these downstream disruption scenarios. The results empirically validate its practical utility for real-world deployment in unstructured environments.


Table 4: The real-world experiments results.

Experimental Results Demo

Citation

@article{xia2025robotic,
  title={Human-assisted Robotic Policy Refinement via Action Preference Optimization},
  author={Xia, Wenke and Yang, Yichu and Wu, Hongtao and Ma, Xiao and Kong, Tao and Hu, Di},
  journal={arXiv preprint arXiv:2506.07127},
  year={2025}
}