3D perception ability is crucial for generalizable robotic manipulation. While recent foundation models have made significant strides in perception and decision-making with RGB-based input, their lack of 3D perception limits their effectiveness in fine-grained robotic manipulation tasks. To address these limitations, we propose a Depth Information Injection (\( \rm{DI}^{2} \)) framework that leverages the RGB-Depth modality for policy fine-tuning while relying solely on RGB images for robust and efficient deployment. Concretely, we introduce the Depth Completion Module (DCM) to extract spatial prior knowledge related to depth information and generate virtual depth information from RGB inputs to aid policy deployment. Further, we propose the Depth-Aware Codebook (DAC) to eliminate noise and reduce the cumulative error of depth prediction. In the inference phase, the framework employs both the RGB inputs and the predicted depth data to generate manipulation actions. We conduct experiments in simulated LIBERO environments and in real-world scenarios, and the experimental results demonstrate that our method can effectively enhance a pre-trained RGB-based policy with 3D perception ability for robotic manipulation.
When the model takes RGB-D as input during training, the entire model can be expressed as follows: $$ \begin{equation} \begin{aligned} a_t &= \pi \left({\rm VLM}\left(f^{text},f_t^{rgb},f_t^{depth}\right)\right), \end{aligned} \end{equation} $$ where \( \pi \) is the policy model that produces the final action. We utilize the Vision-Language Model (VLM) to extract and fuse features, where \( f^{text} \), \( f_t^{rgb} \), and \( f_t^{depth} \) are the extracted text, RGB, and depth features at timestep \( t \), respectively.
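To make the training-time path concrete, a minimal PyTorch-style sketch is given below; the module names (`text_encoder`, `rgb_encoder`, `depth_encoder`, `vlm`, `policy`) are illustrative assumptions, not the authors' implementation.

```python
import torch.nn as nn

class RGBDTrainingPolicy(nn.Module):
    """Sketch of the RGB-D training path: a_t = pi(VLM(f_text, f_rgb, f_depth))."""

    def __init__(self, text_encoder, rgb_encoder, depth_encoder, vlm, policy):
        super().__init__()
        self.text_encoder = text_encoder    # produces f^text
        self.rgb_encoder = rgb_encoder      # produces f_t^rgb
        self.depth_encoder = depth_encoder  # depth branch, produces f_t^depth
        self.vlm = vlm                      # fuses the three feature streams
        self.policy = policy                # maps fused features to an action a_t

    def forward(self, text_tokens, rgb_t, depth_t):
        f_text = self.text_encoder(text_tokens)
        f_rgb = self.rgb_encoder(rgb_t)
        f_depth = self.depth_encoder(depth_t)
        fused = self.vlm(f_text, f_rgb, f_depth)
        return self.policy(fused)           # a_t
```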
The RGB-D model effectively utilizes depth images for 3D perception, but its deployment is limited by its reliance on depth input. To overcome this limitation and enable manipulation in the prevalent RGB-only setting, we introduce the Depth Completion Module (DCM). This module predicts the depth feature \( \hat{f}_t^{depth} \) from the RGB image feature \( f_t^{rgb} \), incorporating spatial prior knowledge \( P \): $$ \begin{equation} \begin{aligned} \hat{f}_{t}^{depth} = {\rm DCM}\left(f_t^{rgb}, P\right). \end{aligned} \end{equation} $$
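One plausible instantiation of the DCM is a lightweight transformer decoder in which learnable prior tokens \( P \) attend to the RGB features; the sketch below assumes this design purely for illustration and is not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DepthCompletionModule(nn.Module):
    """Sketch: predict depth features from RGB features using learnable spatial
    prior tokens P (all architecture choices here are assumptions)."""

    def __init__(self, dim=512, num_prior_tokens=16, num_layers=2, num_heads=8):
        super().__init__()
        # Learnable spatial prior tokens P, shared across all samples.
        self.prior_tokens = nn.Parameter(torch.randn(num_prior_tokens, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, f_rgb):                                    # f_rgb: (B, N_rgb, dim)
        batch = f_rgb.shape[0]
        prior = self.prior_tokens.unsqueeze(0).expand(batch, -1, -1)
        # Prior tokens query the RGB features to produce predicted depth features.
        return self.decoder(tgt=prior, memory=f_rgb)             # (B, num_prior_tokens, dim)
```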
To reduce the cumulative errors, we further propose a Depth-Aware Codebook to discretize the depth features predicted by the DCM: $$ \begin{equation} \begin{aligned} \widetilde{f}_{t}^{depth} = {\rm Codebook}\left(\hat{f}_{t}^{depth}\right). \end{aligned} \end{equation} $$
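Discretization via a codebook is typically implemented as a nearest-neighbor lookup over learned code vectors; the VQ-style sketch below (codebook size, straight-through gradient) reflects assumptions for illustration, not the exact DAC design.

```python
import torch
import torch.nn as nn

class DepthAwareCodebookSketch(nn.Module):
    """Sketch: snap each predicted depth feature to its nearest code vector."""

    def __init__(self, num_codes=512, dim=512):
        super().__init__()
        self.codes = nn.Embedding(num_codes, dim)

    def forward(self, f_depth_hat):                              # (B, N, dim)
        flat = f_depth_hat.reshape(-1, f_depth_hat.shape[-1])    # (B*N, dim)
        # Squared L2 distance from each feature to every code vector.
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2 * flat @ self.codes.weight.t()
                + self.codes.weight.pow(2).sum(1))
        idx = dist.argmin(dim=1)                                 # nearest code per feature
        quantized = self.codes(idx).view_as(f_depth_hat)
        # Straight-through estimator so gradients still reach the DCM.
        return f_depth_hat + (quantized - f_depth_hat).detach()
```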
When only RGB images are available during inference, we feed \( f^{text} \), \( f_t^{rgb} \), and the discretized predicted depth feature \( \widetilde{f}_{t}^{depth} \) together into the Vision-Language Model (VLM) to extract features. Ultimately, these features are fed into the policy model \( \pi \) to obtain the final action.
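Putting the pieces together, an RGB-only inference step might look like the sketch below, built on the hypothetical modules above (all names are assumptions).

```python
import torch

@torch.no_grad()
def infer_action(model, dcm, codebook, text_tokens, rgb_t):
    """Sketch of RGB-only inference: predict and discretize depth features,
    then run the usual VLM + policy path."""
    f_text = model.text_encoder(text_tokens)
    f_rgb = model.rgb_encoder(rgb_t)
    f_depth_hat = dcm(f_rgb)               # DCM: virtual depth features from RGB
    f_depth_tilde = codebook(f_depth_hat)  # Depth-Aware Codebook: discretization
    fused = model.vlm(f_text, f_rgb, f_depth_tilde)
    return model.policy(fused)             # final action a_t
```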
To estimate the upper bound of our method, we first conduct a preliminary experiment in which the model receives both RGB images and depth images as input. We compare the following methods:
1. RGB-RF: The standard RoboFlamingo architecture.
2. RGB-D-RF: Based on RGB-RF, we add an extra branch to extract features from depth images.
3. Data Aug: This method augments the training data by randomly dropping the RGB or depth modality input with a certain probability \( p \) (a minimal sketch of this modality dropout is given after this list).
4. MM Prompt: Building upon the Data Aug method, this method introduces an additional learnable token to indicate the current combination of input modalities (e.g., RGB-only, Depth-only, or RGB-D).
5. Ours∗: To utilize the ground-truth depth image, we replace the DCM described in Section III of the paper with the depth branch shown in the gray box in Figure 2 of the paper.
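For the Data Aug baseline, the modality dropout referenced in item 3 could be implemented roughly as follows; the zero-masking strategy is an assumption for illustration.

```python
import random
import torch

def modality_dropout(rgb_t, depth_t, p=0.3):
    """Sketch of the Data Aug baseline: with probability p, drop either the RGB or
    the depth input (masked with zeros) so the model sees all modality combinations."""
    if random.random() < p:
        if random.random() < 0.5:
            rgb_t = torch.zeros_like(rgb_t)      # drop RGB -> depth-only sample
        else:
            depth_t = torch.zeros_like(depth_t)  # drop depth -> RGB-only sample
    return rgb_t, depth_t
```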
Table I presents success rates of different models on the LIBERO benchmark when provided with RGB-D input. Our method achieves the best overall average success rate of 63.95%, nearly a 6% improvement over the baseline RGB-RF model.
We also compare our results with two cross-modal knowledge distillation methods:
1. CRD: This is a cross-modal knowledge distillation method based on a contrastive learning loss.
2. CMKD: This is a cross-modal knowledge distillation method based on a mean squared error (MSE) loss (a sketch of both losses follows this list).
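For reference, the two distillation objectives can be sketched as below; the in-batch contrastive form is a simplification of CRD (which uses a memory bank), and both functions are illustrative rather than the original implementations.

```python
import torch
import torch.nn.functional as F

def cmkd_mse_loss(f_student, f_teacher):
    """CMKD-style distillation: match student (RGB) features to teacher (depth) features with MSE."""
    return F.mse_loss(f_student, f_teacher.detach())

def crd_style_loss(f_student, f_teacher, temperature=0.07):
    """Simplified InfoNCE-style contrastive objective in the spirit of CRD."""
    s = F.normalize(f_student, dim=-1)            # (B, D)
    t = F.normalize(f_teacher.detach(), dim=-1)   # (B, D)
    logits = s @ t.t() / temperature              # cross-modal similarity matrix (B, B)
    labels = torch.arange(s.shape[0], device=s.device)
    return F.cross_entropy(logits, labels)        # positives lie on the diagonal
```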
The comparison results are shown in Table II.
The results are shown in Table IV.
In this paper, we propose the Depth Information Injection (\( {\rm DI}^2 \)) framework, which enhances pre-trained robot manipulation models that rely solely on RGB inputs by leveraging a minimal amount of aligned RGB-D trajectory data. Our framework centers on two primary modules. First, the Depth Completion Module (DCM) integrates spatial prior knowledge derived from the depth images in the training trajectories into the model; when operating with only RGB inputs during inference, the DCM leverages learnable tokens alongside RGB image features to accurately predict depth features. Second, because robot manipulation tasks are sequential, directly using the depth features predicted by the DCM can cause cumulative errors and lead to significant deviations from the intended trajectory. To address this challenge, we introduce the Depth-Aware Codebook, which discretizes the depth features predicted by the DCM and significantly improves prediction accuracy. The \( {\rm DI}^2 \) framework achieves superior results on the LIBERO benchmark, and the results of the real-world experiments demonstrate the reliability and applicability of our method in practical application scenarios.
@misc{pang2024depthhelpsimprovingpretrained,
title={Depth Helps: Improving Pre-trained RGB-based Policy with Depth Information Injection},
author={Xincheng Pang and Wenke Xia and Zhigang Wang and Bin Zhao and Di Hu and Dong Wang and Xuelong Li},
year={2024},
eprint={2408.05107},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2408.05107},
}