Play to the Score: Stage-Guided Dynamic Multi-Sensory Fusion for Robotic Manipulation

Conference on Robot Learning 2024 (Oral)

1 Gaoling School of Artificial Intelligence, Renmin University of China
2 Shenzhen Taobotics Co., Ltd.
3 Institute of Artificial Intelligence (TeleAI), China Telecom


Abstract

Humans possess a remarkable talent for flexibly alternating between different senses when interacting with the environment. Picture a chef skillfully gauging the timing of ingredient additions and controlling the heat according to colors, sounds, and aromas, seamlessly navigating every stage of a complex cooking process. This ability is founded upon a thorough comprehension of task stages, as achieving the sub-goal within each stage can necessitate the use of different senses. To endow robots with a similar ability, we incorporate task stages divided by sub-goals into the imitation learning process to guide dynamic multi-sensory fusion accordingly. We propose MS-Bot, a stage-guided dynamic multi-sensory fusion method with coarse-to-fine stage understanding, which dynamically adjusts the priority of modalities based on the fine-grained state within the predicted current stage. We train a robot system equipped with visual, auditory, and tactile sensors to accomplish challenging robotic manipulation tasks: pouring and peg insertion with keyway. Experimental results indicate that our approach enables more effective and explainable dynamic fusion, aligning more closely with the human fusion process than existing methods.

Modality Temporality



In a complex manipulation task, the importance of the various uni-modal features can change over stages. At timesteps from different stages, a particular modality may contribute significantly to the prediction, serve as a supplement to the primary modality, or provide little useful information. Using the pouring task as an example, vision plays a dominant role in the Aligning stage. Once in the Start Pouring stage, the model begins to use audio and tactile feedback to determine the appropriate pouring angle. During the Holding Still stage, the model primarily relies on audio and tactile deformation to assess the mass of the beads. In the final End Pouring stage, the model discerns the completion of pouring mainly from tactile deformation. Moreover, different states within a stage, such as its beginning and end, may also exhibit minor changes in modality importance. We distinguish these as coarse-grained and fine-grained importance changes, and summarize this as a challenge in multi-sensory imitation learning: Modality Temporality.

Method



To address the above challenge, we propose MS-Bot, a stage-guided dynamic multi-sensory fusion method with coarse-to-fine stage understanding. We first add a stage label \( s_t \) to each sample to form \( (\textbf{X}_t, a_t, s_t) \), where \( \textbf{X}_t \) is the multi-sensory observation at timestep \( t \) and \( a_t \) is the action label. We then train MS-Bot, which consists of four components:

  • Feature Extractor: This component consists of several uni-modal encoders. Each encoder takes a brief history of observations \( X_t^m \in \mathbb{R}^{T\times H_m \times W_m \times C_m} \) of modality \( m \) as input, where \( T \) is the number of timesteps in the brief history and \( H_m, W_m, C_m \) indicate the input shape of modality \( m \). These observations are encoded into feature tokens \( \mathbf{f}_t \in \mathbb{R}^{M\times T \times d} \), where \( M \) is the number of modalities and \( d \) is the feature dimension.
  • State Tokenizer: This component encodes the observations and the action history \( (a_1, a_2,...,a_{t-1}) \) into a token that represents the current state. The action history plays a role similar to human memory and helps indicate where the current state lies within the whole task. We feed the action history as a one-hot sequence into an LSTM, then concatenate its output with the feature tokens and encode them into a state token \( z^{state}_t \) through a Multi-Layer Perceptron (MLP).
  • Stage Comprehension Module: This module aims to perform coarse-to-fine stage understanding by injecting stage information into the state token. For a task with $S$ stages, we use \( S \) learnable stage tokens \( [stage_1],...,[stage_S] \) to represent each stage. We use a gate network (MLP) to predict the current stage, then multiply the softmax scores \( \mathbf{g}_t \) with the stage tokens and sum them up to obtain the stage token \( z^{stage}_t \) at timestep \( t \): $$ \begin{equation} \begin{gathered} \mathbf{g}_t = (g^{1}_t,...,g^{S}_t) = \mathrm{softmax}(\mathrm{MLP}(z^{state}_t)),\\ z^{stage}_t = \frac{1}{S} \sum_{j=1}^S(g^{j}_t \cdot [stage_j]). \end{gathered} \end{equation} $$ Finally, we compute the weighted sum of the state token $z^{state}_t$ and the current stage token $z^{stage}_t$ with a weight $\beta$ to obtain the stage-injected state token $z^{*}_t$: \begin{equation} z^{*}_t = \beta \cdot z^{state}_t + (1-\beta) \cdot z_t^{stage}. \end{equation} Unlike the original state token $z^{state}_t$, the new state token $z^{*}_t$ represents the fine-grained state within a stage: $z^{stage}_t$ serves as a stage-level anchor, while $z^{state}_t$ indicates the shift inside the stage, thereby achieving coarse-to-fine stage comprehension. During training, we use the stage labels to supervise the stage scores output by the gate network, applying a soft penalty loss $\mathcal{L}_{gate}$ to the $i$-th sample: \begin{equation} \begin{gathered} \mathcal{L}_{gate,i} = \sum_{j=1}^S (w_i^j\cdot g_i^j), \ j \in \{ 1,2,...,S \}, \\ w_i^j = \left \{ \begin{array}{ll} 0, & (s_i = j) \ \text{or} \ (\exists k,\ |k-i| \leq \gamma,\ s_i \ne s_k), \\ 1, & \text{otherwise}, \end{array} \right. \\ \end{gathered} \end{equation} where $k$ indexes a nearby sample in the same trajectory, $s_i$ and $s_k$ are stage labels, and $\gamma$ is a hyper-parameter that determines the range near the stage boundaries. A minimal sketch of this module and the soft penalty loss is given after this list.
  • Dynamic Fusion Module: This module dynamically selects the modalities of interest based on the fine-grained state within the current stage. We use the stage-injected state token $z^{*}_t$ as the query and the feature tokens $\mathbf{f}_t$ as keys and values for cross-attention, so that the features from all modalities are integrated into a fusion token $z^{fus}_t$ according to the current stage's requirements. Finally, the fusion token $z^{fus}_t$ is fed into an MLP to predict the next action $a_t$. We also introduce random attention blur: with probability $p$, the attention scores on the feature tokens are replaced by a uniform value $\frac{1}{M\times T}$, which prevents the model from simply memorizing the actions corresponding to particular attention patterns. A sketch of this module also follows the list.
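To make the data flow concrete, below is a minimal PyTorch sketch of the Stage Comprehension Module and the soft penalty loss. The class and function names (StageComprehension, gate_penalty_loss), the two-layer gate MLP, and the near_boundary flag are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class StageComprehension(nn.Module):
    """Gate network + learnable stage tokens + beta-weighted stage injection (sketch)."""

    def __init__(self, d_model: int, n_stages: int, beta: float = 0.5):
        super().__init__()
        self.stage_tokens = nn.Parameter(torch.randn(n_stages, d_model))   # [stage_1], ..., [stage_S]
        self.gate = nn.Sequential(                                         # gate network (MLP)
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, n_stages))
        self.beta = beta
        self.n_stages = n_stages

    def forward(self, z_state: torch.Tensor):
        # z_state: (B, d) state token from the State Tokenizer
        g = F.softmax(self.gate(z_state), dim=-1)                    # stage scores g_t, shape (B, S)
        z_stage = (g @ self.stage_tokens) / self.n_stages            # weighted sum of stage tokens
        z_star = self.beta * z_state + (1.0 - self.beta) * z_stage   # stage-injected state token
        return z_star, g


def gate_penalty_loss(g: torch.Tensor, stage_labels: torch.Tensor,
                      near_boundary: torch.Tensor) -> torch.Tensor:
    """Soft penalty on the gate scores (see the equation above).

    g: (B, S) softmax stage scores; stage_labels: (B,) ground-truth stages s_i;
    near_boundary: (B,) bool, True if some sample within gamma steps of sample i
    in the same trajectory carries a different stage label.
    """
    w = torch.ones_like(g)
    w.scatter_(1, stage_labels.unsqueeze(1), 0.0)   # no penalty on the labeled stage
    w[near_boundary] = 0.0                          # no penalty near stage boundaries
    return (w * g).sum(dim=1).mean()
```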
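In the same spirit, here is a sketch of the Dynamic Fusion Module as a single-head cross-attention with random attention blur during training; the single-head design, layer sizes, and names (DynamicFusion, blur_p) are again assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicFusion(nn.Module):
    """Cross-attention fusion driven by the stage-injected state token (sketch)."""

    def __init__(self, d_model: int, action_dim: int, blur_p: float = 0.1):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.policy = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                    nn.Linear(d_model, action_dim))
        self.blur_p = blur_p
        self.scale = d_model ** -0.5

    def forward(self, z_star: torch.Tensor, feat_tokens: torch.Tensor):
        # z_star: (B, d) query; feat_tokens: (B, M*T, d) keys/values
        q = self.q_proj(z_star).unsqueeze(1)                             # (B, 1, d)
        k, v = self.k_proj(feat_tokens), self.v_proj(feat_tokens)
        attn = F.softmax((q @ k.transpose(1, 2)) * self.scale, dim=-1)   # (B, 1, M*T)

        if self.training and torch.rand(()).item() < self.blur_p:
            # random attention blur: flatten the scores to a uniform 1 / (M*T)
            attn = torch.full_like(attn, 1.0 / attn.shape[-1])

        z_fus = (attn @ v).squeeze(1)       # fusion token z_fus, shape (B, d)
        return self.policy(z_fus), attn     # predicted action a_t and attention scores
```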

Experiments

We evaluate our method on two challenging robotic manipulation tasks: pouring and peg insertion with keyway. We compare our method with three baselines in both tasks:

  1. Concat: a model which directly concatenates all the uni-modal features.
  2. Du et al.: a model that uses an LSTM to fuse the uni-modal features together with additional proprioceptive information.
  3. MULSA: a model which fuses the uni-modal features via self-attention.

We also compare our method with two variants in each task:

  1. MS-Bot (w/o A/D): removing audio in pouring and depth in peg insertion for MS-Bot.
  2. MS-Bot (w/o T/R): removing touch in pouring and RGB in peg insertion for MS-Bot.

Main Experiments



Visualization

Visualization of the aggregated attention scores for each modality and the stage scores in the pouring task. At each timestep, we average the attention scores over all feature tokens of each modality separately. The stage score is the output of the gate network after softmax normalization.
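As a point of reference, the per-modality aggregation described here could be computed with a small helper like the one below; the modality-major ordering of the $M\times T$ feature tokens is an assumption.

```python
import torch


def per_modality_attention(attn: torch.Tensor, n_modalities: int, n_timesteps: int) -> torch.Tensor:
    # attn: (B, 1, M*T) cross-attention scores over the feature tokens,
    # assumed ordered modality-major: [m1_t1, ..., m1_tT, m2_t1, ...].
    attn = attn.reshape(attn.shape[0], n_modalities, n_timesteps)
    return attn.mean(dim=-1)  # (B, M): one aggregated score per modality
```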

Generalization

To verify the generalization of our method to distractors, we conduct experiments with visual distractors in both tasks. In the pouring task, we change the cylinder's color from white to red. For the peg insertion task, we alter the base color from black to green ("Color") and place clutter around the base ("Mess").

BibTeX

@inproceedings{feng2024play,
    title={Play to the Score: Stage-Guided Dynamic Multi-Sensory Fusion for Robotic Manipulation},
    author={Feng, Ruoxuan and Hu, Di and Ma, Wenke and Li, Xuelong},
    booktitle={8th Annual Conference on Robot Learning},
    year={2024},
    url={https://openreview.net/forum?id=N5IS6DzBmL}
}