Stepping Stones: A Progressive Training Strategy for Audio-Visual Semantic Segmentation

1University of Chinese Academy of Sciences,
2Beijing University of Posts and Telecommunications,
3Gaoling School of Artificial Intelligence, Renmin University of China,
4Engineering Research Center of Next-Generation Search and Recommendation
ECCV 2024

*Indicates Corresponding Author.

Abstract

Audio-Visual Segmentation (AVS) aims at pixel-level localization of sound sources in videos, while Audio-Visual Semantic Segmentation (AVSS), as an extension of AVS, further pursues semantic understanding of audio-visual scenes. However, since the AVSS task requires establishing audio-visual correspondence and semantic understanding simultaneously, we observe that previous methods struggle to handle this mixture of objectives in end-to-end training, resulting in insufficient learning and sub-optimal solutions. We therefore propose a two-stage training strategy called Stepping Stones, which decomposes the AVSS task into two simpler subtasks, from localization to semantic understanding; each subtask is fully optimized in its own stage so as to achieve step-by-step global optimization. This training strategy also proves general and effective when applied to existing methods. To further improve performance on AVS tasks, we propose a novel framework, Adaptive Audio Visual Segmentation (AAVS), which incorporates an adaptive audio query generator and integrates masked attention into the transformer decoder, facilitating the adaptive fusion of visual and audio features. Extensive experiments demonstrate that our methods achieve state-of-the-art results on all three AVS benchmarks.
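As a rough illustration, the decomposition can be read as two consecutive training loops: the first learns class-agnostic localization, the second learns semantics on top of the first stage's masks. The PyTorch sketch below is hypothetical; the `stage` flag and `mask_prior` argument of the assumed model interface are illustrative conventions, not the paper's released code.

```python
import torch
import torch.nn as nn

def train_stepping_stones(model, loader, optimizer, epochs_per_stage=30):
    """Hypothetical two-stage loop illustrating the Stepping Stones idea."""
    bce = nn.BCEWithLogitsLoss()   # stage 1: where is the sound source?
    ce = nn.CrossEntropyLoss()     # stage 2: what class is it?

    # Stage 1: class-agnostic localization. Semantic labels are collapsed
    # into a binary foreground mask, so only audio-visual correspondence
    # has to be learned here.
    for _ in range(epochs_per_stage):
        for frames, audio, sem_gt in loader:
            binary_gt = (sem_gt > 0).float()
            logits = model(frames, audio, stage=1)          # (B, 1, H, W)
            loss = bce(logits.squeeze(1), binary_gt)
            optimizer.zero_grad(); loss.backward(); optimizer.step()

    # Stage 2: semantic understanding. The frozen stage-1 prediction serves
    # as a localization prior (the "stepping stone") for the semantic head.
    for _ in range(epochs_per_stage):
        for frames, audio, sem_gt in loader:
            with torch.no_grad():
                prior = model(frames, audio, stage=1).sigmoid()
            logits = model(frames, audio, stage=2, mask_prior=prior)  # (B, C, H, W)
            loss = ce(logits, sem_gt.long())
            optimizer.zero_grad(); loss.backward(); optimizer.step()
```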

Method


Overview of the AAVS framework. (1) Visual and audio features are extracted by pre-trained encoders; (2) the Adaptive Audio Query Generator produces audio queries; (3) in the transformer decoder, audio-aware queries are integrated with visual feature maps, and masked cross-attention lets queries dynamically adjust their attention range; (4) finally, refined queries are merged with the mask feature to obtain the final prediction mask. Red arrows indicate components newly introduced when applying the Stepping Stones strategy.
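Steps (2) and (3) can be sketched in a few lines of PyTorch. The module names, query count, and the 0.5 foreground threshold below are illustrative assumptions rather than the released implementation; the masked cross-attention follows the Mask2Former-style convention of blocking each query from pixels its previous-layer mask predicts as background.

```python
import torch
import torch.nn as nn

class AdaptiveAudioQueryGenerator(nn.Module):
    """Sketch: learnable queries attend to audio features, so each query
    becomes conditioned on the sounding content of the clip."""
    def __init__(self, dim, num_queries=100, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio_feats):                 # audio_feats: (B, T, C)
        B = audio_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        out, _ = self.attn(q, audio_feats, audio_feats)
        return q + out                              # audio-aware queries (B, Q, C)

class MaskedCrossAttention(nn.Module):
    """Sketch: each audio-aware query attends only to pixels that its
    previous-layer mask prediction marks as foreground."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, queries, pixel_feats, mask_logits):
        # queries: (B, Q, C); pixel_feats: (B, HW, C); mask_logits: (B, Q, HW)
        attn_mask = mask_logits.sigmoid() < 0.5     # True = position blocked
        # If a query would block every pixel, unblock it to avoid NaNs.
        attn_mask[attn_mask.all(dim=-1)] = False
        attn_mask = attn_mask.repeat_interleave(self.attn.num_heads, dim=0)
        out, _ = self.attn(queries, pixel_feats, pixel_feats, attn_mask=attn_mask)
        return queries + out
```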

Quantitative Comparison


Quantitative (mIoU, F-score) results on the AVSBench dataset with a transformer-based visual backbone.

* indicates that the model uses the Stepping Stones strategy.
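For reference, the two reported metrics can be computed per frame roughly as follows. This is a minimal NumPy sketch assuming binary masks, with the F-score using the β² = 0.3 weighting common in AVSBench-style evaluation; the exact protocol (e.g. averaging over frames and classes) follows the benchmark's official code.

```python
import numpy as np

def miou(pred, gt, eps=1e-7):
    """Intersection-over-union for one binary mask pair."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return (inter + eps) / (union + eps)

def f_score(pred, gt, beta2=0.3, eps=1e-7):
    """Weighted harmonic mean of precision and recall (beta^2 = 0.3)."""
    tp = np.logical_and(pred, gt).sum()
    precision = tp / (pred.sum() + eps)
    recall = tp / (gt.sum() + eps)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + eps)
```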

Qualitative Comparison

Video cases will be updated soon.

Paper

BibTeX

@inproceedings{ma2024steppingstones,
        title={Stepping Stones: A Progressive Training Strategy for Audio-Visual Semantic Segmentation},
        author={Ma, Juncheng and Sun, Peiwen and Wang, Yaoting and Hu, Di},
        booktitle={European Conference on Computer Vision (ECCV)},
        year={2024},
}