Traditional reference segmentation tasks have predominantly focused on silent visual scenes, neglecting the integral role of multimodal perception and interaction in human experiences. In this work, we introduce a novel task called Reference Audio-Visual Segmentation (Ref-AVS), which seeks to segment objects within the visual domain based on expressions containing multimodal cues. Such expressions are articulated in natural language forms but are enriched with multimodal cues, including audio and visual descriptions.
For instance, as shown in the figure, Ref-AVS challenges machines to locate objects of interest in the visual space using multimodal cues, just as humans do in the real world; the figure also compares the Ref-AVS task with other related tasks.
To facilitate this research, we construct the first Ref-AVS benchmark, which provides pixel-level annotations for objects described in corresponding multimodal-cue expressions.
We manually collect videos from YouTube, covering 20 categories of musical instruments, 8 of animals, 15 of machines, and 5 of humans. Annotations are collected using our customized GSAI-Labeled system.
The videos and frames in Ref-AVS are publicly available on YouTube, and the annotations were collected via crowdsourcing. We explained to the crowdworkers how the data would be used. Our dataset does not contain personally identifiable information or offensive content.
Illustrations of examples from our Ref-AVS dataset. Our benchmark is designed to cover multimodal expressions along multiple dimensions; by combining expressions of different modality types, the dataset achieves great diversity.
Distribution of expressions and objects. The figure visualizes the co-occurrence of objects in our dataset, where we can observe a dense web of connections spanning various categories, such as musical instruments, people, vehicles, etc. This rich combination of categories indicates that our dataset is not limited to a narrow set of scenarios but encompasses a broad spectrum of real-life scenes where such objects naturally appear together.
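For readers who want to reproduce such co-occurrence statistics from the annotations, a minimal Python sketch is given below. It assumes only a list of per-video category lists as input; the function name cooccurrence and the input format are illustrative and not part of the released annotation schema.

from collections import Counter
from itertools import combinations

def cooccurrence(videos):
    """Count how often pairs of object categories appear in the same video.

    `videos` is a list of per-video category lists,
    e.g. [["guitar", "person"], ["dog", "car"]].
    """
    counts = Counter()
    for cats in videos:
        # Count each unordered pair of distinct categories once per video.
        for a, b in combinations(sorted(set(cats)), 2):
            counts[(a, b)] += 1
    return counts

# Example:
# cooccurrence([["guitar", "person"], ["guitar", "person", "dog"]])
# -> Counter({("guitar", "person"): 2, ("dog", "guitar"): 1, ("dog", "person"): 1})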
We design a customized audio-visual labeling system to collect expressions and segmentation labels, and all annotations are collected with this system. The flow chart of the labeling system is shown in the figure below.
Dataset collection pipeline. The pipeline keeps the overall process efficient and cost-effective, enabling the collection of high-quality samples.
Some video examples with reference expressions and segmentation annotations from the Ref-AVS dataset. These examples give a better understanding of the dataset and a more intuitive feel for referring segmentation in dynamic, complex audio-visual scenes.
Frames and audio: Available at Zenodo (16.1GB) or Baidu Disk (16.1GB) (pwd: eccv)
Frames and audio (null set): Available at Zenodo (1.1GB) or Baidu Disk (1.1GB) (pwd: eccv)
Ground truth: Available at Zenodo (0.2GB) or Baidu Disk (0.2GB) (pwd: eccv)
Annotations: Available for download at Zenodo or Baidu Disk (pwd: eccv)
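As a starting point for working with the downloaded archives, here is a minimal loading sketch. The directory names ("frames", "gt"), the file extensions, and the load_video_sample helper are assumptions for illustration only; adjust them to the actual layout of the extracted Zenodo/Baidu archives.

from pathlib import Path

import numpy as np
from PIL import Image

DATA_ROOT = Path("RefAVS")  # assumed root directory after extracting the archives

def load_video_sample(video_id: str):
    """Load frames and ground-truth masks for one video (assumed JPEG frames, PNG masks)."""
    frame_dir = DATA_ROOT / "frames" / video_id
    mask_dir = DATA_ROOT / "gt" / video_id
    frames = [np.array(Image.open(p)) for p in sorted(frame_dir.glob("*.jpg"))]
    masks = [np.array(Image.open(p)) for p in sorted(mask_dir.glob("*.png"))]
    return frames, masks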
If you find our work useful in your research, please cite our ECCV 2024 paper.
@inproceedings{Wang2022Ref,
title = {Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes},
author = {Wang, Yaoting and Sun, Peiwen and Zhou, Dongzhan and Li, Guangyao and Zhang, Honggang and Hu, Di},
booktitle = {European Conference on Computer Vision (ECCV)},
year = {2024},
}
The released Ref-AVS dataset is curated and may exhibit a potential correlation between instrument types and geographical areas. This issue warrants further research and consideration.
All datasets and benchmarks on this page are copyrighted by us and published under the Creative Commons Attribution-NonCommercial 4.0 International License. This means that you must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use. You may not use the material for commercial purposes.
To solve the Ref-AVS problem, we propose an audio-visual grounding model with per-mask segmentation that achieves scene understanding and grounding across the audio, visual, and language modalities. To benchmark different models, we use mIoU and F-score as the evaluation metrics. More details are in the [Paper] and [Supplementary].
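To make the evaluation protocol concrete, below is a minimal sketch of the two metrics for binary masks, assuming NumPy arrays as input and beta^2 = 0.3 for the F-score (a common choice in segmentation benchmarks; the exact setting follows the paper).

import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7) -> float:
    """Intersection-over-Union between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / (union + eps))

def f_score(pred: np.ndarray, gt: np.ndarray, beta2: float = 0.3, eps: float = 1e-7) -> float:
    """F-measure with beta^2 = 0.3 (assumed here, as in common segmentation benchmarks)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    precision = tp / (pred.sum() + eps)
    recall = tp / (gt.sum() + eps)
    return float((1 + beta2) * precision * recall / (beta2 * precision + recall + eps))

def evaluate(preds, gts):
    """Average mIoU and F-score over a list of (pred, gt) mask pairs."""
    ious = [iou(p, g) for p, g in zip(preds, gts)]
    fs = [f_score(p, g) for p, g in zip(preds, gts)]
    return float(np.mean(ious)), float(np.mean(fs))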
An overview of the proposed framework is illustrated in the figure below.
Details can be found in the [Paper].
To study different input modalities and validate the effectiveness of the proposed model, we conduct extensive ablations and compare with recent Refer-VOS and AVS approaches.
As shown in the table below, we use three test subsets to evaluate the comprehensive ability of Ref-AVS methods. Mix denotes the average performance over the Seen and Unseen test sets. We also use the Null test set to evaluate the robustness of multimodal-cue expression guidance.
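Since Mix is simply the unweighted average of the Seen and Unseen results, aggregating the reported numbers is a one-liner; the helper below is illustrative.

def mix_score(seen: float, unseen: float) -> float:
    # Mix is the unweighted average of the Seen and Unseen test-set scores.
    return 0.5 * (seen + unseen)

# Example: mix_score(50.0, 40.0) -> 45.0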
We conduct ablation studies to investigate the impact of the audio and text modalities on the Ref-AVS task, as well as the effectiveness of the proposed method.
We are a group of researchers working on computer vision at Renmin University of China, Beijing University of Posts and Telecommunications, and Shanghai AI Lab.