Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes

[ECCV2024]

    Yaoting Wang1,†, Peiwen Sun2,†, Dongzhan Zhou3,†, Guangyao Li1, Honggang Zhang2, Di Hu1,*
    1Renmin University of China, 2Beijing University of Posts and Telecommunications, 3Shanghai AI Lab

[Paper]  [Supplementary]  [arXiv]  [Code]

Update

  • 15 July 2024: Camera-ready version has been released here!
  • 14 July 2024: Code has been released here!
  • 01 July 2024: Our paper is accepted for publication at ECCV2024. Camera-ready version and code will be released soon!

What is Ref-AVS task?

Traditional reference segmentation tasks have predominantly focused on silent visual scenes, neglecting the integral role of multimodal perception and interaction in human experiences. In this work, we introduce a novel task called Reference Audio-Visual Segmentation (Ref-AVS), which seeks to segment objects within the visual domain based on expressions containing multimodal cues. Such expressions are articulated in natural language forms but are enriched with multimodal cues, including audio and visual descriptions.

For instance, as shown in the figure, Ref-AVS challenges machines to locate objects of interest in the visual space using multimodal cues, just as humans do in the real world; the figure also compares the Ref-AVS task with other related tasks.
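To make the task format concrete, each Ref-AVS sample pairs a short video clip (frames plus its audio track) and a multimodal-cue expression with per-frame masks of the referred object. The sketch below is a hypothetical schema for illustration only; the field names and paths are our assumptions, not the released format.

    # A minimal sketch of what a single Ref-AVS sample could look like.
    # Field names and paths are illustrative assumptions, not the official schema.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class RefAVSSample:
        video_id: str        # identifier of the YouTube-derived clip
        frames: List[str]    # sampled RGB frames (e.g. 10 per clip, consistent with 40,020 frames over 4,002 videos)
        audio: str           # path to the corresponding audio track
        expression: str      # natural-language expression enriched with audio/visual/temporal cues
        masks: List[str]     # per-frame binary masks of the referred object

    sample = RefAVSSample(
        video_id="example_0001",
        frames=[f"frames/example_0001/{i}.jpg" for i in range(10)],
        audio="audio/example_0001.wav",
        expression="the instrument played by the person on the left",
        masks=[f"gt/example_0001/{i}.png" for i in range(10)],
    )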


What is Ref-AVS dataset?

To facilitate this research, we construct the first Ref-AVS benchmark, which provides pixel-level annotations for objects described in corresponding multimodal-cue expressions.

Basic information

We manually collect videos from YouTube, covering 20 categories of musical instruments, 8 categories of animals, 15 categories of machines, and 5 categories of humans (48 categories in total). Annotations are collected using our customized GSAI labeling system.

Characteristics

  • 20,261 expressions
  • 48 categories
  • 4,002 videos
  • 40,020 frames
  • 6,888 objects
  • Audio, visual, temporal
  • Pixel-level annotation
  • Diversity, complexity, and dynamics

Personal data/Human subjects

Videos and frames in Ref-AVS are publicly available on YouTube and are annotated via crowdsourcing. We explained to the crowdworkers how the data would be used. Our dataset does not contain personally identifiable information or offensive content.

Ref-AVS Dataset

Some graphical representations of our dataset and annotations

Illustrations of Ref-AVS dataset examples. Our benchmark is meticulously designed to encompass multimodal expressions from multiple dimensions. By combining various types of modality expressions, we achieve a dataset that exhibits great diversity.

The distribution of expressions and objects. The figure visualizes the co-occurrence of objects in our dataset, where we can observe a dense web of connections spanning various categories, such as musical instruments, people, vehicles, etc. The rich combination of categories indicates that our dataset is not limited to a narrow set of scenarios but rather encompasses a broad spectrum of real-life scenes where such objects naturally appear together.
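The co-occurrence statistics behind this figure can be reproduced from per-video category annotations by counting, for every pair of categories, how many videos contain both. A minimal sketch, assuming each video is annotated with the list of object categories it contains (the example data are made up):

    # Counting category co-occurrence from per-video category lists.
    # The video IDs and category lists below are made-up examples.
    from collections import Counter
    from itertools import combinations

    video_categories = {
        "example_0001": ["violin", "piano", "man"],
        "example_0002": ["dog", "man"],
        "example_0003": ["violin", "man"],
    }

    cooccurrence = Counter()
    for cats in video_categories.values():
        for a, b in combinations(sorted(set(cats)), 2):
            cooccurrence[(a, b)] += 1

    print(cooccurrence.most_common(3))
    # e.g. [(('man', 'violin'), 2), (('man', 'piano'), 1), (('dog', 'man'), 1)]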


How was Ref-AVS dataset made?


We design a customized audio-visual labeling system to collect expressions and segmentation labels, and all annotations are collected with this system. The flow chart of the labeling system is shown in the figure below.

Dataset collection pipeline. The pipeline ensures the efficiency and cost-effectiveness of the overall annotation process, leading to the acquisition of high-quality samples.


More video examples


Some video examples with multimodal-cue expressions in the Ref-AVS dataset. Through these examples, we can better understand the dataset and more intuitively appreciate the reference segmentation task in dynamic and complex audio-visual scenes.

Download

Dataset publicly available for research purposes

Data and Download


Frames and audio: Available at Zenodo (16.1GB) or Baidu Disk (16.1GB) (pwd: eccv)

Frames and audio (null set): Available at Zenodo (1.1GB) or Baidu Disk (1.1GB) (pwd: eccv)

Ground truth: Available at Zenodo (0.2GB) or Baidu Disk (0.2GB) (pwd: eccv)

Annotations: Available for download at Zenodo or Baidu Disk (pwd: eccv)
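Once downloaded, the frames, audio, and ground-truth masks can be wrapped in a small loader for quick experimentation. The directory layout and file names below are assumptions made purely for illustration; adjust them to the actual structure of the released archives.

    # A minimal loading sketch. The layout
    #   <root>/frames/<video_id>/*.jpg, <root>/audio/<video_id>.wav, <root>/gt/<video_id>/*.png
    # is an assumption, not the official structure of the released archives.
    import os
    from glob import glob

    import torchaudio          # pip install torchaudio
    from PIL import Image      # pip install pillow

    def load_sample(root: str, video_id: str):
        frame_paths = sorted(glob(os.path.join(root, "frames", video_id, "*.jpg")))
        mask_paths = sorted(glob(os.path.join(root, "gt", video_id, "*.png")))

        frames = [Image.open(p).convert("RGB") for p in frame_paths]
        masks = [Image.open(p).convert("L") for p in mask_paths]
        waveform, sample_rate = torchaudio.load(os.path.join(root, "audio", f"{video_id}.wav"))

        return frames, masks, waveform, sample_rate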

Publication(s)

If you find our work useful in your research, please cite our ECCV 2024 paper.

        
        @inproceedings{Wang2024Ref,
          title     = {Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes},
          author    = {Wang, Yaoting and Sun, Peiwen and Zhou, Dongzhan and Li, Guangyao and Zhang, Honggang and Hu, Di},
          booktitle = {European Conference on Computer Vision (ECCV)},
          year      = {2024}
        }
        

Disclaimer

The released Ref-AVS dataset is curated and may therefore contain potential correlations between instrument types and geographical areas. This issue warrants further research and consideration.


Copyright Creative Commons License

All datasets and benchmarks on this page are copyright by us and published under the Creative Commons Attribution-NonCommercial 4.0 International License. This means that you must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use. You may not use the material for commercial purposes.

A Simple Baseline for Ref-AVS

Audio-visual spatial-temporal Ref-AVS method, experimental results, and a brief analysis

To solve the Ref-AVS problem, we propose an audio-visual grounding model with per-mask segmentation to achieve scene understanding and grounding over the audio, visual, and language modalities. To benchmark different models, we use mIoU and F-score as the evaluation metrics. More details can be found in the [Paper] and [Supplementary].
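For reference, the sketch below shows one common way to compute the two metrics on binary masks; the exact evaluation protocol (thresholding, averaging order, F-measure beta) follows the paper, so treat this formulation as an assumption rather than the official implementation.

    # A common formulation of IoU and F-score for binary segmentation masks.
    # The F-measure with beta^2 = 0.3 is a frequent choice in segmentation
    # benchmarks; whether Ref-AVS uses exactly this setting is an assumption here.
    import numpy as np

    def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
        """Intersection-over-union between two binary masks of the same shape."""
        pred, gt = pred.astype(bool), gt.astype(bool)
        union = np.logical_or(pred, gt).sum()
        inter = np.logical_and(pred, gt).sum()
        return inter / union if union > 0 else 1.0

    def f_score(pred: np.ndarray, gt: np.ndarray, beta2: float = 0.3) -> float:
        """F-measure combining mask precision and recall."""
        pred, gt = pred.astype(bool), gt.astype(bool)
        tp = np.logical_and(pred, gt).sum()
        precision = tp / pred.sum() if pred.sum() > 0 else 0.0
        recall = tp / gt.sum() if gt.sum() > 0 else 0.0
        if precision + recall == 0:
            return 0.0
        return (1 + beta2) * precision * recall / (beta2 * precision + recall)

    # mIoU is then the mean of per-frame (or per-video) IoU over a test set.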

Pipeline

An overview of the proposed framework is illustrated in the figure below.

Details can be found in the [Paper].


Experiments

To study different input modalities and validate the effectiveness of the proposed model, we conduct extensive ablations of our model and compare it with recent Refer-VOS and AVS approaches.

As shown in the table below, we use three test subsets to evaluate the comprehensive ability of Ref-AVS methods. Mix is the average performance over the Seen and Unseen test sets. We also use the Null test set to evaluate the robustness of multimodal-cue expression guidance.

We conduct ablation studies to investigate the impact of the audio and text modalities on the Ref-AVS task, as well as the effectiveness of the proposed method.

The Team

We are a group of researchers working in computer vision from Renmin University of China, Beijing University of Posts and Telecommunications, and Shanghai AI Lab.


Acknowledgement

  • This research was supported by Public Computing Cloud, Renmin University of China.
  • This web-page design is inspired by the EPIC official website.