Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation

Computer Vision and Pattern Recognition (CVPR) 2025

Henghui Du1,3, Guangyao Li2, Chang Zhou3, Chunjie Zhang3, Alan Zhao3, Di Hu1
1Renmin University of China     2Tsinghua University
3AI Technology Center, Online Video Business Unit, Tencent PCG

Overview



We present Crab, a unified audio-visual scene understanding model with explicit cooperation, which can complete various audio-visual tasks. It is trained on an instruction-tuning dataset with an explicit reasoning process, which clarifies the cooperative relationships among tasks. Furthermore, to alleviate the interference caused by learning from complex audio-visual data and to facilitate concrete cooperation, an interaction-aware LoRA structure is designed that enables the model to focus on different aspects of data interaction.

AV-UIE: Audio-Visual Unified Instruction-tuning dataset with Explicit reasoning process



AV-UIE is an audio-visual unified instruction-tuning dataset with an explicit reasoning process, built primarily by augmenting existing audio-visual task datasets. Through the above dataset construction process, a detailed reasoning process can be obtained that contains rich temporal and spatial information, which is conducive to temporal and spatial localization tasks.
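To make the idea of an instruction-tuning sample with an explicit reasoning process concrete, here is a minimal sketch of what one AV-UIE entry might look like. The field names and values are illustrative assumptions, not the dataset's actual schema:

```python
# Hypothetical AV-UIE-style sample: field names are assumptions for
# illustration only, not the dataset's actual schema.
sample = {
    "video": "path/to/clip.mp4",
    "audio": "path/to/clip.wav",
    "task": "AVE",  # e.g. audio-visual event localization
    "instruction": "Which event occurs in the clip, and during which seconds?",
    # The explicit reasoning step carries the temporal/spatial cues
    # that other tasks (e.g. AVQA, AVS) can also exploit.
    "reasoning": (
        "The audio contains barking between 2s and 6s; the frames show "
        "a dog on the left of the scene over the same interval."
    ),
    "answer": "dog barking, 2s-6s",
}

print(sorted(sample.keys()))
```

The point of keeping `reasoning` as a separate field is that the same kind of intermediate evidence (when and where an event happens) is shared across tasks, which is what lets the model learn cooperative relationships.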

Crab Model



The Crab model mainly consists of two parts: a unified audio-visual interface composed of three multimodal branches, and a large language model with an interaction-aware LoRA structure. The audio branch and visual branch process audio and video inputs respectively, while the segmentation branch is responsible for outputting segmentation masks. The model is trained on our AV-UIE dataset, which clarifies the cooperative relationships among tasks, as marked by different colors on the right side of the figure; content of the same color across different tasks helps the model establish cooperative relationships among them. Furthermore, to alleviate the interference caused by learning from complex audio-visual data, we design an interaction-aware LoRA structure to facilitate concrete cooperation.
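As a rough illustration of what an interaction-aware LoRA structure could look like, the sketch below augments a frozen linear layer with several low-rank adapter pairs, one per interaction aspect, mixed by a learned per-token router. This is a minimal assumption-laden sketch in PyTorch, not the paper's actual implementation; the number of adapters, the routing mechanism, and all names here are hypothetical:

```python
import torch
import torch.nn as nn


class InteractionAwareLoRA(nn.Module):
    """Hypothetical sketch: a frozen base projection plus several LoRA
    adapter pairs (one per interaction aspect), combined by a learned
    router. The actual Crab design may differ."""

    def __init__(self, dim: int, num_adapters: int = 3, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        # Freeze the pretrained weight; only adapters and router train.
        self.base.weight.requires_grad_(False)
        self.base.bias.requires_grad_(False)
        self.down = nn.ModuleList(
            nn.Linear(dim, rank, bias=False) for _ in range(num_adapters)
        )
        self.up = nn.ModuleList(
            nn.Linear(rank, dim, bias=False) for _ in range(num_adapters)
        )
        for u in self.up:
            nn.init.zeros_(u.weight)  # standard LoRA init: zero update at start
        self.router = nn.Linear(dim, num_adapters)
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); router assigns per-token adapter weights.
        weights = torch.softmax(self.router(x), dim=-1)
        out = self.base(x)
        for i, (down, up) in enumerate(zip(self.down, self.up)):
            out = out + self.scale * weights[..., i : i + 1] * up(down(x))
        return out


layer = InteractionAwareLoRA(dim=32)
y = layer(torch.randn(2, 5, 32))
print(y.shape)  # torch.Size([2, 5, 32])
```

The routing lets different tokens lean on different adapters, which is one plausible way to let the model attend to different aspects of audio-visual data interaction without the adapters interfering with each other.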

Experiments

Comparison with general models

Comparison with specialized models on the AVE & AVVP datasets

Comparison with specialized models on the AVS & Ref-AVS datasets

Comparison with specialized models on the MUSIC-AVQA dataset

Visualized results

BibTeX

@article{du2025crab,
  title={Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation},
  author={Du, Henghui and Li, Guangyao and Zhou, Chang and Zhang, Chunjie and Zhao, Alan and Hu, Di},
  journal={arXiv preprint arXiv:2503.13068},
  year={2025}
}