Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation

Computer Vision and Pattern Recognition (CVPR) 2025

Henghui Du1,3, Guangyao Li2, Chang Zhou3, Chunjie Zhang3, Alan Zhao3, Di Hu1
1Renmin University of China     2Tsinghua University
3AI Technology Center, Online Video Business Unit, Tencent PCG

Overview



We present Crab, a unified audio-visual scene understanding model with explicit cooperation, which can complete various audio-visual tasks. It is trained on an instruction-tuning dataset with an explicit reasoning process, which clarifies the cooperative relationships among tasks. Furthermore, to alleviate the interference caused by learning from complex audio-visual data and to facilitate concrete cooperation, an interaction-aware LoRA structure is designed that enables the model to focus on different aspects of data interaction.

AV-UIE: Audio-Visual Unified Instruction-tuning dataset with Explicit reasoning process



AV-UIE is an audio-visual unified instruction-tuning dataset with an explicit reasoning process, built primarily by augmenting existing audio-visual task datasets. Through the above dataset construction process, a detailed reasoning process can be obtained that contains rich temporal and spatial information, which is conducive to temporal and spatial localization tasks.
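To make the idea of an instruction-tuning sample with an explicit reasoning process concrete, here is a minimal sketch of what one AV-UIE entry might look like. The field names and values are illustrative assumptions, not the dataset's actual schema:

```python
# Hypothetical AV-UIE-style sample: field names are assumptions for
# illustration only, not the dataset's actual schema.
sample = {
    "video": "path/to/clip.mp4",
    "audio": "path/to/clip.wav",
    "task": "AVE",  # e.g. audio-visual event localization
    "instruction": "Which event occurs in the clip, and during which seconds?",
    # The explicit reasoning step carries the temporal/spatial cues
    # that other tasks (e.g. AVQA, AVS) can also exploit.
    "reasoning": (
        "The audio contains barking between 2s and 6s; the frames show "
        "a dog on the left of the scene over the same interval."
    ),
    "answer": "dog barking, 2s-6s",
}

print(sorted(sample.keys()))
```

The point of keeping `reasoning` as a separate field is that the same kind of intermediate evidence (when and where an event happens) is shared across tasks, which is what lets the model learn cooperative relationships.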

Crab Model



The Crab model mainly consists of two parts: a unified audio-visual interface composed of three multimodal branches, and a large language model with an interaction-aware LoRA structure. The audio branch and visual branch process audio and video inputs respectively, while the segmentation branch is responsible for outputting segmentation masks. The model is trained on our AV-UIE dataset, which clarifies the cooperative relationships among tasks, as marked by different colors on the right side of the figure; content of the same color across different tasks helps the model establish cooperative relationships among them. Furthermore, to alleviate the interference caused by learning from complex audio-visual data, we design an interaction-aware LoRA structure to facilitate concrete cooperation.
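As a rough illustration of what an interaction-aware LoRA structure could look like, the sketch below augments a frozen linear layer with several low-rank adapter pairs, one per interaction aspect, mixed by a learned per-token router. This is a minimal assumption-laden sketch in PyTorch, not the paper's actual implementation; the number of adapters, the routing mechanism, and all names here are hypothetical:

```python
import torch
import torch.nn as nn


class InteractionAwareLoRA(nn.Module):
    """Hypothetical sketch: a frozen base projection plus several LoRA
    adapter pairs (one per interaction aspect), combined by a learned
    router. The actual Crab design may differ."""

    def __init__(self, dim: int, num_adapters: int = 3, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        # Freeze the pretrained weight; only adapters and router train.
        self.base.weight.requires_grad_(False)
        self.base.bias.requires_grad_(False)
        self.down = nn.ModuleList(
            nn.Linear(dim, rank, bias=False) for _ in range(num_adapters)
        )
        self.up = nn.ModuleList(
            nn.Linear(rank, dim, bias=False) for _ in range(num_adapters)
        )
        for u in self.up:
            nn.init.zeros_(u.weight)  # standard LoRA init: zero update at start
        self.router = nn.Linear(dim, num_adapters)
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); router assigns per-token adapter weights.
        weights = torch.softmax(self.router(x), dim=-1)
        out = self.base(x)
        for i, (down, up) in enumerate(zip(self.down, self.up)):
            out = out + self.scale * weights[..., i : i + 1] * up(down(x))
        return out


layer = InteractionAwareLoRA(dim=32)
y = layer(torch.randn(2, 5, 32))
print(y.shape)  # torch.Size([2, 5, 32])
```

The routing lets different tokens lean on different adapters, which is one plausible way to let the model attend to different aspects of audio-visual data interaction without the adapters interfering with each other.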

Experiments

Comparison with general models

Comparison with specialized models on the AVE & AVVP datasets

Comparison with specialized models on the AVS & Ref-AVS datasets

Comparison with specialized models on the MUSIC-AVQA dataset

Visualized results

BibTeX

@article{du2025crab,
  title={Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation},
  author={Du, Henghui and Li, Guangyao and Zhou, Chang and Zhang, Chunjie and Zhao, Alan and Hu, Di},
  journal={arXiv preprint arXiv:2503.13068},
  year={2025}
}