Unveiling and Mitigating Bias in Audio Visual Segmentation

1 Beijing University of Posts and Telecommunications,
2 Renmin University of China
* Indicates Corresponding Author.

Abstract

Community researchers have developed a range of advanced audio-visual segmentation models aimed at improving the quality of sounding objects' masks. Although the masks produced by these models may initially appear plausible, they occasionally exhibit anomalies with incorrect grounding logic. We attribute this to real-world inherent preferences and distributions, which serve as a simpler learning signal than complex audio-visual grounding and thus lead the model to disregard important modality information. These anomalous phenomena are often complex and cannot be directly observed systematically. In this study, we make a pioneering effort with carefully designed synthetic data to categorize and analyze the phenomena into two types, “audio priming bias” and “visual prior”, according to the source of the anomalies. For audio priming bias, to enhance audio sensitivity to different intensities and semantics, a perception module dedicated to audio perceives latent semantic information and incorporates it into a limited set of queries, namely active queries. Moreover, the interaction mechanism of these active queries in the transformer decoder is customized to regulate interaction among audio semantics. For visual prior, multiple contrastive training strategies are explored to optimize the model by incorporating a biased branch, without even changing the structure of the model. Our experiments show both the presence of these biases in existing models and the impact they produce. Finally, through experimental evaluation on AVS benchmarks, we demonstrate the effectiveness of our methods in handling both types of bias, achieving competitive performance across all three subsets.

Problem and Analysis



Audio priming bias: The phenomenon in which the model tends to focus on the salient content of the audio rather than its whole content is called “audio priming bias”.

  1. Audio with different intensities demonstrates varying guiding capability, as shown in the green block.
  2. When other variables, including volume, are controlled, the box plot shows a clear variance in guiding capability across different semantics (a minimal probing sketch follows this list).
  3. When multiple audio sources are present simultaneously, overlaying the audio does not always result in the corresponding separate masks being superimposed.
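One way to observe the intensity-related part of this bias is to re-run segmentation on the same frames while rescaling only the audio amplitude. The sketch below is purely illustrative and not the paper's evaluation code; `avs_model` and its `predict_mask(frames, audio)` interface are assumptions standing in for any pretrained AVS model.

```python
# Minimal probing sketch (assumption: `avs_model.predict_mask(frames, audio)`
# returns a per-pixel probability map for the sounding object).
import numpy as np

def miou(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-6) -> float:
    """Binary-mask IoU between a prediction and the ground truth."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / (union + eps))

def probe_intensity(avs_model, frames, audio, gt_mask, gains=(0.25, 0.5, 1.0, 2.0)):
    """Re-run segmentation with identical frames but rescaled audio amplitude.
    Large swings in mIoU across gains indicate intensity-driven guiding capability."""
    scores = {}
    for g in gains:
        pred = avs_model.predict_mask(frames, np.clip(audio * g, -1.0, 1.0)) > 0.5
        scores[g] = miou(pred, gt_mask > 0.5)
    return scores
```

The same loop can be repeated with semantically different audio clips at a fixed volume to expose the semantics-dependent variance described in item 2.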

Visual prior: The phenomenon in which the model directly segments commonly sounding objects is called “visual prior”.

  1. Statistically, both the occurrence frequency and the sounding probability of different semantics are imbalanced in the dataset (see the counting sketch after this list).
  2. Such preferences and distributions provide strong prior information, inclining the model toward statistically plausible results rather than the desired, more challenging grounding behavior.
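The imbalance in item 1 can be quantified directly from annotations. The following is a minimal counting sketch under an assumed annotation format (per-frame dicts listing the classes present and the subset that is actually sounding); the dataset's real annotation schema may differ.

```python
# Counting sketch (assumed annotation format: {"present": [...], "sounding": [...]}).
from collections import Counter

def class_statistics(annotations):
    """Return per-class occurrence frequency and empirical sounding probability."""
    present, sounding = Counter(), Counter()
    for ann in annotations:
        present.update(ann["present"])
        sounding.update(ann["sounding"])
    total = sum(present.values())
    freq = {c: n / total for c, n in present.items()}           # how often a class appears
    p_sound = {c: sounding[c] / n for c, n in present.items()}  # how often it sounds when present
    return freq, p_sound
```

Skewed values of `freq` and `p_sound` are exactly the statistically learnable shortcut that lets a model segment a “usually sounding” object while ignoring the audio.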

Method

Mitigating audio priming bias requires an enhancement mechanism in the transformer decoder, while mitigating the visual prior requires reorganizing the distribution after the logits are obtained.



On one hand, to enhance sensitivity to audio with different intensity and semantic attributes and to promote their cooperation, we introduce semantic-aware active queries, built upon a perception module and an interaction enhancement mechanism.
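The sketch below illustrates the general idea only, not the paper's exact architecture or hyper-parameters: an audio perception module injects latent audio semantics into a small set of “active” queries, and a custom self-attention mask regulates how active and ordinary queries interact in the decoder. All module names and dimensions here are illustrative assumptions.

```python
# Illustrative sketch (assumptions: feature dimensions, number of active queries,
# and the masking rule are placeholders, not the published design).
import torch
import torch.nn as nn

class AudioPerception(nn.Module):
    def __init__(self, audio_dim=128, query_dim=256, num_active=4, num_queries=100):
        super().__init__()
        self.num_active = num_active
        self.proj = nn.Sequential(nn.Linear(audio_dim, query_dim), nn.ReLU(),
                                  nn.Linear(query_dim, num_active * query_dim))
        self.queries = nn.Embedding(num_queries, query_dim)  # learnable base queries

    def forward(self, audio_feat):                            # audio_feat: (B, audio_dim)
        B = audio_feat.size(0)
        base = self.queries.weight.unsqueeze(0).expand(B, -1, -1)    # (B, Q, C)
        active = self.proj(audio_feat).view(B, self.num_active, -1)  # (B, A, C)
        # The first `num_active` queries carry audio semantics; the rest stay generic.
        return torch.cat([base[:, : self.num_active] + active,
                          base[:, self.num_active:]], dim=1)

def active_attention_mask(num_queries=100, num_active=4):
    """Boolean self-attention mask for the decoder (True = attention blocked):
    active queries attend everywhere, ordinary queries cannot attend to active ones."""
    mask = torch.zeros(num_queries, num_queries, dtype=torch.bool)
    mask[num_active:, :num_active] = True
    return mask
```

The returned mask can be passed as `attn_mask` to a standard transformer decoder layer, which is one simple way to customize interaction among audio-conditioned queries.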


On the other hand, multiple contrastive training strategies are explored to optimize the debiased model and reorganize the logits without modifying the model structure.
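As one common instantiation of training with a biased branch (not necessarily the exact strategy used in the paper), the biased branch can absorb the statistical shortcut during training so that the main branch is pushed toward audio-grounded evidence; at inference the biased branch is dropped and only the main logits are used. The loss below is a hedged sketch of that idea.

```python
# Debiasing sketch (assumptions: `main_logits` come from the full audio-visual model,
# `bias_logits` from a visual-only biased branch, `target` holds class indices;
# the paper's actual contrastive objectives may differ).
import torch.nn.functional as F

def debias_loss(main_logits, bias_logits, target):
    """Product-of-experts style fusion: the detached biased branch explains the
    easy prior, so the main branch is trained on what the bias cannot explain.
    At inference, only main_logits are used, leaving the model structure unchanged."""
    fused = F.log_softmax(main_logits, dim=-1) + F.log_softmax(bias_logits.detach(), dim=-1)
    return F.cross_entropy(fused, target)
```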

Quantitative Comparison


Quantitative (mIoU, F-score) results on the AVSBench benchmarks with a transformer-based visual backbone.

BibTeX

@article{sun2024unveiling,
  title={Unveiling and Mitigating Bias in Audio Visual Segmentation},
  author={Sun, Peiwen and Zhang, Honggang and Hu, Di},
  journal={Proceedings of the 32nd ACM International Conference on Multimedia (ACM MM)},
  year={2024},
}