MMCosine: Multi-Modal Cosine Loss Towards Balanced Audio-Visual Fine-Grained Learning

Ruize Xu1, Ruoxuan Feng1, Shi-xiong Zhang2, Di Hu1

1Gaoling School of Artificial Intelligence, Renmin University of China,
2Tencent AI Lab

Abstract

Audio-visual learning helps to comprehensively understand the world by fusing complementary information from multiple modalities. However, recent studies show that the imbalanced optimization of uni-modal encoders in a jointly-trained model is a bottleneck for improving the model's performance. We further find that up-to-date imbalance-mitigating methods fail on some audio-visual fine-grained tasks, which place a higher demand on distinguishable feature distributions.

Fueled by the success of cosine loss, which builds hyperspherical feature spaces and achieves lower intra-class angular variability, this paper proposes the Multi-Modal Cosine loss, MMCosine. It performs modality-wise $L_2$ normalization of features and weights towards balanced and better multi-modal fine-grained learning. We demonstrate that our method alleviates the imbalanced optimization from the perspective of weight norms and fully exploits the discriminability of the cosine metric.

Extensive experiments prove the effectiveness of our method and its versatility with advanced multi-modal fusion strategies and up-to-date imbalance-mitigating methods.

Imbalanced optimization in multi-modal co-learning



(a, b) In the end-to-end training of an audio-visual concatenation-based network for classification, the dominant audio modality quickly takes over the overall model performance and the joint logit scores, while the visual modality remains under-optimized. (c, d) Further tracking of the modality-wise norms of the weight vectors indicates that the weight norm of the easily-trained audio encoder grows much faster than that of the weak visual modality.
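One simple way to reproduce such tracking is to log the average $L_2$ norm of each modality's slice of the concatenation-based classifier during training. Below is a minimal sketch (our own, not from the released code), assuming a fusion head `fc` whose first `d_a` input columns correspond to the audio feature:

```python
import torch

@torch.no_grad()
def modality_weight_norms(fc, d_a):
    """Average L2 norm of the per-class weight vectors, split by modality.

    fc:  concatenation-based classifier with weight of shape (C, D_a + D_v)
    d_a: dimensionality of the audio feature (first d_a columns)
    """
    w_a, w_v = fc.weight[:, :d_a], fc.weight[:, d_a:]
    return w_a.norm(dim=1).mean().item(), w_v.norm(dim=1).mean().item()

# log once per epoch, e.g.:
# norm_a, norm_v = modality_weight_norms(model.fusion_fc, d_a=512)
```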

Method



To deal with the above problem, we propose the Multi-Modal Cosine loss, MMCosine. The main steps are (a) modality-wise normalization of weights and features to mitigate the imbalance, and (b) scaling by a hyperparameter $s$ to guarantee convergence.
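As a concrete illustration, here is a minimal PyTorch sketch of the two steps. The names `feat_a`, `feat_v` (uni-modal features), `W_a`, `W_v` (per-modality classifier weights), and the value of `s` are our own choices for the example; the fusion shown is a simple summation of scaled cosine logits:

```python
import torch
import torch.nn.functional as F

def mmcosine_logits(feat_a, feat_v, W_a, W_v, s=10.0):
    """MMCosine logits: modality-wise L2 normalization of features and
    classifier weights, followed by scaling with hyperparameter s.

    feat_a, feat_v: (B, D) uni-modal features
    W_a, W_v:       (C, D) per-modality classifier weight matrices
    """
    # (a) Modality-wise normalization: each logit becomes a pure cosine
    # similarity, so a large weight norm of the dominant modality can no
    # longer inflate its contribution to the joint logit.
    cos_a = F.linear(F.normalize(feat_a, dim=1), F.normalize(W_a, dim=1))
    cos_v = F.linear(F.normalize(feat_v, dim=1), F.normalize(W_v, dim=1))
    # (b) Scaling with s to guarantee convergence of the softmax loss.
    return s * (cos_a + cos_v)

# usage: standard cross-entropy on the summed cosine logits
# logits = mmcosine_logits(feat_a, feat_v, fc_a.weight, fc_v.weight)
# loss = F.cross_entropy(logits, labels)
```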

We also give the lower bound of $s$ given the expected posterior probability $p$ of the ground-truth label and the total number of classes $C$; the derivation can be found in the supplementary material. $$s\geq \frac{C-1}{2(C+1)}\log\frac{(C-1)p}{1-p}$$
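For a quick numerical check of the bound, the following helper (our own, not from the paper's code) evaluates the right-hand side for a given $C$ and $p$:

```python
import math

def s_lower_bound(C, p):
    """Lower bound on the scale s for C classes and expected posterior
    probability p of the ground-truth label (natural log)."""
    return (C - 1) / (2 * (C + 1)) * math.log((C - 1) * p / (1 - p))

# e.g. C = 100 classes and p = 0.9 give s >= ~3.3
print(s_lower_bound(100, 0.9))
```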

Results



$\dagger$ indicates that MMCosine is applied. Combined with MMCosine, most fusion methods gain considerable improvement on datasets of various scales, domains, and label counts.

Gap Mitigation

MMCosine reduces the performance gap between the uni-modal encoders, boosting both the weak modality and the joint model.

Cosine discriminability

The learned angles between uni-modal features and their ground-truth class centers become more compact: MMCosine lowers the intra-class angular variation and maximizes the discriminability of the cosine metric.
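This angular compactness can be measured directly. A hypothetical helper (ours, not part of the released code) that computes the mean angle between uni-modal features and the normalized weight vectors serving as their ground-truth class centers:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_intra_class_angle(feats, labels, W):
    """Mean angle (degrees) between features and the weight vector of
    their ground-truth class, i.e. the class center on the hypersphere.

    feats:  (N, D) uni-modal features
    labels: (N,)   ground-truth class indices
    W:      (C, D) classifier weights of that modality
    """
    cos = F.linear(F.normalize(feats, dim=1), F.normalize(W, dim=1))
    cos_gt = cos.gather(1, labels.unsqueeze(1)).squeeze(1).clamp(-1.0, 1.0)
    return torch.rad2deg(torch.acos(cos_gt)).mean().item()
```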

BibTeX

@inproceedings{ruize2023mmcosine,
  title={MMCosine: Multi-Modal Cosine Loss Towards Balanced Audio-Visual Fine-Grained Learning},
  author={Xu, Ruize and Feng, Ruoxuan and Zhang, Shi-xiong and Hu, Di},
  booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2023},
  organization={IEEE}
}