Enhancing Multi-modal Cooperation via Fine-grained Modality Valuation

Quantifying and Enhancing Multi-modal Robustness with Modality Preference

SphereDiffusion: Spherical Geometry-aware Distortion Resilient Diffusion Model

Prompting Segmentation with Sound is Generalizable Audio-Visual Source Localizer

Geometric-Inspired Graph-based Incomplete Multi-view Clustering

Kinematic-aware Prompting for Generalizable Articulated Object Manipulation with LLMs

TikTalk: A Video-Based Dialogue Dataset for Multi-Modal Chitchat in Real World

Progressive Spatio-temporal Perception for Audio-Visual Question Answering

Towards Inadequately Pre-trained Models in Transfer Learning

Balanced Audiovisual Dataset for Imbalance Analysis

Towards Long Form Audio-visual Video Understanding

Multi-Scale Attention for Audio Question Answering

Supervised Knowledge May Hurt Novel Class Discovery Performance

Robust Cross-modal Knowledge Distillation for Unconstrained Videos

Revisiting Pre-training in Audio-Visual Learning

MMCosine: Multi-Modal Cosine Loss Towards Balanced Audio-Visual Fine-Grained Learning

Self-supervised Audiovisual Representation Learning for Remote Sensing Data

SeCo: Separating Unknown Musical Visual Sounds with Consistency Guidance

Exploiting Visual Context Semantics for Sound Source Localization

Learning in Audio-visual Context: A Review, Analysis, and New Perspective

Self-supervised Learning for Heterogeneous Audiovisual Scene Analysis

Learning to Answer Questions in Dynamic Audio-Visual Scenarios

Balanced Multimodal Learning via On-the-fly Gradient Modulation

Not All Knowledge Is Created Equal

Visual Sound Localization in-the-Wild by Cross-Modal Interference Erasing

SepFusion: Finding Optimal Fusion Structures for Visual Sound Separation

Class-aware Sounding Objects Localization via Audiovisual Correspondence

Cyclic Co-Learning of Sounding Object Visual Grounding and Sound Separation

Unsupervised Multi-Source Domain Adaptation for Person Re-Identification

Towards Accurate Knowledge Transfer via Target-awareness Representation Disentanglement

Generalising Combinatorial Discriminant Analysis through Conditioning Truncated Rayleigh Flow

Temporal Relational Modeling with Self-Supervision for Action Segmentation

Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching

Multiple Sound Sources Localization from Coarse to Fine

Cross-Task Transfer for Geotagged Audiovisual Aerial Scene Recognition

Heterogeneous Scene Analysis via Self-supervised Audiovisual Learning

Does Ambient Sound Help? - Audiovisual Crowd Counting

Co-Learn Sounding Object Visual Grounding and Visually Indicated Sound Separation in A Cycle

Ambient Sound Helps: Audiovisual Crowd Counting in Extreme Conditions

A Two-Stage Framework for Multiple Sound-Source Localization

Curriculum Audiovisual Learning

Deep Linear Discriminant Analysis Hashing

Discrete Spectral Hashing for Efficient Similarity Retrieval

Deep Multimodal Clustering for Unsupervised Audiovisual Learning Representation

Listen to the Image

Deep Binary Reconstruction for Cross-modal Hashing

Dense Multimodal Fusion for Hierarchically Joint Representation

Large Graph Hashing with Spectral Rotation

Image2song: Song Retrieval via Bridging Image Content and Lyric Words

Multimodal Learning via Exploring Deep Semantic Similarity

Temporal Multimodal Learning in Audiovisual Speech Recognition