Publications

(* denotes equal contribution; a red title denotes an oral paper)

On-the-fly Modulation for Balanced Multimodal Learning

Play to the Score: Stage-Guided Dynamic Multi-Sensory Fusion for Robotic Manipulation

KOI: Accelerating Online Imitation Learning via Hybrid Key-state Guidance

Boosting Audio Visual Question Answering via Key Semantic-Aware Cues

Unveiling and Mitigating Bias in Audio Visual Segmentation (ACM MM Oral)

Diagnosing and Re-learning for Balanced Multimodal Learning

Stepping Stones: A Progressive Training Strategy for Audio-Visual Semantic Segmentation

Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes

Can Textual Semantics Mitigate Sounding Object Segmentation Preference?

Depth Helps: Improving Pre-trained RGB-based Policy with Depth Information Injection

MMPareto: Innocent Uni-modal Assistance for Enhanced Multi-modal Learning

Enhancing Multi-modal Cooperation via Fine-grained Modality Valuation

Quantifying and Enhancing Multi-modal Robustness with Modality Preference

SphereDiffusion: Spherical Geometry-aware Distortion Resilient Diffusion Model

Prompting Segmentation with Sound is Generalizable Audio-Visual Source Localizer

Geometric-Inspired Graph-based Incomplete Multi-view Clustering

Kinematic-aware Prompting for Generalizable Articulated Object Manipulation with LLMs

TikTalk: A Video-Based Dialogue Dataset for Multi-Modal Chitchat in Real World

Progressive Spatio-temporal Perception for Audio-Visual Question Answering

Towards Inadequately Pre-trained Models in Transfer Learning

Balanced Audiovisual Dataset for Imbalance Analysis

Towards Long Form Audio-visual Video Understanding

Multi-Scale Attention for Audio Question Answering

Supervised Knowledge May Hurt Novel Class Discovery Performance

Robust Cross-modal Knowledge Distillation for Unconstrained Videos

Revisiting Pre-training in Audio-Visual Learning

MMCosine: Multi-Modal Cosine Loss Towards Balanced Audio-Visual Fine-Grained Learning

Self-supervised Audiovisual Representation Learning for Remote Sensing Data

SeCo: Separating Unknown Musical Visual Sounds with Consistency Guidance

Exploiting Visual Context Semantics for Sound Source Localization

Learning in Audio-visual Context: A Review, Analysis, and New Perspective

Self-supervised Learning for Heterogeneous Audiovisual Scene Analysis

Learning to Answer Questions in Dynamic Audio-Visual Scenarios

Balanced Multimodal Learning via On-the-fly Gradient Modulation

Not All Knowledge Is Created Equal

Visual Sound Localization in-the-Wild by Cross-Modal Interference Erasing

SepFusion: Finding Optimal Fusion Structures for Visual Sound Separation

Class-aware Sounding Objects Localization via Audiovisual Correspondence

Cyclic Co-Learning of Sounding Object Visual Grounding and Sound Separation

Unsupervised Multi-Source Domain Adaptation for Person Re-Identification

Towards Accurate Knowledge Transfer via Target-awareness Representation Disentanglement

Generalising Combinatorial Discriminant Analysis through Conditioning Truncated Rayleigh Flow

Temporal Relational Modeling with Self-Supervision for Action Segmentation

Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching

Multiple Sound Sources Localization from Coarse to Fine

Cross-Task Transfer for Geotagged Audiovisual Aerial Scene Recognition

Heterogeneous Scene Analysis via Self-supervised Audiovisual Learning

Does Ambient Sound Help? - Audiovisual Crowd Counting

Co-Learn Sounding Object Visual Grounding and Visually Indicated Sound Separation in A Cycle

Ambient Sound Helps: Audiovisual Crowd Counting in Extreme Conditions

A Two-Stage Framework for Multiple Sound-Source Localization

Curriculum Audiovisual Learning

Deep Linear Discriminant Analysis Hashing

Discrete Spectral Hashing for Efficient Similarity Retrieval

Deep Multimodal Clustering for Unsupervised Audiovisual Learning Representation

Listen to the Image

Deep Binary Reconstruction for Cross-modal Hashing

Dense Multimodal Fusion for Hierarchically Joint Representation

Large Graph Hashing with Spectral Rotation

Deep Binary Reconstruction for Cross-modal Hashing

Image2song: Song Retrieval via Bridging Image Content and Lyric Words

Temporal Multimodal Learning in Audiovisual Speech Recognition