Overview

This is a curated list of audio-visual learning methods and datasets, based on our survey: <Learning in Audio-visual Context: A Review, Analysis, and New Perspective>. The list will continue to be updated; please feel free to nominate good related works via pull requests!

[Website of Our Survey], [arXiv]

Table of contents

Audio-visual Boosting

Audio-visual Recognition

Speech Recognition

[Applied Intelligence-2015] Audio-visual Speech Recognition Using Deep Learning
Authors: Kuniaki Noda, Yuki Yamaguchi, Kazuhiro Nakadai, Hiroshi G. Okuno, Tetsuya Ogata
Institution: Waseda University; Kyoto University; Honda Research Institute Japan Co., Ltd.

[CVPR-2016] Temporal Multimodal Learning in Audiovisual Speech Recognition
Authors: Di Hu, Xuelong Li, Xiaoqiang Lu
Institution: Northwestern Polytechnical University; Chinese Academy of Sciences

[AVSP-2017] End-To-End Audiovisual Fusion With LSTMs
Authors: Stavros Petridis, Yujiang Wang, Zuwei Li, Maja Pantic
Institution: Imperial College London; University of Twente

[IEEE TPAMI-2018] Deep Audio-visual Speech Recognition
Authors: Triantafyllos Afouras, Joon Son Chung, Andrew Senior, Oriol Vinyals, Andrew Zisserman
Institution: University of Oxford; Google Inc.

[2019] Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection
Authors: Guangxiang Zhao, Junyang Lin, Zhiyuan Zhang, Xuancheng Ren, Qi Su, Xu Sun
Institution: Peking University

[IEEE TNNLS-2022] Multimodal Sparse Transformer Network for Audio-visual Speech Recognition
Authors: Qiya Song, Bin Sun, Shutao Li
Institution: Hunan University

[Interspeech-2022] Robust Self-Supervised Audio-visual Speech Recognition
Authors: Bowen Shi, Wei-Ning Hsu, Abdelrahman Mohamed
Institution: Toyota Technological Institute at Chicago; Meta AI

[2022] Bayesian Neural Network Language Modeling for Speech Recognition
Authors: Boyang Xue, Shoukang Hu, Junhao Xu, Mengzhe Geng, Xunying Liu, Helen Meng
Institution: The Chinese University of Hong Kong

[Interspeech-2022] Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition
Authors: Joanna Hong, Minsu Kim, Daehun Yoo, Yong Man Ro
Institution: Korea Advanced Institute of Science and Technology; Genesis Lab Inc.

[MLSP-2022] Rethinking Audio-visual Synchronization for Active Speaker Detection
Authors: Abudukelimu Wuerkaixi, You Zhang, Zhiyao Duan, Changshui Zhang
Institution: Tsinghua University; Beijing National Research Center for Information Science and Technology; University of Rochester

[NeurIPS-2022] A Single Self-Supervised Model for Many Speech Modalities Enables Zero-Shot Modality Transfer
Authors: Wei-Ning Hsu, Bowen Shi
Institution: Toyota Technological Institute at Chicago

[ITOEC-2022] FSMS: An Enhanced Polynomial Sampling Fusion Method for Audio-Visual Speech Recognition
Authors: Chenghan Li, Yuxin Zhang, Huaichang Du
Institution: Communication University of China

[IJCNN-2022] Continuous Phoneme Recognition based on Audio-Visual Modality Fusion
Authors: Julius Richter, Jeanine Liebold, Timo Gerkmann
Institution: Universität Hamburg

[ICIP-2022] Learning Contextually Fused Audio-Visual Representations For Audio-Visual Speech Recognition
Authors: Zi-Qiang Zhang, Jie Zhang, Jian-Shu Zhang, Ming-Hui Wu, Xin Fang, Li-Rong Dai
Institution: University of Science and Technology of China; Chinese Academy of Sciences; iFLYTEK Co., Ltd.

[ICASSP-2023] Self-Supervised Audio-Visual Speech Representations Learning By Multimodal Self-Distillation
Authors: Jing-Xuan Zhang, Genshun Wan, Zhen-Hua Ling, Jia Pan, Jianqing Gao, Cong Liu
Institution: University of Science and Technology of China; iFLYTEK Co. Ltd.

[CVPR-2022] Improving Multimodal Speech Recognition by Data Augmentation and Speech Representations
Authors: Dan Oneaţă, Horia Cucu
Institution: University POLITEHNICA of Bucharest

[AAAI-2022] Distinguishing Homophenes Using Multi-Head Visual-Audio Memory for Lip Reading
Authors: Minsu Kim, Jeong Hun Yeo, Yong Man Ro
Institution: Korea Advanced Institute of Science and Technology

[AAAI-2023] Leveraging Modality-specific Representations for Audio-visual Speech Recognition via Reinforcement Learning
Authors: Chen Chen, Yuchen Hu, Qiang Zhang, Heqing Zou, Beier Zhu, Eng Siong Chng
Institution: Nanyang Technological University; ZJU-Hangzhou Global Scientific and Technological Innovation Center; Zhejiang University

[WACV-2023] Audio-Visual Efficient Conformer for Robust Speech Recognition
Authors: Maxime Burchi, Radu Timofte
Institution: University of Würzburg

[2023] Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition
Authors: Minsu Kim, Hyung-Il Kim, Yong Man Ro
Institution: Korea Advanced Institute of Science and Technology; Electronics and Telecommunications Research Institute

[2023] Multimodal Speech Recognition for Language-Guided Embodied Agents
Authors: Allen Chang, Xiaoyuan Zhu, Aarav Monga, Seoho Ahn, Tejas Srinivasan, Jesse Thomason
Institution: University of Southern California

[2023] MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation
Authors: Mohamed Anwar, Bowen Shi, Vedanuj Goswami, Wei-Ning Hsu, Juan Pino, Changhan Wang
Institution: Meta AI

[ICASSP-2023] The NPU-ASLP System for Audio-Visual Speech Recognition in MISP 2022 Challenge
Authors: Pengcheng Guo, He Wang, Bingshen Mu, Ao Zhang, Peikun Chen
Institution: Northwestern Polytechnical University

[CVPR-2023] Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring
Authors: Joanna Hong, Minsu Kim, Jeongsoo Choi, Yong Man Ro
Institution: Korea Advanced Institute of Science and Technology

[ICASSP-2023] Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels
Authors: Pingchuan Ma, Alexandros Haliassos, Adriana Fernandez-Lopez, Honglie Chen, Stavros Petridis, Maja Pantic
Institution: Imperial College London; Meta AI

[CVPR-2023] AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR
Authors: Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid
Institution: Google Research

[CVPR-2023] SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision
Authors: Xubo Liu, Egor Lakomkin, Konstantinos Vougioukas, Pingchuan Ma, Honglie Chen, Ruiming Xie, Morrie Doulaty, Niko Moritz, Jáchym Kolář, Stavros Petridis, Maja Pantic, Christian Fuegen
Institution: University of Surrey; Meta AI

[ICASSP-2023] Multi-Temporal Lip-Audio Memory for Visual Speech Recognition
Authors: Jeong Hun Yeo, Minsu Kim, Yong Man Ro
Institution: Korea Advanced Institute of Science and Technology

[ICASSP-2023] On the Role of Lip Articulation in Visual Speech Perception
Authors: Zakaria Aldeneh, Masha Fedzechkina, Skyler Seto, Katherine Metcalf, Miguel Sarabia, Nicholas Apostoloff, Barry-John Theobald
Institution: Apple Inc.

[ICASSP-2023] Practice of the Conformer Enhanced Audio-Visual Hubert on Mandarin and English
Authors: Xiaoming Ren, Chao Li, Shenjian Wang, Biao Li
Institution: Beijing OPPO Telecommunications Corp., Ltd.

[ICASSP-2023] Robust Audio-Visual ASR with Unified Cross-Modal Attention
Authors: Jiahong Li, Chenda Li, Yifei Wu, Yanmin Qian
Institution: Shanghai Jiao Tong University

[IJCAI-2023] Cross-Modal Global Interaction and Local Alignment for Audio-Visual Speech Recognition
Authors: Yuchen Hu, Ruizhe Li, Chen Chen, Heqing Zou, Qiushi Zhu, Eng Siong Chng
Institution: Nanyang Technological University; University of Aberdeen; University of Science and Technology of China

[Interspeech-2023] Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization
Authors: Puyuan Peng, Brian Yan, Shinji Watanabe, David Harwath
Institution: The University of Texas at Austin; Carnegie Mellon University

[Interspeech-2023] Improving the Gap in Visual Speech Recognition Between Normal and Silent Speech Based on Metric Learning
Authors: Sara Kashiwagi, Keitaro Tanaka, Qi Feng, Shigeo Morishima
Institution: Waseda University; Waseda Research Institute for Science and Engineering

[ACL-2023] AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation
Authors: Rongjie Huang, Huadai Liu, Xize Cheng, Yi Ren, Linjun Li, Zhenhui Ye, Jinzheng He, Lichao Zhang, Jinglin Liu, Xiang Yin, Zhou Zhao
Institution: Zhejiang University; ByteDance

[ACL-2023] Hearing Lips in Noise: Universal Viseme-Phoneme Mapping and Transfer for Robust Audio-Visual Speech Recognition
Authors: Yuchen Hu, Ruizhe Li, Chen Chen, Chengwei Qin, Qiushi Zhu, Eng Siong Chng
Institution: Nanyang Technological University; University of Aberdeen; University of Science and Technology of China

[ACL-2023] MIR-GAN: Refining Frame-Level Modality-Invariant Representations with Adversarial Network for Audio-Visual Speech Recognition
Authors: Yuchen Hu, Chen Chen, Ruizhe Li, Heqing Zou, Eng Siong Chng
Institution: Nanyang Technological University; University of Aberdeen

[IJCNN-2023] Exploiting Deep Learning for Sentence-Level Lipreading
Authors: Isabella Wu, Xin Wang
Institution: Choate Rosemary Hall; Stony Brook University

[IJCNN-2023] GLSI Texture Descriptor Based on Complex Networks for Music Genre Classification
Authors: Andrés Eduardo Coca Salazar
Institution: Federal University of Technology - Paraná

[ICME-2023] Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder
Authors: Yusheng Dai, Hang Chen, Jun Du, Xiaofei Ding, Ning Ding, Feijun Jiang, Chin-Hui Lee
Institution: University of Science and Technology of China; Alibaba Group; Georgia Institute of Technology

[ICME-2023] Multi-Scale Hybrid Fusion Network for Mandarin Audio-Visual Speech Recognition
Authors: Jinxin Wang, Zhongwen Guo, Chao Yang, Xiaomei Li, Ziyuan Cui
Institution: Ocean University of China; University of Technology Sydney

[AAAI-2024] Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation
Authors: Qiushi Zhu, Jie Zhang, Yu Gu, Yuchen Hu, Lirong Dai
Institution: NERC-SLIP, University of Science and Technology of China (USTC), Hefei, China; Tencent AI LAB; Nanyang Technological University, Singapore

[CVPR-2024] A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition
Authors: Yusheng Dai, Hang Chen, Jun Du, Ruoyu Wang, Shihao Chen, Jiefeng Ma, Haotian Wang, Chin-Hui Lee
Institution: University of Science and Technology of China, Hefei, China; Georgia Institute of Technology, Atlanta, America

[IJCNN-2024] Target Speech Extraction with Pre-trained AV-HuBERT and Mask-And-Recover Strategy
Authors: Wenxuan Wu, Xueyuan Chen, Xixin Wu, Haizhou Li, Helen Meng
Institution: The Chinese University of Hong Kong

[Interspeech-2024] LipGER: Visually-Conditioned Generative Error Correction for Robust Automatic Speech Recognition
Authors: Sreyan Ghosh, Sonal Kumar, Ashish Seth, Purva Chiniya, Utkarsh Tyagi, Ramani Duraiswami, Dinesh Manocha
Institution: University of Maryland, College Park, USA

[Interspeech-2024] Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation
Authors: Andrew Rouditchenko, Yuan Gong, Samuel Thomas, Leonid Karlinsky, Hilde Kuehne, Rogerio Feris, James Glass
Institution: MIT, USA; IBM Research AI, USA; MIT-IBM Watson AI Lab, USA; University of Bonn, Germany

Speaker Recognition

[MTA-2016] Audio-visual Speaker Diarization Using Fisher Linear Semi-discriminant Analysis
Authors: Nikolaos Sarafianos, Theodoros Giannakopoulos, Sergios Petridis
Institution: National Center for Scientific Research “Demokritos”

[ICASSP-2018] Audio-visual Person Recognition in Multimedia Data From the IARPA Janus Program
Authors: Gregory Sell, Kevin Duh, David Snyder, Dave Etter, Daniel Garcia-Romero
Institution: The Johns Hopkins University

[ICASSP-2019] Noise-tolerant Audio-visual Online Person Verification Using an Attention-based Neural Network Fusion
Authors: Suwon Shon, Tae-Hyun Oh, James Glass
Institution: MIT Computer Science and Artificial Intelligence Laboratory, Cambridge

[Interspeech-2019] Who Said That?: Audio-visual Speaker Diarisation Of Real-World Meetings
Authors: Joon Son Chung, Bong-Jin Lee, Icksang Han
Institution: Naver Corporation

[ICASSP-2020] Self-Supervised Learning for Audio-visual Speaker Diarization
Authors: Yifan Ding, Yong Xu, Shi-Xiong Zhang, Yahuan Cong, Liqiang Wang
Institution: University of Central Florida; Tencent AI Lab; Beijing University of Posts and Telecommunications

[ICASSP-2021] A Multi-View Approach to Audio-visual Speaker Verification
Authors: Leda Sari, Kritika Singh, Jiatong Zhou, Lorenzo Torresani, Nayan Singhal, Yatharth Saraf
Institution: University of Illinois at Urbana-Champaign; Facebook AI Research

[IEEE/ACM TASLP-2021] Audio-visual Deep Neural Network for Robust Person Verification
Authors: Yanmin Qian, Zhengyang Chen, Shuai Wang
Institution: Shanghai Jiao Tong University

[ICDIP-2022] End-To-End Audiovisual Feature Fusion for Active Speaker Detection
Authors: Fiseha B. Tesema, Zheyuan Lin, Shiqiang Zhu, Wei Song, Jason Gu, Hong Wu
Institution: Interdisciplinary Innovation Research Institute, Zhejiang Lab; Dalhousie University; University of Electronic Science and Technology of China; Zhejiang University

[EUVIP-2022] Active Speaker Recognition using Cross Attention Audio-Video Fusion
Authors: Bogdan Mocanu, Tapu Ruxandra
Institution: University “Politehnica” of Bucharest; Télécom SudParis

[2022] Audio-Visual Activity Guided Cross-Modal Identity Association for Active Speaker Detection
Authors: Rahul Sharma, Shrikanth Narayanan
Institution: University of Southern California

[SLT-2023] Push-Pull: Characterizing the Adversarial Robustness for Audio-Visual Active Speaker Detection
Authors: Xuanjun Chen, Haibin Wu, Helen Meng, Hung-yi Lee, Jyh-Shing Roger Jang
Institution: National Taiwan University; The Chinese University of Hong Kong

[ICAI-2023] Speaker Recognition in Realistic Scenario Using Multimodal Data
Authors: Saqlain Hussain Shah, Muhammad Saad Saeed, Shah Nawaz, Muhammad Haroon Yousaf
Institution: University of Engineering and Technology Taxila; Swarm Robotics Lab NCRA; Deutsches Elektronen-Synchrotron DESY

[CVPR-2023] A Light Weight Model for Active Speaker Detection
Authors: Junhua Liao, Haihan Duan, Kanghui Feng, Wanbing Zhao, Yanbing Yang, Liangyin Chen
Institution: Sichuan University; The Chinese University of Hong Kong

[ICASSP-2023] The Multimodal Information based Speech Processing (MISP) 2022 Challenge: Audio-Visual Diarization and Recognition
Authors: Zhe Wang, Shilong Wu, Hang Chen, Mao-Kui He, Jun Du, Chin-Hui Lee, Jingdong Chen, Shinji Watanabe, Sabato Siniscalchi, Odette Scharenborg, Diyuan Liu, Baocai Yin, Jia Pan, Jianqing Gao, Cong Liu
Institution: University of Science and Technology of China; Georgia Institute of Technology; Carnegie Mellon University; Kore University of Enna; iFlytek; Northwestern Polytechnical University; Delft University of Technology

[ICASSP-2023] ImagineNet: Target Speaker Extraction with Intermittent Visual Cue Through Embedding Inpainting
Authors: Zexu Pan, Wupeng Wang, Marvin Borsdorf, Haizhou Li
Institution: National University of Singapore; University of Bremen; The Chinese University of Hong Kong

[ICASSP-2023] Speaker Recognition with Two-Step Multi-Modal Deep Cleansing
Authors: Ruijie Tao, Kong Aik Lee, Zhan Shi, Haizhou Li
Institution: National University of Singapore; A*STAR; The Chinese University of Hong Kong; University of Bremen; Shenzhen Research Institute of Big Data

[ICASSP-2023] Audio-Visual Speaker Diarization in the Framework of Multi-User Human-Robot Interaction
Authors: Timothée Dhaussy, Bassam Jabaian, Fabrice Lefèvre, Radu Horaud
Institution: Avignon University; Université Grenoble Alpes

[ICASSP-2023] Cross-Modal Audio-Visual Co-Learning for Text-Independent Speaker Verification
Authors: Meng Liu, Kong Aik Lee, Longbiao Wang, Hanyi Zhang, Chang Zeng, Jianwu Dang
Institution: Tianjin University; A*STAR; Singapore Institute of Technology; National Institute of Informatics

[ICASSP-2023] Multi-Speaker End-to-End Multi-Modal Speaker Diarization System for the MISP 2022 Challenge
Authors: Tao Liu, Zhengyang Chen, Yanmin Qian, Kai Yu
Institution: Shanghai Jiao Tong University

[ICASSP-2023] AV-SepFormer: Cross-Attention SepFormer for Audio-Visual Target Speaker Extraction
Authors: Jiuxin Lin, Xinyu Cai, Heinrich Dinkel, Jun Chen, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Zhiyong Wu, Yujun Wang, Helen Meng
Institution: Tsinghua University; Xiaomi Inc.; The Chinese University of Hong Kong

[ICASSP-2023] The WHU-Alibaba Audio-Visual Speaker Diarization System for the MISP 2022 Challenge
Authors: Ming Cheng, Haoxu Wang, Ziteng Wang, Qiang Fu, Ming Li
Institution: Wuhan University; Duke Kunshan University; Alibaba Group

[ICASSP-2023] Self-Supervised Audio-Visual Speaker Representation with Co-Meta Learning
Authors: Hui Chen, Hanyi Zhang, Longbiao Wang, Kong Aik Lee, Meng Liu, Jianwu Dang
Institution: Tianjin University; A*STAR

[Interspeech-2023] Target Active Speaker Detection with Audio-visual Cues
Authors: Yidi Jiang, Ruijie Tao, Zexu Pan, Haizhou Li
Institution: National University of Singapore; The Chinese University of Hong Kong

[Interspeech-2023] CN-Celeb-AV: A Multi-Genre Audio-Visual Dataset for Person Recognition
Authors: Lantian Li, Xiaolou Li, Haoyu Jiang, Chen Chen, Ruihai Hou, Dong Wang
Institution: Tsinghua University; Beijing University of Posts and Telecommunications

[Interspeech-2023] Rethinking the visual cues in audio-visual speaker extraction
Authors: Junjie Li, Meng Ge, Zexu Pan, Rui Cao, Longbiao Wang, Jianwu Dang, Shiliang Zhang
Institution: Tianjin University; National University of Singapore; Shenzhen Research Institute of Big Data

[ACL-2023] OpenSR: Open-Modality Speech Recognition via Maintaining Multi-Modality Alignment
Authors: Xize Cheng, Tao Jin, Linjun Li, Wang Lin, Xinyu Duan, Zhou Zhao
Institution: Zhejiang University; Huawei Cloud

[Interspeech-2023] PIAVE: A Pose-Invariant Audio-Visual Speaker Extraction Network
Authors: Qinghua Liu, Meng Ge, Zhizheng Wu, Haizhou Li
Institution: Shenzhen Research Institute of Big Data; The Chinese University of Hong Kong; National University of Singapore

[IEEE/ACM TASLP-2023] A Dynamic Convolution Framework for Session-Independent Speaker Embedding Learning
Authors: Bin Gu, Jie Zhang, Wu Guo
Institution: University of Science and Technology of China

[IEEE/ACM TASLP-2024] Self-Supervised Learning With Cluster-Aware-DINO for High-Performance Robust Speaker Verification
Authors: Bing Han, Zhengyang Chen, Yanmin Qian
Institution: Shanghai Jiao Tong University

[ICASSP-2024] Look, Listen and Recognise: Character-Aware Audio-Visual Subtitling
Authors: Bruno Korbar, Jaesung Huh, Andrew Zisserman
Institution: Visual Geometry Group, Department of Engineering Science, University of Oxford, UK

[FG-2024] Dynamic Cross Attention for Audio-Visual Person Verification
Authors: R. Gnana Praveen, Jahangir Alam
Institution: Computer Research Institute of Montreal (CRIM), Montreal, Canada

Action Recognition

[IJCNN-2016] Exploring Multimodal Video Representation For Action Recognition
Authors: Cheng Wang, Haojin Yang, Christoph Meinel
Institution: University of Potsdam

[CVPR-2018] The ActivityNet Large-Scale Activity Recognition Challenge 2018 Summary
Authors: Bernard Ghanem, Juan Carlos Niebles, Cees Snoek, Fabian Caba Heilbron, Humam Alwassel, Victor Escorcia, Ranjay Krishna, Shyamal Buch, Cuong Duc Dao
Institution: King Abdullah University of Science and Technology; Stanford University; Universidad del Norte; Universiteit van Amsterdam

[ICCV-2019] EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition
Authors: Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, Dima Damen
Institution: University of Bristol; University of Oxford

[ICCV-2019] SCSampler: Sampling Salient Clips From Video for Efficient Action Recognition
Authors: Bruno Korbar, Du Tran, Lorenzo Torresani
Institution: Facebook AI Research

[ICCV-2019] Uncertainty-Aware Audiovisual Activity Recognition Using Deep Bayesian Variational Inference
Authors: Mahesh Subedar, Ranganath Krishnan, Paulo Lopez Meyer, Omesh Tickoo, Jonathan Huang
Institution: Intel Labs

[CVPR-2020] Listen to Look: Action Recognition by Previewing Audio
Authors: Ruohan Gao, Tae-Hyun Oh, Kristen Grauman, Lorenzo Torresani
Institution: The University of Texas at Austin; Facebook AI Research

[2020] Audiovisual SlowFast Networks for Video Recognition
Authors: Fanyi Xiao, Yong Jae Lee, Kristen Grauman, Jitendra Malik, Christoph Feichtenhofer
Institution: University of California; Facebook AI Research

[ICCV-2021] AdaMML: Adaptive Multi-Modal Learning for Efficient Video Recognition
Authors: Rameswar Panda, Chun-Fu (Richard) Chen, Quanfu Fan, Ximeng Sun, Kate Saenko, Aude Oliva, Rogerio Feris
Institution: MIT-IBM Watson AI Lab; Boston University; Massachusetts Institute of Technology

[2021] Cross-Domain First Person Audio-Visual Action Recognition through Relative Norm Alignment
Authors: Mirco Planamente, Chiara Plizzari, Emanuele Alberti, Barbara Caputo
Institution: Politecnico di Torino; Istituto Italiano di Tecnologia

[WACV-2022] Domain Generalization Through Audio-Visual Relative Norm Alignment in First Person Action Recognition
Authors: Mirco Planamente, Chiara Plizzari, Emanuele Alberti, Barbara Caputo
Institution: Politecnico di Torino; Istituto Italiano di Tecnologia; CINI Consortium

[CVPR-2022] Audio-Adaptive Activity Recognition Across Video Domains
Authors: Yunhua Zhang, Hazel Doughty, Ling Shao, Cees G. M. Snoek
Institution: University of Amsterdam; Inception Institute of Artificial Intelligence

[WACV-2022] MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition
Authors: Jiawei Chen, Chiu Man Ho
Institution: OPPO US Research Center

[CVPR-2022] Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos
Authors: Saghir Alfasly, Jian Lu, Chen Xu, Yuru Zou
Institution: Shenzhen University; Guangdong Key Laboratory of Intelligent Information Processing; Pazhou Lab

[2022] Noise-Tolerant Learning for Audio-Visual Action Recognition
Authors: Haochen Han, Qinghua Zheng, Minnan Luo, Kaiyao Miao, Feng Tian, Yan Chen
Institution: Xi’an Jiaotong University, the Shanxi Provincial Key Laboratory of Institute of Multimedia Knowledge Fusion and Engineering; the Ministry of Education Key Laboratory for Intelligent Networks and Network Security

[ICLR-2023] Exploring Temporally Dynamic Data Augmentation for Video Recognition
Authors: Taeoh Kim, Jinhyung Kim, Minho Shim, Sangdoo Yun, Myunggu Kang, Dongyoon Wee, Sangyoun Lee
Institution: NAVER Clova; Korea Advanced Institute of Science and Technology; NAVER AI Lab; Yonsei University

[ICASSP-2023] Epic-Sounds: A Large-scale Dataset of Actions That Sound
Authors: Jaesung Huh, Jacob Chalk, Evangelos Kazakos, Dima Damen, Andrew Zisserman
Institution: University of Oxford; University of Bristol

[ICASSP-2023] AV-TAD: Audio-Visual Temporal Action Detection With Transformer
Authors: Yangcheng Li, Zefang Yu, Suncheng Xiang, Ting Liu, Yuzhuo Fu
Institution: Shanghai Jiao Tong University

[ICCV-2023] Audio-Visual Glance Network for Efficient Video Recognition
Authors: Muhammad Adi Nugroho, Sangmin Woo, Sumin Lee, Changick Kim
Institution: Korea Advanced Institute of Science and Technology

[IEEE TMM-2023] Audio-Visual Contrastive and Consistency Learning for Semi-Supervised Action Recognition
Authors: Maregu Assefa, Wei Jiang, Jinyu Zhan, Kumie Gedamu, Getinet Yilma, Melese Ayalew, Deepak Adhikari
Institution: University of Electronic Science and Technology of China; Sichuan Artificial Intelligence Research Institute; Adama Science and Technology University

[CVPR-2024] TIM: A Time Interval Machine for Audio-Visual Action Recognition
Authors: Jacob Chalk, Jaesung Huh, Evangelos Kazakos, Andrew Zisserman, Dima Damen
Institution: University of Bristol; VGG, University of Oxford; Czech Technical University in Prague

Emotion Recognition

[EMNLP-2017] Tensor Fusion Network for Multimodal Sentiment Analysis
Authors: Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, Louis-Philippe Morency
Institution: Carnegie Mellon University; Nanyang Technological University

[AAAI-2018] Multi-attention Recurrent Network for Human Communication Comprehension
Authors: Amir Zadeh, Paul Pu Liang, Soujanya Poria, Prateek Vij, Erik Cambria, Louis-Philippe Morency
Institution: Carnegie Mellon University; Nanyang Technological University

[AAAI-2018] Memory Fusion Network for Multi-view Sequential Learning
Authors: Amir Zadeh, Paul Pu Liang, Navonil Mazumder, Soujanya Poria, Erik Cambria, Louis-Philippe Morency
Institution: Carnegie Mellon University; Instituto Politécnico Nacional; Nanyang Technological University

[NAACL-2018] Conversational Memory Network for Emotion Recognition in Dyadic Dialogue Videos
Authors: Devamanyu Hazarika, Soujanya Poria, Amir Zadeh, Erik Cambria, Louis-Philippe Morency, Roger Zimmermann
Institution: National University of Singapore

[EMNLP-2018] Contextual Inter-modal Attention for Multi-modal Sentiment Analysis
Authors: Deepanway Ghosal, Md Shad Akhtar, Dushyant Chauhan, Soujanya Poria, Asif Ekbal, Pushpak Bhattacharyya
Institution: Indian Institute of Technology Patna; Nanyang Technological University

[ACL-2019] Multi-Modal Sarcasm Detection in Twitter with Hierarchical Fusion Model
Authors: Yitao Cai, Huiyu Cai, Xiaojun Wan
Institution: Peking University

[ACL-2020] Sentiment and Emotion help Sarcasm? A Multi-task Learning Framework for Multi-Modal Sarcasm, Sentiment and Emotion Analysis
Authors: Dushyant Singh Chauhan, Dhanush S R, Asif Ekbal, Pushpak Bhattacharyya
Institution: Indian Institute of Technology Patna

[ACL-2020] A Transformer-based joint-encoding for Emotion Recognition and Sentiment Analysis
Authors: Jean-Benoit Delbrouck, Noe Tits, Mathilde Brousmiche, Stephane Dupont
Institution: University of Mons

[ACL-2020] Multilogue-Net: A Context Aware RNN for Multi-modal Emotion Detection and Sentiment Analysis in Conversation
Authors: Aman Shenoy, Ashish Sardana
Institution: Birla Institute of Technology and Science, Pilani; NVIDIA Graphics

[CVPR-2021] Progressive Modality Reinforcement for Human Multimodal Emotion Recognition From Unaligned Multimodal Sequences
Authors: Fengmao Lv, Xiang Chen, Yanyong Huang, Lixin Duan, Guosheng Lin
Institution: Southwest Jiaotong University; Southwestern University of Finance and Economics; Tencent; University of Electronic Science and Technology of China; Nanyang Technological University

[IEEE TAFFC-2021] Multi-modal Sarcasm Detection and Humor Classification in Code-mixed Conversations
Authors: Manjot Bedi, Shivani Kumar, Md Shad Akhtar, Tanmoy Chakraborty
Institution: Indraprastha Institute of Information Technology, Delhi

[IEEE SLT-2021] Detecting expressions with multimodal transformers
Authors: Srinivas Parthasarathy, Shiva Sundaram
Institution: Amazon

[CVPR-2022] M2FNet: Multi-modal Fusion Network for Emotion Recognition in Conversation
Authors: Vishal Chudasama, Purbayan Kar, Ashish Gudmalwar, Nirmesh Shah, Pankaj Wasnik, Naoyuki Onoe
Institution: Sony Research India

[CCC-2022] A Multimodal Emotion Perception Model based on Context-Aware Decision-Level Fusion
Authors: Yishan Chen, Zhiyang Jia, Kaoru Hirota, Yaping Dai
Institution: Beijing Institute of Technology; State Key Laboratory of Intelligent Control and Decision of Complex Systems

[IJCNN-2022] Sense-aware BERT and Multi-task Fine-tuning for Multimodal Sentiment Analysis
Authors: Lingyong Fang, Gongshen Liu, Ru Zhang
Institution: Shanghai Jiao Tong University; Beijing University of Posts and Telecommunications

[IEEE/ACM TASLP-2022] EmoInt-Trans: A Multimodal Transformer for Identifying Emotions and Intents in Social Conversations
Authors: Gopendra Vikram Singh, Mauajama Firdaus, Asif Ekbal, Pushpak Bhattacharyya
Institution: Indian Institute of Technology

[ICPR-2022] Self-attention fusion for audiovisual emotion recognition with incomplete data
Authors: Kateryna Chumachenko, Alexandros Iosifidis, Moncef Gabbouj
Institution: Tampere University; Aarhus University

[IEEE TAFFC-2023] Audio-Visual Emotion Recognition With Preference Learning Based on Intended and Multi-Modal Perceived Labels
Authors: Yuanyuan Lei, Houwei Cao
Institution: Texas A&M University; New York Institute of Technology

[IEEE T-BIOM-2023] Audio-Visual Fusion for Emotion Recognition in the Valence-Arousal Space Using Joint Cross-Attention
Authors: R Gnana Praveen, Patrick Cardinal, Eric Granger
Institution: École de technologie supérieure

[ICASSP-2023] Adapted Multimodal BERT with Layer-Wise Fusion for Sentiment Analysis
Authors: Odysseas S. Chlapanis, Georgios Paraskevopoulos, Alexandros Potamianos
Institution: National Technical University of Athens; Institute for Language and Speech Processing

[ICASSP-2023] Recursive Joint Attention for Audio-Visual Fusion in Regression Based Emotion Recognition
Authors: R Gnana Praveen, Eric Granger, Patrick Cardinal
Institution: École de technologie supérieure

[IEEE/ACM TASLP-2023] Exploring Semantic Relations for Social Media Sentiment Analysis
Authors: Jiandian Zeng, Jiantao Zhou, Caishi Huang
Institution: Beijing Normal University; China University of Macau; University of Macau

[CVPR-2023] Weakly Supervised Video Emotion Detection and Prediction via Cross-Modal Temporal Erasing Network
Authors: Zhicheng Zhang, Lijuan Wang, Jufeng Yang
Institution: Nankai University

[ACM MM-2023] Hierarchical Audio-Visual Information Fusion with Multi-label Joint Decoding for MER 2023
Authors: Haotian Wang, Yuxuan Xi, Hang Chen, Jun Du, Yan Song, Qing Wang, Hengshun Zhou, Chenxi Wang, Jiefeng Ma, Pengfei Hu, Ya Jiang, Shi Cheng, Jie Zhang, Yuzhe Weng
Institution: University of Science and Technology of China; Northwestern Polytechnical University

[arXiv-2024] HiCMAE: Hierarchical Contrastive Masked Autoencoder for Self-Supervised Audio-Visual Emotion Recognition
Authors: Licai Sun, Zheng Lian, Bin Liu, Jianhua Tao
Institution: School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; Institute of Automation, Chinese Academy of Sciences, Beijing, China; Department of Automation, Tsinghua University, Beijing, China; Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing, China

[IJCAI-2024] HyDiscGAN: A Hybrid Distributed cGAN for Audio-Visual Privacy Preservation in Multimodal Sentiment Analysis
Authors: Zhuojia Wu, Qi Zhang, Duoqian Miao, Kun Yi, Wei Fan, Liang Hu
Institution: Tongji University; Beijing Institute of Technology; University of Oxford; DeepBlue Academy of Sciences

[ICPR-2024] Detail-Enhanced Intra- and Inter-modal Interaction for Audio-Visual Emotion Recognition
Authors: Tong Shi, Xuri Ge, Joemon M. Jose, Nicolas Pugeault, Paul Henderson
Institution: School of Computing Science, University of Glasgow

[Interspeech-2024] AVR: Synergizing Foundation Models for Audio-Visual Humor Detection
Authors: Sarthak Sharma, Orchid Chetia Phukan, Drishti Singh, Arun Balaji Buduru, Rajesh Sharma
Institution: IIIT-Delhi, India; University of Tartu, Estonia

Uni-modal Enhancement

Speech Enhancement and Separation

[Interspeech-2018] Visual Speech Enhancement
Authors: Aviv Gabbay, Asaph Shamir, Shmuel Peleg
Institution: The Hebrew University of Jerusalem

[Interspeech-2018] The Conversation: Deep Audio-Visual Speech Enhancement
Authors: Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman
Institution: University of Oxford

[IEEE TETCI-2018] Audio-Visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks
Authors: Jen-Cheng Hou, Syu-Siang Wang, Ying-Hui Lai, Yu Tsao, Hsiu-Wen Chang, Hsin-Min Wang
Institution: Research Center for Information Technology Innovation; National Taiwan University; National Yang-Ming University; Mackay Medical College; Academia Sinica

[ICASSP-2018] Seeing Through Noise: Visually Driven Speaker Separation And Enhancement
Authors: Aviv Gabbay, Ariel Ephrat, Tavi Halperin, Shmuel Peleg
Institution: The Hebrew University of Jerusalem

[GlobalSIP-2019] Visually Assisted Time-Domain Speech Enhancement
Authors: Elham Ideli, Bruce Sharpe, Ivan V. Bajić, Rodney G. Vaughan
Institution: Simon Fraser University; SingSoftNext

[ICASSP-2019] On Training Targets and Objective Functions for Deep-learning-based Audio-visual Speech Enhancement
Authors: Daniel Michelsanti, Zheng-Hua Tan, Sigurdur Sigurdsson, Jesper Jensen
Institution: Aalborg University; Oticon A/S

[InterSpeech-2019] Multimodal SpeakerBeam: Single Channel Target Speech Extraction with Audio-Visual Speaker Clues
Authors: Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Atsunori Ogawa, Tomohiro Nakatani
Institution: Nippon Telegraph & Telephone Corporation

[Interspeech-2019] My Lips Are Concealed: Audio-Visual Speech Enhancement Through Obstructions
Authors: Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman
Institution: University of Oxford; Naver Corporation

[2020] Facefilter: Audio-Visual Speech Separation Using Still Images
Authors: Soo-Whan Chung, Soyeon Choe, Joon Son Chung, Hong-Goo Kang
Institution: Yonsei University; Naver Corporation

[ICASSP-2020] Robust Unsupervised Audio-Visual Speech Enhancement Using a Mixture of Variational Autoencoders
Authors: Mostafa Sadeghi, Xavier Alameda-Pineda
Institution: Inria Grenoble Rhone-Alpes

[CVPR-2021] Looking Into Your Speech: Learning Cross-Modal Affinity for Audio-Visual Speech Separation
Authors: Jiyoung Lee, Soo-Whan Chung, Sunok Kim, Hong-Goo Kang, Kwanghoon Sohn
Institution: Yonsei University; Naver Corporation; Korea Aerospace University

[ISCAS-2021] Audio-Visual Target Speaker Enhancement on Multi-Talker Environment using Event-Driven Cameras
Authors: Ander Arriandiaga, Giovanni Morrone, Luca Pasa, Leonardo Badino, Chiara Bartolozzi
Institution: Istituto Italiano di Tecnologia; University of Modena and Reggio Emilia

[ICASSP-2022] The Impact of Removing Head Movements on Audio-Visual Speech Enhancement
Authors: Zhiqi Kang, Mostafa Sadeghi, Radu Horaud, Xavier Alameda-Pineda, Jacob Donley, Anurag Kumar
Institution: Inria Grenoble; Université Grenoble Alpes; Inria Nancy Grand-Est; Reality Labs Research

[2022] Dual-path Attention is All You Need for Audio-Visual Speech Extraction
Authors: Zhongweiyang Xu, Xulin Fan, Mark Hasegawa-Johnson
Institution: University of Illinois at Urbana-Champaign

[ICASSP-2022] Audio-visual multi-channel speech separation, dereverberation and recognition
Authors: Guinan Li, Jianwei Yu, Jiajun Deng, Xunying Liu, Helen Meng
Institution: The Chinese University of Hong Kong; Tencent AI lab

[2022] Audio-visual speech separation based on joint feature representation with cross-modal attention
Authors: Junwen Xiong, Peng Zhang, Lei Xie, Wei Huang, Yufei Zha, Yanning Zhang
Institution: Northwestern Polytechnical University; Nanchang University

[CVPR-2022] Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis
Authors: Karren Yang, Dejan Marković, Steven Krenn, Vasu Agrawal, Alexander Richard
Institution: Massachusetts Institute of Technology; Meta Reality Labs Research

[IEEE MMSP-2022] As We Speak: Real-Time Visually Guided Speaker Separation and Localization
Authors: Piotr Czarnecki, Jakub Tkaczuk
Institution: Warsaw University of Technology

[IEEE HEALTHCOM-2022] A Novel Frame Structure for Cloud-Based Audio-Visual Speech Enhancement in Multimodal Hearing-aids
Authors: Abhijeet Bishnu, Ankit Gupta, Mandar Gogate, Kia Dashtipour, Ahsan Adeel, Amir Hussain, Mathini Sellathurai, Tharmalingam Ratnarajah
Institution: University of Edinburgh; Heriot-Watt University; Edinburgh Napier University; University of Wolverhampton

[CVPR-2022] Reading to Listen at the Cocktail Party: Multi-Modal Speech Separation
Authors: Akam Rahimi, Triantafyllos Afouras, Andrew Zisserman
Institution: University of Oxford

[WACV-2023] BirdSoundsDenoising: Deep Visual Audio Denoising for Bird Sounds
Authors: Youshan Zhang, Jialu Li
Institution: Yeshiva University; Cornell University

[SLT-2023] AVSE Challenge: Audio-Visual Speech Enhancement Challenge
Authors: Andrea Lorena Aldana Blanco, Cassia Valentini-Botinhao, Ondrej Klejch, Mandar Gogate, Kia Dashtipour, Amir Hussain, Peter Bell
Institution: University of Edinburgh

[ICLR-2023] Filter-Recovery Network for Multi-Speaker Audio-Visual Speech Separation
Authors: Haoyue Cheng, Zhaoyang Liu, Wayne Wu, Limin Wang
Institution: Nanjing University; SenseTime

[WACV-2023] Unsupervised Audio-Visual Lecture Segmentation
Authors: Darshan Singh S, Anchit Gupta, C. V. Jawahar, Makarand Tapaswi
Institution: International Institute of Information Technology, Hyderabad

[ISCSLP-2022] Multi-Task Joint Learning for Embedding Aware Audio-Visual Speech Enhancement
Authors: Chenxi Wang, Hang Chen, Jun Du, Baocai Yin, Jia Pan
Institution: University of Science and Technology of China; iFlytek

[ICASSP-2023] Real-Time Audio-Visual End-to-End Speech Enhancement
Authors: Zirun Zhu, Hemin Yang, Min Tang, Ziyi Yang, Sefik Emre Eskimez, Huaming Wang
Institution: Microsoft

[ICASSP-2023] Efficient Intelligibility Evaluation Using Keyword Spotting: A Study on Audio-Visual Speech Enhancement
Authors: Cassia Valentini-Botinhao, Andrea Lorena Aldana Blanco, Ondrej Klejch, Peter Bell
Institution: University of Edinburgh

[ICASSP-2023] Incorporating Visual Information Reconstruction into Progressive Learning for Optimizing Audio-Visual Speech Enhancement
Authors: Chenyue Zhang, Hang Chen, Jun Du, Baocai Yin, Jia Pan, Chinhui Lee
Institution: University of Science and Technology of China; iFlytek Co., Ltd.; Georgia Institute of Technology

[ICASSP-2023] Audio-Visual Speech Enhancement with a Deep Kalman Filter Generative Model
Authors: Ali Golmakani, Mostafa Sadeghi, Romain Serizel
Institution: Université de Lorraine

[ICASSP-2023] A Multi-Scale Feature Aggregation Based Lightweight Network for Audio-Visual Speech Enhancement
Authors: Haitao Xu, Liangfa Wei, Jie Zhang, Jianming Yang, Yannan Wang, Tian Gao, Xin Fang, Lirong Dai
Institution: University of Science and Technology of China; Ethereal Audio Lab; Tsinghua Shenzhen International Graduate School

[ICASSP-2023] Egocentric Audio-Visual Noise Suppression
Authors: Roshan Sharma, Weipeng He, Ju Lin, Egor Lakomkin, Yang Liu, Kaustubh Kalgaonkar
Institution: Carnegie Mellon University; Meta

[ICASSP-2023] Dual-Path Cross-Modal Attention for Better Audio-Visual Speech Extraction
Authors: Zhongweiyang Xu, Xulin Fan, Mark Hasegawa-Johnson
Institution: University of Illinois at Urbana-Champaign

[ICASSP-2023] On the Role of Visual Context in Enriching Music Representations
Authors: Kleanthis Avramidis, Shanti Stewart, Shrikanth Narayanan
Institution: University of Southern California

[ICASSP-2023] LA-VocE: Low-SNR Audio-Visual Speech Enhancement Using Neural Vocoders
Authors: Rodrigo Mira, Buye Xu, Jacob Donley, Anurag Kumar, Stavros Petridis, Vamsi Krishna Ithapu, Maja Pantic
Institution: Imperial College London; Meta

[ICASSP-2023] Learning Audio-Visual Dereverberation
Authors: Changan Chen, Wei Sun, David Harwath, Kristen Grauman
Institution: The University of Texas at Austin; Meta AI

[Interspeech-2023] Incorporating Ultrasound Tongue Images for Audio-Visual Speech Enhancement through Knowledge Distillation
Authors: Rui-Chen Zheng, Yang Ai, Zhen-Hua Ling
Institution: University of Science and Technology of China

[Interspeech-2023] Audio-Visual Speech Separation in Noisy Environments with a Lightweight Iterative Model
Authors: Héctor Martel, Julius Richter, Kai Li, Xiaolin Hu, Timo Gerkmann
Institution: Tsinghua University; Universität Hamburg; Chinese Institute for Brain Research

[ITG-2023] Audio-Visual Speech Enhancement with Score-Based Generative Models
Authors: Julius Richter, Simone Frintrop, Timo Gerkmann
Institution: Universität Hamburg

[Interspeech-2023] Speech inpainting: Context-based speech synthesis guided by video
Authors: Juan F. Montesinos, Daniel Michelsanti, Gloria Haro, Zheng-Hua Tan, Jesper Jensen
Institution: Universitat Pompeu Fabra; Aalborg University; Oticon A/S

[EUSIPCO-2023] Audio-Visual Speech Enhancement With Selective Off-Screen Speech Extraction
Authors: Tomoya Yoshinaga, Keitaro Tanaka, Shigeo Morishima
Institution: Waseda University; Waseda Research Institute for Science and Engineering

[IEEE/ACM TASLP-2023] Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition
Authors: Guinan Li, Jiajun Deng, Mengzhe Geng, Zengrui Jin, Tianzi Wang, Shujie Hu, Mingyu Cui, Helen Meng, Xunying Liu
Institution: The Chinese University of Hong Kong

[ICCV-2023] AdVerb: Visually Guided Audio Dereverberation
Authors: Sanjoy Chowdhury, Sreyan Ghosh, Subhrajyoti Dasgupta, Anton Ratnarajah, Utkarsh Tyagi, Dinesh Manocha
Institution: University of Maryland; University of Montreal

[IEEE/ACM TASLP-2023] Multi-Cue Guided Semi-Supervised Learning Toward Target Speaker Separation in Real Environments
Authors: Jiaming Xu, Jian Cui, Yunzhe Hao, Bo Xu
Institution: Xiaomi Corporation; University of Chinese Academy of Sciences

[ICASSP-2024] Consistent and Relevant: Rethink the Query Embedding in General Sound Separation
Authors: Yuanyuan Wang, Hangting Chen, Dongchao Yang, Jianwei Yu, Chao Weng, Zhiyong Wu, Helen Meng
Institution: Shenzhen International Graduate School, Tsinghua University, Shenzhen, China; Tencent AI Lab, Audio and Speech Signal Processing Oteam, China; The Chinese University of Hong Kong, Hong Kong SAR, China

[ICASSP-2024] SECP: A Speech Enhancement-Based Curation Pipeline For Scalable Acquisition Of Clean Speech
Authors: Adam Sabra, Cyprian Wronka, Michelle Mao, Samer Hijazi
Institution: Cisco Systems, Inc

[IJCAI-2024] Separate in the Speech Chain: Cross-Modal Conditional Audio-Visual Target Speech Extraction
Authors: Zhaoxi Mu, Xinyu Yang
Institution: Xi’an Jiaotong University

[InterSpeech-2024] FlowAVSE: Efficient Audio-Visual Speech Enhancement with Conditional Flow Matching
Authors: Chaeyoung Jung, Suyeon Lee, Ji-Hoon Kim, Joon Son Chung
Institution: Korea Advanced Institute of Science and Technology, South Korea

[InterSpeech-2024] RT-LA-VocE: Real-Time Low-SNR Audio-Visual Speech Enhancement
Authors: Honglie Chen, Rodrigo Mira, Stavros Petridis, Maja Pantic
Institution: Meta AI, UK; Imperial College London, UK

[ACM MM-2024] RAVSS: Robust Audio-Visual Speech Separation in Multi-Speaker Scenarios with Missing Visual Cues
Authors: Tianrui Pan, Jie Liu, Bohan Wang, Jie Tang, Gangshan Wu
Institution: Nanjing University, State Key Laboratory for Novel Software Technology, Nanjing, China

[INTERSPEECH-2024] LSTMSE-Net: Long Short Term Speech Enhancement Network for Audio-visual Speech Enhancement
Authors: Arnav Jain, Jasmer Singh Sanjotra, Harshvardhan Choudhary, Krish Agrawal, Rupal Shah, Rohan Jha, M. Sajid, Amir Hussain, M. Tanveer
Institution: Indian Institute of Technology Indore, Simrol, Indore, 453552, India; School of Computing, Edinburgh Napier University, EH11 4BN, Edinburgh, United Kingdom

Object Sound Separation

[ECCV-2018] Learning to Separate Object Sounds by Watching Unlabeled Video
Authors: Ruohan Gao, Rogerio Feris, Kristen Grauman
Institution: The University of Texas at Austin; IBM Research; Facebook AI Research

[ECCV-2018] The Sound of Pixels
Authors: Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott
Institution: Massachusetts Institute of Technology; MIT-IBM Watson AI Lab; Columbia University

[ICASSP-2019] Self-supervised Audio-visual Co-segmentation
Authors: Andrew Rouditchenko, Hang Zhao, Chuang Gan, Josh McDermott, Antonio Torralba
Institution: Massachusetts Institute of Technology; MIT-IBM Watson AI Lab

[ICCV-2019] The Sound of Motions
Authors: Hang Zhao, Chuang Gan, Wei-Chiu Ma, Antonio Torralba
Institution: Massachusetts Institute of Technology; MIT-IBM Watson AI Lab

[ICCV-2019] Recursive Visual Sound Separation Using Minus-Plus Net
Authors: Xudong Xu, Bo Dai, Dahua Lin
Institution: The Chinese University of Hong Kong

[ICCV-2019] Co-Separating Sounds of Visual Objects
Authors: Ruohan Gao, Kristen Grauman
Institution: The University of Texas at Austin; Facebook AI Research

[ACCV-2020] Visually Guided Sound Source Separation using Cascaded Opponent Filter Network
Authors: Lingyu Zhu, Esa Rahtu
Institution: Tampere University

[CVPR-2020] Music Gesture for Visual Sound Separation
Authors: Chuang Gan, Deng Huang, Hang Zhao, Joshua B. Tenenbaum, Antonio Torralba
Institution: Massachusetts Institute of Technology; MIT-IBM Watson AI Lab

[ICCV-2021] Visual Scene Graphs for Audio Source Separation
Authors: Moitreya Chatterjee, Jonathan Le Roux, Narendra Ahuja, Anoop Cherian
Institution: University of Illinois at Urbana-Champaign; Mitsubishi Electric Research Laboratories

[CVPR-2021] Cyclic Co-Learning of Sounding Object Visual Grounding and Sound Separation
Authors: Yapeng Tian, Di Hu, Chenliang Xu
Institution: University of Rochester; Renmin University of China; Beijing Key Laboratory of Big Data Management and Analysis Methods

[ECCV-2022] AudioScopeV2: Audio-Visual Attention Architectures for Calibrated Open-Domain On-Screen Sound Separation
Authors: Efthymios Tzinis, Scott Wisdom, Tal Remez, John R. Hershey
Institution: Google Research; University of Illinois Urbana-Champaign

[ICIP-2022] Visual Sound Source Separation with Partial Supervision Learning
Authors: Huasen Wang, Lingling Gao, Qianchao Tan, Luping Ji
Institution: University of Electronic Science and Technology of China

[NeurIPS-2022] Learning Audio-Visual Dynamics Using Scene Graphs for Audio Source Separation
Authors: Moitreya Chatterjee, Narendra Ahuja, Anoop Cherian
Institution: University of Illinois; Mitsubishi Electric Research Labs

[ICLR-2023] CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos
Authors: Hao-Wen Dong, Naoya Takahashi, Yuki Mitsufuji, Julian McAuley, Taylor Berg-Kirkpatrick
Institution: Sony Group Corporation; University of California San Diego

[CVPR-2023] Language-Guided Audio-Visual Source Separation via Trimodal Consistency
Authors: Reuben Tan, Arijit Ray, Andrea Burns, Bryan A. Plummer, Justin Salamon, Oriol Nieto, Bryan Russell, Kate Saenko
Institution: Boston University; Adobe Research; MIT-IBM Watson AI Lab, IBM Research

[CVPR-2023] iQuery: Instruments As Queries for Audio-Visual Sound Separation
Authors: Jiaben Chen, Renrui Zhang, Dongze Lian, Jiaqi Yang, Ziyao Zeng, Jianbo Shi
Institution: University of California San Diego; Shanghai AI Laboratory; The Chinese University of Hong Kong; National University of Singapore; ShanghaiTech University; University of Pennsylvania

[ICCV-2023] Separating Invisible Sounds Toward Universal Audiovisual Scene-Aware Sound Separation
Authors: Yiyang Su, Ali Vosoughi, Shijian Deng, Yapeng Tian, Chenliang Xu
Institution: Michigan State University; University of Rochester; University of Texas at Dallas

[WACV-2024] LAVSS: Location-Guided Audio-Visual Spatial Audio Separation
Authors: Yuxin Ye, Wenming Yang, Yapeng Tian
Institution: Tsinghua University; The University of Texas at Dallas

Face Super-resolution and Reconstruction

[CVPR-2020] Learning to Have an Ear for Face Super-Resolution
Authors: Givi Meishvili, Simon Jenni, Paolo Favaro
Institution: University of Bern

[IEEE TCSVT-2021] Appearance Matters, So Does Audio: Revealing the Hidden Face via Cross-Modality Transfer
Authors: Chenqi Kong, Baoliang Chen, Wenhan Yang, Haoliang Li, Peilin Chen, Shiqi Wang
Institution: City University of Hong Kong; Nanyang Technological University

[ICASSP-2022] Deep Video Inpainting Guided by Audio-Visual Self-Supervision
Authors: Kyuyeon Kim, Junsik Jung, Woo Jae Kim, Sung-Eui Yoon
Institution: Korea Advanced Institute of Science and Technology

[CVPR-2022] Cross-Modal Perceptionist: Can Face Geometry be Gleaned from Voices?
Authors: Cho-Ying Wu, Chin-Cheng Hsu, Ulrich Neumann
Institution: University of Southern California

[WACV-2023] Audio-Visual Face Reenactment
Authors: Madhav Agarwal, Rudrabha Mukhopadhyay, Vinay Namboodiri, C V Jawahar
Institution: International Institute of Information Technology, Hyderabad; University of Bath

[ICASSP-2023] Hearing and Seeing Abnormality: Self-Supervised Audio-Visual Mutual Learning for Deepfake Detection
Authors: Changsung Sung, Juncheng Chen, Chusong Chen
Institution: National Taiwan University; Academia Sinica

[CVPR-2023] AVFace: Towards Detailed Audio-Visual 4D Face Reconstruction
Authors: Aggelina Chatziagapi, Dimitris Samaras
Institution: Stony Brook University

[CVPR-2023] Parametric Implicit Face Representation for Audio-Driven Facial Reenactment
Authors: Ricong Huang, Peiwen Lai, Yipeng Qin, Guanbin Li
Institution: Sun Yat-sen University; Cardiff University

[CVPR-2023] CodeTalker: Speech-Driven 3D Facial Animation with Discrete Motion Prior
Authors: Jinbo Xing, Menghan Xia, Yuechen Zhang, Xiaodong Cun, Jue Wang, Tien-Tsin Wong
Institution: The Chinese University of Hong Kong; Tencent AI Lab

[ICASSP-2024] Fusion of Audio and Visual Embeddings for Sound Event Localization and Detection
Authors: Davide Berghi, Peipei Wu, Jinzheng Zhao, Wenwu Wang, Philip J. B. Jackson
Institution: Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, U.K.

[CVPR-2024] AVFF: Audio-Visual Feature Fusion for Video Deepfake Detection
Authors: Trevine Oorloff, Surya Koppisetti, Nicolò Bonettini, Divyaraj Solanki, Ben Colman, Yaser Yacoob, Ali Shahriyari, Gaurav Bharaj
Institution: University of Maryland - College Park; Reality Defender Inc.

[BMVC-2024] Content and Style Aware Audio-Driven Facial Animation
Authors: Qingju Liu, Hyeongwoo Kim, Gaurav Bharaj
Institution: Flawless AI, UK; Imperial College London, UK

[BMVC-2024] Detecting Audio-Visual Deepfakes with Fine-Grained Inconsistencies
Authors: Marcella Astrid, Enjie Ghorbel, Djamila Aouada
Institution: Computer Vision, Imaging & Machine Intelligence Research Group (CVI2), Interdisciplinary Centre for Security, Reliability and Trust (SnT), University of Luxembourg, Luxembourg; Cristal Laboratory, National School of Computer Sciences, Manouba University, Tunisia

[SIGGRAPH-2024] PersonaTalk: Bring Attention to Your Persona in Visual Dubbing
Authors: Longhao Zhang, Shuang Liang, Zhipeng Ge, Tianshu Hu
Institution: ByteDance, China

Cross-modal Perception

Cross-modal Generation

Mono Sound Generation

Speech

[ICASSP-2017] Vid2speech: Speech Reconstruction From Silent Video
Authors: Ariel Ephrat, Shmuel Peleg
Institution: The Hebrew University of Jerusalem

[ICCV-2017] Improved Speech Reconstruction From Silent Video
Authors: Ariel Ephrat, Tavi Halperin, Shmuel Peleg
Institution: The Hebrew University of Jerusalem

[ICASSP-2018] Lip2Audspec: Speech Reconstruction from Silent Lip Movements Video
Authors: Hassan Akbari, Himani Arora, Liangliang Cao, Nima Mesgarani
Institution: Columbia University

[ACM MM-2018] Harnessing AI for Speech Reconstruction using Multi-view Silent Video Feed
Authors: Yaman Kumar, Mayank Aggarwal, Pratham Nawal, Shin’ichi Satoh, Rajiv Ratn Shah, Roger Zimmermann
Institution: Netaji Subhas Institute of Technology; National Institute of Informatics; Indraprastha Institute of Information Technology; National University of Singapore

[2019] Video-Driven Speech Reconstruction using Generative Adversarial Networks
Authors: Konstantinos Vougioukas, Pingchuan Ma, Stavros Petridis, Maja Pantic
Institution: Imperial College London; Samsung AI Centre

[Interspeech-2019] Hush-Hush Speak: Speech Reconstruction Using Silent Videos
Authors: Shashwat Uttam, Yaman Kumar Singla, Dhruva Sahrawat, Mansi Agarwal
Institution: Netaji Subhas Institute of Technology; Adobe Research; National University of Singapore; Delhi Technological University

[ICASSP-2021] Learning Audio-Visual Correlations From Variational Cross-Modal Generation
Authors: Ye Zhu, Yu Wu, Hugo Latapie, Yi Yang, Yan Yan
Institution: Illinois Institute of Technology; University of Technology Sydney; Cisco

[IEEE TCYB-2022] End-to-End Video-to-Speech Synthesis Using Generative Adversarial Networks
Authors: Rodrigo Mira, Konstantinos Vougioukas, Pingchuan Ma, Stavros Petridis, Björn W. Schuller, Maja Pantic
Institution: Imperial College London; University of Augsburg; Meta AI

[ICPR-2022] Learning Speaker-specific Lip-to-Speech Generation
Authors: Munender Varshney, Ravindra Yadav, Vinay P. Namboodiri, Rajesh M Hegde
Institution: Indian Institute of Technology; University of Bath

[ICASSP-2023] Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech
Authors: Jiyoung Lee, Joon Son Chung, Soo-Whan Chung
Institution: NAVER AI Lab; Korea Advanced Institute of Science and Technology; NAVER Cloud

[CVPR-2023] ReVISE: Self-Supervised Speech Resynthesis With Visual Input for Universal and Generalized Speech Regeneration
Authors: Wei-Ning Hsu, Tal Remez, Bowen Shi, Jacob Donley, Yossi Adi
Institution: Meta AI; Meta Reality Labs Research; Toyota Technological Institute at Chicago; The Hebrew University of Jerusalem

[ICCV-2023] DiffV2S: Diffusion-based Video-to-Speech Synthesis with Vision-guided Speaker Embedding
Authors: Jeongsoo Choi, Joanna Hong, Yong Man Ro
Institution: Korea Advanced Institute of Science and Technology

[CVPR-2024] Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners
Authors: Yazhou Xing, Yingqing He, Zeyue Tian, Xintao Wang, Qifeng Chen
Institution: HKUST; ARC Lab, Tencent PCG

Music

[IEEE TMM-2015] Real-Time Piano Music Transcription Based on Computer Vision
Authors: Mohammad Akbari, Howard Cheng
Institution: Simon Fraser University; University of Lethbridge

[ACM MM-2017] Deep Cross-Modal Audio-Visual Generation
Authors: Lele Chen, Sudhanshu Srivastava, Zhiyao Duan, Chenliang Xu
Institution: University of Rochester

[NeurIPS-2020] Audeo: Audio Generation for a Silent Performance Video
Authors: Kun Su, Xiulong Liu, Eli Shlizerman
Institution: University of Washington

[ECCV-2020] Foley Music: Learning to Generate Music from Videos
Authors: Chuang Gan, Deng Huang, Peihao Chen, Joshua B. Tenenbaum, Antonio Torralba
Institution: Massachusetts Institute of Technology; MIT-IBM Watson AI Lab

[ICASSP-2020] Sight to Sound: An End-to-End Approach for Visual Piano Transcription
Authors: A. Sophia Koepke, Olivia Wiles, Yael Moses, Andrew Zisserman
Institution: University of Oxford; The Interdisciplinary Center

[2020] Multi-Instrumentalist Net: Unsupervised Generation of Music from Body Movements
Authors: Kun Su, Xiulong Liu, Eli Shlizerman
Institution: University of Washington

[ICASSP-2021] Collaborative Learning to Generate Audio-Video Jointly
Authors: Vinod K Kurmi, Vipul Bajaj, Badri N Patro, K S Venkatesh, Vinay P Namboodiri, Preethi Jyothi
Institution: Indian Institute of Technology Kanpur; University of Bath; Indian Institute of Technology Bombay

[ACM MM-2021] Video Background Music Generation with Controllable Music Transformer
Authors: Shangzhe Di, Zeren Jiang, Si Liu, Zhaokai Wang, Leyan Zhu, Zexin He, Hongming Liu, Shuicheng Yan
Institution: Beihang University; Charterhouse School, Godalming, Surrey; Sea AI Lab

[2022] Vis2Mus: Exploring Multimodal Representation Mapping for Controllable Music Generation
Authors: Runbang Zhang, Yixiao Zhang, Kai Shao, Ying Shan, Gus Xia
Institution: New York University, Shanghai; Queen Mary University of London; Tencent Inc.; Mohamed bin Zayed University of Artificial Intelligence

[CVPR-2023] Conditional Generation of Audio from Video via Foley Analogies
Authors: Yuexi Du, Ziyang Chen, Justin Salamon, Bryan Russell, Andrew Owens
Institution: University of Michigan; Yale University; Adobe Research

[ICML-2023] Long-Term Rhythmic Video Soundtracker
Authors: Jiashuo Yu, Yaohui Wang, Xinyuan Chen, Xiao Sun, Yu Qiao
Institution: Shanghai Artificial Intelligence Laboratory

Natural Sound

[CVPR-2016] Visually Indicated Sounds
Authors: Andrew Owens, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, William T. Freeman
Institution: Massachusetts Institute of Technology; U.C. Berkeley; Google Research

[CVPR-2018] Visual to Sound: Generating Natural Sound for Videos in the Wild
Authors: Yipin Zhou, Zhaowen Wang, Chen Fang, Trung Bui, Tamara L. Berg
Institution: University of North Carolina at Chapel Hill; Adobe Research

[IEEE TIP-2020] Generating Visually Aligned Sound From Videos
Authors: Peihao Chen, Yang Zhang, Mingkui Tan, Hongdong Xiao, Deng Huang, Chuang Gan
Institution: South China University of Technology; China Pazhou Laboratory; MIT-IBM Watson AI Lab

[BMVC-2021] Taming Visually Guided Sound Generation
Authors: Vladimir Iashin, Esa Rahtu
Institution: Tampere University

[IEEE TCSVT-2022] Towards an End-to-End Visual-to-Raw-Audio Generation With GAN
Authors: Shiguang Liu, Sijia Li, Haonan Cheng
Institution: Tianjin University

[ICASSP-2023] I Hear Your True Colors: Image Guided Audio Generation
Authors: Roy Sheffer, Yossi Adi
Institution: The Hebrew University of Jerusalem

[CVPR-2023] Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos
Authors: Kun Su, Kaizhi Qian, Eli Shlizerman, Antonio Torralba, Chuang Gan
Institution: University of Washington; MIT-IBM Watson AI Lab; MIT; UMass Amherst

Spatial Sound Generation

[ACM TOG-2018] Scene-aware audio for 360° videos
Authors: Dingzeyu Li, Timothy R. Langlois, Changxi Zheng
Institution: Columbia University; Adobe Research

[NeurIPS-2018] Self-Supervised Generation of Spatial Audio for 360° Video
Authors: Pedro Morgado, Nuno Vasconcelos, Timothy Langlois, Oliver Wang
Institution: University of California San Diego; Adobe Research

[CVPR-2019] 2.5D Visual Sound
Authors: Ruohan Gao, Kristen Grauman
Institution: The University of Texas at Austin; Facebook AI Research

[ICIP-2019] Self-Supervised Audio Spatialization with Correspondence Classifier
Authors: Yu-Ding Lu, Hsin-Ying Lee, Hung-Yu Tseng, Ming-Hsuan Yang
Institution: University of California at Merced

[ECCV-2020] Sep-Stereo: Visually Guided Stereophonic Audio Generation by Associating Source Separation
Authors: Hang Zhou, Xudong Xu, Dahua Lin, Xiaogang Wang, Ziwei Liu
Institution: The Chinese University of Hong Kong

[CVPR-2021] Visually Informed Binaural Audio Generation without Binaural Audios
Authors: Xudong Xu, Hang Zhou, Ziwei Liu, Bo Dai, Xiaogang Wang, Dahua Lin
Institution: The Chinese University of Hong Kong; Nanyang Technological University

[AAAI-2021] Exploiting Audio-Visual Consistency with Partial Supervision for Spatial Audio Generation
Authors: Yan-Bo Lin, Yu-Chiang Frank Wang
Institution: National Taiwan University; ASUS Intelligent Cloud Services

[TOG-2021] Binaural Audio Generation via Multi-task Learning
Authors: Sijia Li, Shiguang Liu, Dinesh Manocha
Institution: Tianjin University; University of Maryland at College Park

[WACV-2022] Beyond Mono to Binaural: Generating Binaural Audio From Mono Audio With Depth and Cross Modal Attention
Authors: Kranti Kumar Parida, Siddharth Srivastava, Gaurav Sharma
Institution: Indian Institute of Technology Kanpur; CDAC Noida; TensorTour Inc.

[CVPR-2023] Novel-View Acoustic Synthesis
Authors: Changan Chen, Alexander Richard, Roman Shapovalov, Vamsi Krishna Ithapu, Natalia Neverova, Kristen Grauman, Andrea Vedaldi
Institution: University of Texas at Austin; Meta AI

Video Generation

Talking Face

[ACM TOG-2017] Synthesizing Obama: learning lip sync from audio
Authors: Supasorn Suwajanakorn, Steven Maxwell Seitz, Ira Kemelmacher-Shlizerman
Institution: University of Washington

[ECCV-2018] Lip Movements Generation at a Glance
Authors: Lele Chen, Zhiheng Li, Ross K Maddox, Zhiyao Duan, Chenliang Xu
Institution: University of Rochester

[IJCV-2019] You Said That?: Synthesising Talking Faces from Audio
Authors: Amir Jamaludin, Joon Son Chung, Andrew Zisserman
Institution: University of Oxford

[ICCV-2019] Few-Shot Adversarial Learning of Realistic Neural Talking Head Models
Authors: Egor Zakharov, Aliaksandra Shysheya, Egor Burkov, Victor Lempitsky
Institution: Samsung AI Center; Skolkovo Institute of Science and Technology

[IJCV-2020] Realistic Speech-Driven Facial Animation with GANs
Authors: Konstantinos Vougioukas, Stavros Petridis, Maja Pantic
Institution: Imperial College London; Samsung AI Research Centre Cambridge

[IJCV-2020] GANimation: One-Shot Anatomically Consistent Facial Animation
Authors: Albert Pumarola, Antonio Agudo, Aleix M. Martinez, Alberto Sanfeliu, Francesc Moreno-Noguer
Institution: Institut de Robòtica i Informàtica Industrial; The Ohio State University

[ACM TOG-2020] MakeItTalk: Speaker-Aware Talking-Head Animation
Authors: Yang Zhou, Xintong Han, Eli Shechtman, Jose Echevarria, Evangelos Kalogerakis, Dingzeyu Li
Institution: University of Massachusetts Amherst; Huya Inc.; Adobe Research

[CVPR-2020] FReeNet: Multi-Identity Face Reenactment
Authors: Jiangning Zhang, Xianfang Zeng, Mengmeng Wang, Yusu Pan, Liang Liu, Yong Liu, Yu Ding, Changjie Fan
Institution: Zhejiang University; Fuxi AI Lab

[ECCV-2020] Neural Voice Puppetry: Audio-driven Facial Reenactment
Authors: Justus Thies, Mohamed Elgharib, Ayush Tewari, Christian Theobalt, Matthias Nießner
Institution: Technical University of Munich; Saarland Informatics Campus

[CVPR-2020] Rotate-and-Render: Unsupervised Photorealistic Face Rotation from Single-View Images
Authors: Hang Zhou, Jihao Liu, Ziwei Liu, Yu Liu, Xiaogang Wang
Institution: The Chinese University of Hong Kong; SenseTime Research

[ECCV-2020] MEAD: A Large-scale Audio-visual Dataset for Emotional Talking-face Generation
Authors: Kaisiyuan Wang, Qianyi Wu, Linsen Song, Zhuoqian Yang, Wayne Wu, Chen Qian, Ran He, Yu Qiao, Chen Change Loy
Institution: SenseTime Research; Carnegie Mellon University; Center for Research on Intelligent Perception and Computing, CASIA; University of Chinese Academy of Sciences; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Nanyang Technological University

[AAAI-2021] Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation
Authors: Lincheng Li, Suzhen Wang, Zhimeng Zhang, Yu Ding, Yixing Zheng, Xin Yu, Changjie Fan
Institution: Netease Fuxi AI Lab; University of Technology Sydney

[CVPR-2021] Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation
Authors: Hang Zhou, Yasheng Sun, Wayne Wu, Chen Change Loy, Xiaogang Wang, Ziwei Liu
Institution: The Chinese University of Hong Kong; SenseTime Research; Tokyo Institute of Technology; Nanyang Technological University

[CVPR-2021] Audio-Driven Emotional Video Portraits
Authors: Xinya Ji, Hang Zhou, Kaisiyuan Wang, Wayne Wu, Chen Change Loy, Xun Cao, Feng Xu
Institution: Nanjing University; The Chinese University of Hong Kong; The University of Sydney; SenseTime Research; Nanyang Technological University; Tsinghua University

[AAAI-2022] One-shot Talking Face Generation from Single-speaker Audio-Visual Correlation Learning
Authors: Suzhen Wang, Lincheng Li, Yu Ding, Xin Yu
Institution: Netease Fuxi AI Lab; University of Technology Sydney

[TVCG-2022] Generating talking face with controllable eye movements by disentangled blinking feature
Authors: Shiguang Liu, Jiaqi Hao
Institution: Tianjin University

[AAAI-2022] SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory
Authors: Se Jin Park, Minsu Kim, Joanna Hong, Jeongsoo Choi, Yong Man Ro
Institution: Korea Advanced Institute of Science and Technology

[CVPR-2022] FaceFormer: Speech-Driven 3D Facial Animation with Transformers
Authors: Yingruo Fan, Zhaojiang Lin, Jun Saito, Wenping Wang, Taku Komura
Institution: The University of Hong Kong; The Hong Kong University of Science and Technology; Adobe Research; Texas A&M University

[CVPR-2023] Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert
Authors: Jiadong Wang, Xinyuan Qian, Malu Zhang, Robby T. Tan, Haizhou Li
Institution: National University of Singapore; University of Science and Technology Beijing; University of Electronic Science and Technology of China; The Chinese University of Hong Kong

[ICASSP-2023] Free-View Expressive Talking Head Video Editing
Authors: Yuantian Huang, Satoshi Iizuka, Kazuhiro Fukui
Institution: University of Tsukuba

[ICASSP-2023] Audio-Driven Facial Landmark Generation in Violin Performance using 3DCNN Network with Self Attention Model
Authors: Tingwei Lin, Chaolin Liu, Li Su
Institution: Taiwan International Graduate Program; Academia Sinica; National Chengchi University

[ICASSP-2023] Naturalistic Head Motion Generation from Speech
Authors: Trisha Mittal, Zakaria Aldeneh, Masha Fedzechkina, Anurag Ranjan, Barry-John Theobald
Institution: University of Maryland; Apple Inc.

[ICASSP-2023] Audio-Visual Inpainting: Reconstructing Missing Visual Information with Sound
Authors: Valentina Sanguineti, Sanket Thakur, Pietro Morerio, Alessio Del Bue, Vittorio Murino
Institution: Istituto Italiano di Tecnologia; University of Genova

[CVPR-2023] Identity-Preserving Talking Face Generation with Landmark and Appearance Priors
Authors: Weizhi Zhong, Chaowei Fang, Yinqi Cai, Pengxu Wei, Gangming Zhao, Liang Lin, Guanbin Li
Institution: Sun Yat-sen University; Xidian University; The University of Hong Kong

[CVPR-2023] SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation
Authors: Wenxuan Zhang, Xiaodong Cun, Xuan Wang, Yong Zhang, Xi Shen, Yu Guo, Ying Shan, Fei Wang
Institution: Xi’an Jiaotong University; National Key Laboratory of Human-Machine Hybrid Augmented Intelligence; Tencent AI Lab; Ant Group

[ACM MM-2023] Hierarchical Semantic Perceptual Listener Head Video Generation: A High-performance Pipeline
Authors: Zhigang Chang, Weitai Hu, Qing Yang, Shibao Zheng
Institution: Du Xiaoman Financial; Shanghai Jiao Tong University

[CVPR-2023] LipFormer: High-fidelity and Generalizable Talking Face Generation with A Pre-learned Facial Codebook
Authors: Jiayu Wang, Kang Zhao, Shiwei Zhang, Yingya Zhang, Yujun Shen, Deli Zhao, Jingren Zhou
Institution: Alibaba Group

[CVPR-2024] Audio-Visual Speech Representation Expert for Enhanced Talking Face Video Generation and Evaluation
Authors: Dogucan Yaman, Fevziye Irem Eyiokur, Leonard Bärmann, Seymanur Aktı, Hazım Kemal Ekenel, Alexander Waibel
Institution: Karlsruhe Institute of Technology; Istanbul Technical University; Carnegie Mellon University

[InterSpeech-2024] Enhancing Speech-Driven 3D Facial Animation with Audio-Visual Guidance from Lip Reading Expert
Authors: Han EunGi, Oh Hyun-Bin, Kim Sung-Bin, Corentin Nivelet Etcheberry, Suekyeong Nam, Janghoon Joo, Tae-Hyun Oh
Institution: Grad. School of Artificial Intelligence and Dept. of Electrical Engineering, POSTECH, Korea; ENSC, Bordeaux INP, France; KRAFTON, Korea; Inst. for Convergence Research and Education in Advanced Technology, Yonsei University, Korea

[ACM MM-2024] ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer
Authors: Jiazhi Guan, Zhiliang Xu, Hang Zhou, Kaisiyuan Wang, Shengyi He, Zhanwang Zhang, Borong Liang, Haocheng Feng, Errui Ding, Jingtuo Liu, Jingdong Wang, Youjian Zhao, Ziwei Liu
Institution: BNRist, DCST, Tsinghua University; Baidu Inc.; Zhongguancun Laboratory; S-Lab, Nanyang Technological University

[ECCV-2024] KMTalk: Speech-Driven 3D Facial Animation with Key Motion Embedding
Authors: Zhihao Xu, Shengjie Gong, Jiapeng Tang, Lingyu Liang, Yining Huang, Haojie Li, Shuangping Huang
Institution: South China University of Technology; Technical University of Munich; Pazhou Laboratory

Gesture

[IVA-2018] Evaluation of Speech-to-Gesture Generation Using Bi-Directional LSTM Network
Authors: Dai Hasegawa, Naoshi Kaneko, Shinichi Shirakawa, Hiroshi Sakuta, Kazuhiko Sumi
Institution: Hokkai Gakuen University Sapporo; Aoyama Gakuin University; Yokohama National University

[IVA-2019] Analyzing Input and Output Representations for Speech-Driven Gesture Generation
Authors: Taras Kucherenko, Dai Hasegawa, Gustav Eje Henter, Naoshi Kaneko, Hedvig Kjellström
Institution: KTH Royal Institute of Technology in Stockholm; Hokkai Gakuen University; Aoyama Gakuin University

[CVPR-2019] Learning Individual Styles of Conversational Gesture
Authors: Shiry Ginosar, Amir Bar, Gefen Kohavi, Caroline Chan, Andrew Owens, Jitendra Malik
Institution: University of California, Berkeley; Zebra Medical Vision; Massachusetts Institute of Technology

[ICMI-2019] To React or not to React: End-to-End Visual Pose Forecasting for Personalized Avatar during Dyadic Conversations
Authors: Chaitanya Ahuja, Shugao Ma, Louis-Philippe Morency, Yaser Sheikh
Institution: Carnegie Mellon University; Facebook Reality Labs

[EUROGRAPHICS-2020] Style-Controllable Speech-Driven Gesture Synthesis Using Normalising Flows
Authors: Simon Alexanderson, Gustav Eje Henter, Taras Kucherenko, Jonas Beskow
Institution: KTH Royal Institute of Technology

[ICMI-2020] Gesticulator: A Framework For Semantically-Aware Speech-Driven Gesture Generation
Authors: Taras Kucherenko, Patrik Jonell, Sanne van Waveren, Gustav Eje Henter, Simon Alexanderson, Iolanda Leite, Hedvig Kjellström
Institution: KTH Royal Institute of Technology

[2020] Style Transfer for Co-Speech Gesture Animation: A Multi-Speaker Conditional-Mixture Approach
Authors: Chaitanya Ahuja, Dong Won Lee, Yukiko I. Nakano, Louis-Philippe Morency
Institution: Carnegie Mellon University; Seikei University

[ACM TOG-2020] Speech Gesture Generation From The Trimodal Context Of Text, Audio, And Speaker Identity
Authors: Youngwoo Yoon, Bok Cha, Joo-Haeng Lee, Minsu Jang, Jaeyeon Lee, Jaehong Kim, Geehyuk Lee
Institution: Korea Advanced Institute of Science and Technology; University of Science and Technology; Electronics and Telecommunications Research Institute

[CVPR-2022] SEEG: Semantic Energized Co-Speech Gesture Generation
Authors: Yuanzhi Liang, Qianyu Feng, Linchao Zhu, Li Hu, Pan Pan, Yi Yang
Institution: Alibaba; University of Technology Sydney; Zhejiang University

[IEEE TNNLS-2022] VAG: A Uniform Model for Cross-Modal Visual-Audio Mutual Generation
Authors: Wangli Hao, He Guan, Zhaoxiang Zhang
Institution: Chinese Academy of Sciences; University of Chinese Academy of Sciences

[CVPR-2023] Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation
Authors: Lingting Zhu, Xian Liu, Xuanyu Liu, Rui Qian, Ziwei Liu, Lequan Yu
Institution: The University of Hong Kong; The Chinese University of Hong Kong; Nanyang Technological University

[IJCAI-2024] Bridge to Non-Barrier Communication: Gloss-Prompted Fine-grained Cued Speech Gesture Generation with Diffusion Model
Authors: Wentao Lei, Li Liu, Jun Wang
Institution: The Hong Kong University of Science and Technology (Guangzhou); Tencent AI Lab; The Hong Kong University of Science and Technology

Dance

[ACM MM-2018] Dance with Melody: An LSTM-autoencoder Approach to Music-oriented Dance Synthesis
Authors: Taoran Tang, Jia Jia, Hanyang Mao
Institution: Tsinghua University

[CVPR-2018] Audio to Body Dynamics
Authors: Eli Shlizerman, Lucio Dery, Hayden Schoen, Ira Kemelmacher-Shlizerman
Institution: Facebook Inc.; Stanford University; University of Washington

[NeurIPS-2019] Dancing to Music
Authors: Hsin-Ying Lee, Xiaodong Yang, Ming-Yu Liu, Ting-Chun Wang, Yu-Ding Lu, Ming-Hsuan Yang, Jan Kautz
Institution: University of California; NVIDIA

[ICLR-2021] Dance Revolution: Long-Term Dance Generation with Music via Curriculum Learning
Authors: Ruozi Huang, Huang Hu, Wei Wu, Kei Sawada, Mi Zhang, Daxin Jiang
Institution: Fudan University; Microsoft STCA; Meituan; Rinna AI

[ICCV-2021] AI Choreographer: Music Conditioned 3D Dance Generation With AIST++
Authors: Ruilong Li, Shan Yang, David A. Ross, Angjoo Kanazawa
Institution: University of Southern California; Google Research; University of California, Berkeley

[ICASSP-2022] Genre-Conditioned Long-Term 3D Dance Generation Driven by Music
Authors: Yuhang Huang, Junjie Zhang, Shuyan Liu, Qian Bao, Dan Zeng, Zhineng Chen, Wu Liu
Institution: Shanghai University; University of Chinese Academy of Sciences; JD AI Research; Fudan University

[CVPR-2022] Bailando: 3D Dance Generation by Actor-Critic GPT with Choreographic Memory
Authors: Li Siyao, Weijiang Yu, Tianpei Gu, Chunze Lin, Quan Wang, Chen Qian, Chen Change Loy, Ziwei Liu
Institution: Nanyang Technological University; Sun Yat-Sen University; University of California, Los Angeles; SenseTime Research

[CVPR-2023] MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation
Authors: Ludan Ruan, Yiyang Ma, Huan Yang, Huiguo He, Bei Liu, Jianlong Fu, Nicholas Jing Yuan, Qin Jin, Baining Guo
Institution: Renmin University of China; Peking University; Microsoft Research

[IEEE TMM-2023] Learning Music-Dance Representations through Explicit-Implicit Rhythm Synchronization
Authors: Jiashuo Yu, Junfu Pu, Ying Cheng, Rui Feng, Ying Shan
Institution: Shanghai Key Lab of Intelligent Information Processing; Fudan University; Tencent

[VCJ-2024] QEAN: Quaternion-Enhanced Attention Network for Visual Dance Generation
Authors: Zhizhen Zhou, Yejing Huo, Guoheng Huang, An Zeng, Xuhang Chen, Lian Huang, Zinuo Li
Institution: Guangdong University of Technology, Guangdong, China; Huizhou University, Guangdong, China; Guangdong Mechanical and Electrical College, Guangdong, China; University of Western Australia, WA, Australia

Image Manipulation

[2021] Sound-guided semantic image manipulation
Authors: Seung Hyun Lee, Wonseok Roh, Wonmin Byeon, Sang Ho Yoon, Chan Young Kim, Jinkyu Kim, Sangpil Kim
Institution: Korea University; Korea Advanced Institute of Science and Technology; NVIDIA Corp.

[2022] Learning visual styles from audio-visual associations
Authors: Tingle Li, Yichen Liu, Andrew Owens, Hang Zhao
Institution: Tsinghua University; University of Michigan; Shanghai Qi Zhi Institute

[CVPR-2023] Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment
Authors: Kim Sung-Bin, Arda Senocak, Hyunwoo Ha, Andrew Owens, Tae-Hyun Oh
Institution: Pohang University of Science and Technology; Korea Advanced Institute of Science and Technology; University of Michigan; Yonsei University

[ACM MM-2024] An Inverse Partial Optimal Transport Framework for Music-guided Movie Trailer Generation
Authors: Yutong Wang, Sidan Zhu, Hongteng Xu, Dixin Luo
Institution: Beijing Institute of Technology, Beijing, China; Key Laboratory of Artificial Intelligence, Ministry of Education, Shanghai, China; Renmin University of China, Beijing, China

Depth Estimation

[ICRA-2020] BatVision: Learning to See 3D Spatial Layout with Two Ears
Authors: Jesper Haahr Christensen, Sascha Hornauer, Stella X. Yu
Institution: Technical University of Denmark; University of California

[ECCV-2020] VisualEchoes: Spatial Image Representation Learning Through Echolocation
Authors: Ruohan Gao, Changan Chen, Ziad Al-Halah, Carl Schissler, Kristen Grauman
Institution: The University of Texas at Austin; Facebook Reality Lab; Facebook AI Research

[CVPR-2021] Beyond Image to Depth: Improving Depth Prediction Using Echoes
Authors: Kranti Kumar Parida, Siddharth Srivastava, Gaurav Sharma
Institution: Indian Institute of Technology Kanpur; Centre for Development of Advanced Computing Noida; TensorTour Inc.

[ICASSP-2022] Co-Attention-Guided Bilinear Model for Echo-Based Depth Estimation
Authors: Go Irie, Takashi Shibata, Akisato Kimura
Institution: Nippon Telegraph & Telephone Corporation

[NeurIPS-2022] Learning Neural Acoustic Fields
Authors: Andrew Luo, Yilun Du, Michael Tarr, Josh Tenenbaum, Antonio Torralba, Chuang Gan
Institution: Carnegie Mellon University; Massachusetts Institute of Technology; MIT-IBM Watson AI Lab

[NeurIPS-2022] Few-Shot Audio-Visual Learning of Environment Acoustics
Authors: Sagnik Majumder, Changan Chen, Ziad Al-Halah, Kristen Grauman
Institution: The University of Texas at Austin; Facebook AI Research

Audio-visual Transfer Learning

[NeurIPS-2016] SoundNet: Learning Sound Representations from Unlabeled Video
Authors: Yusuf Aytar, Carl Vondrick, Antonio Torralba
Institution: Massachusetts Institute of Technology

[ICCV-2019] Self-Supervised Moving Vehicle Tracking With Stereo Sound
Authors: Chuang Gan, Hang Zhao, Peihao Chen, David Cox, Antonio Torralba
Institution: Massachusetts Institute of Technology; MIT-IBM Watson AI Lab; IBM Research AI

[CVPR-2021] There Is More Than Meets the Eye: Self-Supervised Multi-Object Detection and Tracking With Sound by Distilling Multimodal Knowledge
Authors: Francisco Rivera Valverde, Juana Valeria Hurtado, Abhinav Valada
Institution: University of Freiburg

[AAAI-2021] Enhanced Audio Tagging via Multi- to Single-Modal Teacher-Student Mutual Learning
Authors: Yifang Yin, Harsh Shrivastava, Ying Zhang, Zhenguang Liu, Rajiv Ratn Shah, Roger Zimmermann
Institution: National University of Singapore; Northwestern Polytechnical University; Zhejiang Gongshang University; Indraprastha Institute of Information Technology, Delhi

[Interspeech-2021] Knowledge Distillation from Multi-Modality to Single-Modality for Person Verification
Authors: Leying Zhang, Zhengyang Chen, Yanmin Qian
Institution: Shanghai Jiao Tong University

[ICCV-2021] Multimodal Knowledge Expansion
Authors: Zihui Xue, Sucheng Ren, Zhengqi Gao, Hang Zhao
Institution: Shanghai Qi Zhi Institute; UT Austin; South China University of Technology; Massachusetts Institute of Technology; Tsinghua University

[CVPR-2021] Distilling Audio-visual Knowledge by Compositional Contrastive Learning
Authors: Yanbei Chen, Yongqin Xian, A. Sophia Koepke, Ying Shan, Zeynep Akata
Institution: University of Tübingen; MPI for Informatics; Tencent; Max Planck Institute for Intelligent Systems

[2022] Estimating Visual Information From Audio Through Manifold Learning
Authors: Fabrizio Pedersoli, Dryden Wiebe, Amin Banitalebi, Yong Zhang, George Tzanetakis, Kwang Moo Yi
Institution: University of British Columbia; Huawei Technologies Canada Co., Ltd; University of Victoria

[DCASE-2021] Audio-Visual Scene Classification Using A Transfer Learning Based Joint Optimization Strategy
Authors: Chengxin Chen, Meng Wang, Pengyuan Zhang
Institution: Institute of Acoustics, CAS; University of Chinese Academy of Sciences

[Interspeech-2021] Audiovisual transfer learning for audio tagging and sound event detection
Authors: Wim Boes, Hugo Van hamme
Institution: ESAT, KU Leuven

[2023] Revisiting Pre-training in Audio-Visual Learning
Authors: Ruoxuan Feng, Wenke Xia, Di Hu
Institution: Hunan University; Renmin University of China

[IJCNN-2023] A Generative Approach to Audio-Visual Generalized Zero-Shot Learning: Combining Contrastive and Discriminative Techniques
Authors: Qichen Zheng, Jie Hong, Moshiur Farazi
Institution: Australian National University; CSIRO Data61

[ICCV-2023] Audio-Visual Class-Incremental Learning
Authors: Weiguo Pian, Shentong Mo, Yunhui Guo, Yapeng Tian
Institution: The University of Texas at Dallas; Carnegie Mellon University

[ICCV-2023] Hyperbolic Audio-visual Zero-shot Learning
Authors: Jie Hong, Zeeshan Hayder, Junlin Han, Pengfei Fang, Mehrtash Harandi, Lars Petersson
Institution: Australian National University; CSIRO Data61

[CVPR-2023] Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models
Authors: Zhiqiu Lin, Samuel Yu, Zhiyi Kuang, Deepak Pathak, Deva Ramanan
Institution: Carnegie Mellon University

[ICCV-2023] Class-Incremental Grouping Network for Continual Audio-Visual Learning
Authors: Shentong Mo, Weiguo Pian, Yapeng Tian
Institution: Carnegie Mellon University; University of Texas at Dallas

[ICCV-2023] Dense 2D-3D Indoor Prediction with Sound via Aligned Cross-Modal Distillation
Authors: Heeseung Yun, Joonil Na, Gunhee Kim
Institution: Seoul National University

Cross-modal Retrieval

[2017] Content-Based Video-Music Retrieval Using Soft Intra-Modal Structure Constraint
Authors: Sungeun Hong, Woobin Im, Hyun S. Yang

[ICCV-2017] Image2song: Song Retrieval via Bridging Image Content and Lyric Words
Authors: Xuelong Li, Di Hu, Xiaoqiang Lu
Institution: Chinese Academy of Sciences; Northwestern Polytechnical University

[CVPR-2018] Seeing voices and hearing faces: Cross-modal biometric matching
Authors: Arsha Nagrani, Samuel Albanie, Andrew Zisserman
Institution: University of Oxford

[ECCV-2018] Cross-modal Embeddings for Video and Audio Retrieval
Authors: Didac Suris, Amanda Duarte, Amaia Salvador, Jordi Torres, Xavier Giro-i-Nieto
Institution: Universitat Politecnica de Catalunya; Barcelona Supercomputing Center

[ISM-2018] Audio-Visual Embedding for Cross-Modal Music Video Retrieval through Supervised Deep CCA
Authors: Donghuo Zeng, Yi Yu, Keizo Oyama
Institution: National Institute of Informatics

[TOMCCAP-2020] Deep Triplet Neural Networks with Cluster-CCA for Audio-Visual Cross-Modal Retrieval
Authors: Donghuo Zeng, Yi Yu, Keizo Oyama
Institution: National Institute of Informatics

[IEEE TGRS-2020] Deep Cross-Modal Image–Voice Retrieval in Remote Sensing
Authors: Yaxiong Chen, Xiaoqiang Lu, Shuai Wang
Institution: University of Chinese Academy of Sciences; Chinese Academy of Sciences

[2021] Learning Explicit and Implicit Latent Common Spaces for Audio-Visual Cross-Modal Retrieval
Authors: Donghuo Zeng, Jianming Wu, Gen Hattori, Yi Yu, Rong Xu
Institution: KDDI Research, Inc.; National Institute of Informatics, SOKENDAI

[ICCV-2021] Temporal Cue Guided Video Highlight Detection With Low-Rank Audio-Visual Fusion
Authors: Qinghao Ye, Xiyue Shen, Yuan Gao, Zirui Wang, Qi Bi, Ping Li, Guang Yang
Institution: Hangzhou Dianzi University; University of California; East China Normal University; University of Oxford; Wuhan University; Imperial College London

[ICMR-2024] Anchor-aware Deep Metric Learning for Audio-visual Retrieval
Authors: Donghuo Zeng, Yanan Wang, Kazushi Ikeda, Yi Yu
Institution: KDDI Research, Inc.; Hiroshima University

[IJCAI-2022] Unsupervised Voice-Face Representation Learning by Cross-Modal Prototype Contrast
Authors: Boqing Zhu, Kele Xu, Changjian Wang, Zheng Qin, Tao Sun, Huaimin Wang, Yuxing Peng
Institution: National University of Defense Technology

[IEEE ISM-2022] Complete Cross-triplet Loss in Label Space for Audio-visual Cross-modal Retrieval
Authors: Donghuo Zeng, Yanan Wang, Jianming Wu, Kazushi Ikeda
Institution: KDDI Research, Inc.

[IEEE SMC-2022] Graph Network based Approaches for Multi-modal Movie Recommendation System
Authors: Daipayan Chakder, Prabir Mondal, Subham Raj, Sriparna Saha, Angshuman Ghosh, Naoyuki Onoe
Institution: Indian Institute of Technology; Sony Research

[CVPR-2022] Visual Acoustic Matching
Authors: Changan Chen, Ruohan Gao, Paul Calamia, Kristen Grauman
Institution: University of Texas at Austin; Stanford University; Meta AI

[IEEE TMM-2023] Deep Cross-Modal Retrieval Between Spatial Image and Acoustic Speech
Authors: Xinyuan Qian, Wei Xue, Qiquan Zhang, Ruijie Tao, Haizhou Li
Institution: University of Science and Technology; Hong Kong University of Science and Technology; University of New South Wales; National University of Singapore; The Chinese University of Hong Kong-Shenzhen

[ICASSP-2024] Audio Match Cutting: Finding and Creating Matching Audio Transitions in Movies and Videos
Authors: Dennis Fedorishin, Lie Lu, Srirangaraj Setlur, Venu Govindaraju
Institution: Dolby Laboratories; University at Buffalo

Audio-visual Collaboration

Audio-visual Representation Learning

[ICCV-2017] Look, Listen and Learn
Authors: Relja Arandjelovic, Andrew Zisserman
Institution: Google Inc.; University of Oxford

[NeurIPS-2018] Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization
Authors: Bruno Korbar, Du Tran, Lorenzo Torresani
Institution: Dartmouth College; Facebook Research

[NeurIPS-2020] Learning Representations from Audio-Visual Spatial Alignment
Authors: Pedro Morgado, Yi Li, Nuno Vasconcelos
Institution: University of California

[NeurIPS-2020] Self-Supervised Learning by Cross-Modal Audio-Video Clustering
Authors: Humam Alwassel, Dhruv Mahajan, Bruno Korbar, Lorenzo Torresani, Bernard Ghanem, Du Tran
Institution: King Abdullah University of Science and Technology; Facebook AI Research

[NeurIPS-2020] Labelling Unlabelled Videos From Scratch With Multi-Modal Self-Supervision
Authors: Yuki Asano, Mandela Patrick, Christian Rupprecht, Andrea Vedaldi
Institution: University of Oxford; Facebook AI Research

[CVPR-2021] Audio-Visual Instance Discrimination with Cross-Modal Agreement
Authors: Pedro Morgado, Nuno Vasconcelos, Ishan Misra
Institution: University of California San Diego; Facebook AI Research

[CVPR-2021] Robust Audio-Visual Instance Discrimination
Authors: Pedro Morgado, Ishan Misra, Nuno Vasconcelos
Institution: University of California San Diego; Facebook AI Research

[2021] Unsupervised Sound Localization via Iterative Contrastive Learning
Authors: Yan-Bo Lin, Hung-Yu Tseng, Hsin-Ying Lee, Yen-Yu Lin, Ming-Hsuan Yang
Institution: National Yang Ming Chiao Tung University; University of California; Snap Inc.; Google Research

[ICCV-2021] Multimodal Clustering Networks for Self-Supervised Learning From Unlabeled Videos
Authors: Brian Chen, Andrew Rouditchenko, Kevin Duarte, Hilde Kuehne, Samuel Thomas, Angie Boggust, Rameswar Panda, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Michael Picheny, Shih-Fu Chang
Institution: Columbia University; Massachusetts Institute of Technology; University of Central Florida; Goethe University Frankfurt; IBM Research AI; MIT-IBM Watson AI Lab; The University of Texas at Austin; NYU-Courant CS & CDS

[2021] OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation
Authors: Jing Liu, Xinxin Zhu, Fei Liu, Longteng Guo, Zijia Zhao, Mingzhen Sun, Weining Wang, Hanqing Lu, Shiyu Zhou, Jiajun Zhang, Jinqiao Wang
Institution: Chinese Academy of Sciences

[NeurIPS-2021] VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
Authors: Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, Boqing Gong
Institution: Columbia University; Google Inc.; Cornell University

[2021] Audio-visual Representation Learning for Anomaly Events Detection in Crowds
Authors: Junyu Gao, Maoguo Gong, Xuelong Li
Institution: Xidian University; Northwestern Polytechnical University

[ICASSP-2022] AudioCLIP: Extending CLIP to Image, Text and Audio
Authors: Andrey Guzhov, Federico Raue, Jorn Hees, Andreas Dengel
Institution: TU Kaiserslautern, Germany; Deutsches Forschungszentrum für Künstliche Intelligenz GmbH

[CVPR-2022] MERLOT Reserve: Neural Script Knowledge Through Vision and Language and Sound
Authors: Rowan Zellers, Jiasen Lu, Ximing Lu, Youngjae Yu, Yanpeng Zhao, Mohammadreza Salehi, Aditya Kusupati, Jack Hessel, Ali Farhadi, Yejin Choi
Institution: University of Washington; Allen Institute for Artificial Intelligence; University of Edinburgh

[2022] Probing Visual-Audio Representation for Video Highlight Detection via Hard-Pairs Guided Contrastive Learning
Authors: Shuaicheng Li, Feng Zhang, Kunlin Yang, Lingbo Liu, Shinan Liu, Jun Hou, Shuai Yi
Institution: SenseTime Research; The Hong Kong Polytechnic University

[NeurIPS-2022] Non-Linguistic Supervision for Contrastive Learning of Sentence Embeddings
Authors: Yiren Jian, Chongyang Gao, Soroush Vosoughi
Institution: Dartmouth College; Northwestern University

[IEEE TMM-2022] Multimodal Information Bottleneck: Learning Minimal Sufficient Unimodal and Multimodal Representations
Authors: Sijie Mai, Ying Zeng, Haifeng Hu
Institution: Sun Yat-sen University; National Natural Science Foundation of China

[CVPR-2022] Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language
Authors: Otniel-Bogdan Mercea, Lukas Riesch, A. Sophia Koepke, Zeynep Akata
Institution: University of Tübingen; Robert Bosch GmbH; Max Planck Institute

[CVPRW-2022] Multi-task Learning for Human Affect Prediction with Auditory–Visual Synchronized Representation
Authors: Euiseok Jeong, Geesung Oh, Sejoon Lim
Institution: Kookmin University

[CVPR-2023] Vision Transformers are Parameter-Efficient Audio-Visual Learners
Authors: Yan-Bo Lin, Yi-Lin Sung, Jie Lei, Mohit Bansal, Gedas Bertasius
Institution: The University of North Carolina at Chapel Hill

[ECCV-2022] Temporal and cross-modal attention for audio-visual zero-shot learning
Authors: Otniel-Bogdan Mercea, Thomas Hummel, A. Sophia Koepke, Zeynep Akata
Institution: University of Tübingen; Max Planck Institute

[NeurIPS-2022] u-HuBERT: Unified Mixed-Modal Speech Pretraining And Zero-Shot Transfer to Unlabeled Modality
Authors: Wei-Ning Hsu, Bowen Shi
Institution: Meta AI

[NeurIPS-2022] Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization
Authors: Junru Wu, Yi Liang, Feng Han, Hassan Akbari, Zhangyang Wang, Cong Yu
Institution: Texas A&M University; Google Research; University of Texas at Austin; Celonis Inc.

[AAAI-2023] Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity
Authors: Pritam Sarkar, Ali Etemad
Institution: Queen’s University; Vector Institute

[ICLR-2023] Contrastive Audio-Visual Masked Autoencoder
Authors: Yuan Gong, Andrew Rouditchenko, Alexander H. Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, James R. Glass
Institution: Massachusetts Institute of Technology; The University of Texas at Austin; MIT-IBM Watson AI Lab; Goethe University Frankfurt

[ICLR-2023] Jointly Learning Visual and Auditory Speech Representations from Raw Data
Authors: Alexandros Haliassos, Pingchuan Ma, Rodrigo Mira, Stavros Petridis, Maja Pantic
Institution: Imperial College London; Meta AI

[WACV-2023] Audio Representation Learning by Distilling Video as Privileged Information
Authors: Amirhossein Hajavi, Ali Etemad
Institution: Queen’s University, Canada

[2023] AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations
Authors: Jiachen Lian, Alexei Baevski, Wei-Ning Hsu, Michael Auli
Institution: University of California; Meta AI

[AAAI-2023] Audio-Visual Contrastive Learning with Temporal Self-Supervision
Authors: Simon Jenni, Alexander Black, John Collomosse
Institution: Adobe Research; University of Surrey

[CVPR-2023] ImageBind: One Embedding Space to Bind Them All
Authors: Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra
Institution: Meta AI

[NeurIPS-2023] Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks
Authors: Haoyi Duan, Yan Xia, Mingze Zhou, Li Tang, Jieming Zhu, Zhou Zhao
Institution: Zhejiang University; Shanghai Artificial Intelligence Laboratory; Huawei Noah’s Ark Lab

[WACV-2024] OmniVec: Learning robust representations with cross modal sharing
Authors: Siddharth Srivastava, Gaurav Sharma
Institution: TensorTour Inc.

[InterSpeech-2024] Zero-Shot Fake Video Detection by Audio-Visual Consistency
Authors: Xiaolou Li, Zehua Liu, Chen Chen, Lantian Li, Li Guo, Dong Wang
Institution: School of Artificial Intelligence, Beijing University of Posts and Telecommunications, China; Center for Speech and Language Technologies, BNRist, Tsinghua University, China

[ICML-2024] From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation
Authors: Kun Su, Xiulong Liu, Eli Shlizerman
Institution: Department of ECE, University of Washington, Seattle, United States; Department of Applied Math, University of Washington, Seattle, United States

Audio-visual Localization

Sound Localization in Videos

[ECCV-2018] Objects that Sound
Authors: Relja Arandjelovic, Andrew Zisserman
Institution: Google Inc.; University of Oxford

[ECCV-2018] Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
Authors: Andrew Owens, Alexei A. Efros
Institution: University of California, Berkeley

[ECCV-2018] The Sound of Pixels
Authors: Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, Antonio Torralba
Institution: Massachusetts Institute of Technology; MIT-IBM Watson AI Lab; Columbia University

[ICASSP-2019] Self-supervised Audio-visual Co-segmentation
Authors: Andrew Rouditchenko, Hang Zhao, Chuang Gan, Josh McDermott, Antonio Torralba
Institution: Massachusetts Institute of Technology; MIT-IBM Watson AI Lab

[ICCV-2019] The Sound of Motions
Authors: Hang Zhao, Chuang Gan, Wei-Chiu Ma, Antonio Torralba
Institution: Massachusetts Institute of Technology; MIT-IBM Watson AI Lab

[CVPR-2019] Deep Multimodal Clustering for Unsupervised Audiovisual Learning
Authors: Di Hu, Feiping Nie, Xuelong Li
Institution: Northwestern Polytechnical University

[CVPR-2021] Localizing Visual Sounds the Hard Way
Authors: Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman
Institution: University of Oxford

[IEEE TPAMI-2021] Class-aware Sounding Objects Localization via Audiovisual Correspondence
Authors: Di Hu, Yake Wei, Rui Qian, Weiyao Lin, Ruihua Song, Ji-Rong Wen
Institution: Renmin University of China; Shanghai Jiao Tong University

[IEEE TPAMI-2021] Learning to Localize Sound Sources in Visual Scenes: Analysis and Applications
Authors: Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, In So Kweon
Institution: Korea Advanced Institute of Science and Technology; Pohang University of Science and Technology; University of California

[CVPR-2022] Mix and Localize: Localizing Sound Sources in Mixtures
Authors: Xixi Hu, Ziyang Chen, Andrew Owens
Institution: University of Michigan; The University of Texas at Austin

[ECCV-2022] Audio-Visual Segmentation
Authors: Jinxing Zhou, Jianyuan Wang, Jiayi Zhang, Weixuan Sun, Jing Zhang, Stan Birchfield, Dan Guo, Lingpeng Kong, Meng Wang, Yiran Zhong
Institution: Hefei University of Technology; SenseTime Research; Australian National University; Beihang University; NVIDIA; The University of Hong Kong; Shanghai Artificial Intelligence Laboratory

[2022] Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization
Authors: Hao Jiang, Calvin Murdock, Vamsi Krishna Ithapu
Institution: Meta Reality Labs

[ACM MM-2022] Exploiting Transformation Invariance and Equivariance for Self-supervised Sound Localisation
Authors: Jinxiang Liu, Chen Ju, Weidi Xie, Ya Zhang
Institution: Shanghai Jiao Tong University; Shanghai AI Laboratory

[CVPR-2022] Self-Supervised Predictive Learning: A Negative-Free Method for Sound Source Localization in Visual Scenes
Authors: Zengjie Song, Yuxi Wang, Junsong Fan, Tieniu Tan, Zhaoxiang Zhang
Institution: Chinese Academy of Sciences; University of Chinese Academy of Sciences

[CVPR-2022] Self-supervised object detection from audio-visual correspondence
Authors: Triantafyllos Afouras, Yuki M. Asano, Francois Fagan, Andrea Vedaldi, Florian Metze
Institution: University of Oxford; University of Amsterdam; Meta AI

[EUSIPCO-2022] Visually Assisted Self-supervised Audio Speaker Localization and Tracking
Authors: Jinzheng Zhao, Peipei Wu, Shidrokh Goudarzi, Xubo Liu, Jianyuan Sun, Yong Xu, Wenwu Wang
Institution: University of Surrey; Tencent AI Lab, Bellevue

[ICASSP-2023] MarginNCE: Robust Sound Localization with a Negative Margin
Authors: Sooyoung Park, Arda Senocak, Joon Son Chung
Institution: Korea Advanced Institute of Science and Technology; Electronics and Telecommunications Research Institute, South Korea

[IEEE TMM-2022] Cross modal video representations for weakly supervised active speaker localization
Authors: Rahul Sharma, Krishna Somandepalli, Shrikanth Narayanan
Institution: University of Southern California; Google Inc.

[NeurIPS-2022] A Closer Look at Weakly-Supervised Audio-Visual Source Localization
Authors: Shentong Mo, Pedro Morgado
Institution: Carnegie Mellon University; University of Wisconsin-Madison

[AAAI-2022] Visual Sound Localization in the Wild by Cross-Modal Interference Erasing
Authors: Xian Liu, Rui Qian, Hang Zhou, Di Hu, Weiyao Lin, Ziwei Liu, Bolei Zhou, Xiaowei Zhou
Institution: The Chinese University of Hong Kong; Zhejiang University; Shanghai Jiao Tong University; Renmin University of China; Nanyang Technological University

[ECCV-2022] Sound Localization by Self-Supervised Time Delay Estimation
Authors: Ziyang Chen, David F. Fouhey, Andrew Owens
Institution: University of Michigan

[IEEE/ACM TASLP-2023] Audio-Visual Cross-Attention Network for Robotic Speaker Tracking
Authors: Xinyuan Qian, Zhengdong Wang, Jiadong Wang, Guohui Guan, Haizhou Li
Institution: University of Science and Technology Beijing; Chinese University of Hong Kong; Shenzhen Research Institute of Big Data; National University of Singapore; University of California at Berkeley; University of Bremen

[WACV-2023] Hear The Flow: Optical Flow-Based Self-Supervised Visual Sound Source Localization
Authors: Dennis Fedorishin, Deen Dayal Mohan, Bhavin Jawade, Srirangaraj Setlur, Venu Govindaraju
Institution: University at Buffalo

[WACV-2023] Exploiting Visual Context Semantics for Sound Source Localization
Authors: Xinchi Zhou, Dongzhan Zhou, Di Hu, Hang Zhou, Wanli Ouyang
Institution: The University of Sydney; Renmin University of China; Baidu Inc.

[2023] Audio-Visual Segmentation with Semantics
Authors: Jinxing Zhou, Xuyang Shen, Jianyuan Wang, Jiayi Zhang, Weixuan Sun, Jing Zhang, Stan Birchfield, Dan Guo, Lingpeng Kong, Meng Wang, Yiran Zhong
Institution: Hefei University of Technology; SenseTime Research; University of Oxford; Australian National University; Beihang University; NVIDIA; The University of Hong Kong; Shanghai Artificial Intelligence Laboratory

[CVPR-2023] Learning Audio-Visual Source Localization via False Negative Aware Contrastive Learning
Authors: Weixuan Sun, Jiayi Zhang, Jianyuan Wang, Zheyuan Liu, Yiran Zhong, Tianpeng Feng, Yandong Guo, Yanhao Zhang, Nick Barnes
Institution: Australian National University; Beihang University; The University of Oxford; Shanghai AI Lab; OPPO Research Institute

[CVPR-2023] Egocentric Audio-Visual Object Localization
Authors: Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu
Institution: University of Rochester; Meta Reality Labs Research

[CVPR-2023] Audio-Visual Grouping Network for Sound Localization from Mixtures
Authors: Shentong Mo, Yapeng Tian
Institution: Carnegie Mellon University; University of Texas at Dallas

[ICASSP-2023] Flowgrad: Using Motion for Visual Sound Source Localization
Authors: Rajsuryan Singh, Pablo Zinemanas, Xavier Serra, Juan Pablo Bello, Magdalena Fuentes
Institution: Universitat Pompeu Fabra; New York University

[ACM MM-2023] Audio-Visual Segmentation by Exploring Cross-Modal Mutual Semantics
Authors: Chen Liu, Peike Li, Xingqun Qi, Hu Zhang, Lincheng Li, Dadong Wang, Xin Yu
Institution: University of Technology Sydney; The University of Queensland; Futureverse; The Hong Kong University of Science and Technology; CSIRO DATA61; Netease Fuxi AI Lab

[ACM MM-2023] Induction Network: Audio-Visual Modality Gap-Bridging for Self-Supervised Sound Source Localization
Authors: Tianyu Liu, Peng Zhang, Wei Huang, Yufei Zha, Tao You, Yanning Zhang
Institution: Northwestern Polytechnical University; Nanchang University

[ACM MM-2023] Audio-Visual Spatial Integration and Recursive Attention for Robust Sound Source Localization
Authors: Sung Jin Um, Dongjin Kim, Jung Uk Kim
Institution: Kyung Hee University

[IJCAI-2023] Discovering Sounding Objects by Audio Queries for Audio Visual Segmentation
Authors: Shaofei Huang, Han Li, Yuqing Wang, Hongji Zhu, Jiao Dai, Jizhong Han, Wenge Rong, Si Liu
Institution: Chinese Academy of Sciences; University of Chinese Academy of Sciences; Beihang University; Alibaba Group

[ICCV-2023] Sound Source Localization is All about Cross-Modal Alignment
Authors: Arda Senocak, Hyeonggon Ryu, Junsik Kim, Tae-Hyun Oh, Hanspeter Pfister, Joon Son Chung
Institution: Korea Advanced Institute of Science and Technology; Harvard University; Pohang University of Science and Technology; Yonsei University

[ICCV-2023] Multimodal Variational Auto-encoder based Audio-Visual Segmentation
Authors: Yuxin Mao, Jing Zhang, Mochu Xiang, Yiran Zhong, Yuchao Dai
Institution: Northwestern Polytechnical University; Shaanxi Key Laboratory of Information Acquisition and Processing; Australian National University; Shanghai AI Laboratory

[WACV-2024] Can CLIP Help Sound Source Localization?
Authors: Sooyoung Park, Arda Senocak, Joon Son Chung
Institution: Korea Advanced Institute of Science and Technology; Electronics and Telecommunications Research Institute

[NeurIPS-2023] Dual Mean-Teacher: An Unbiased Semi-Supervised Framework for Audio-Visual Source Localization
Authors: Yuxin Guo, Shijie Ma, Hu Su, Zhiqing Wang, Yuhao Zhao, Wei Zou, Siyang Sun, Yun Zheng
Institution: School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation of Chinese Academy of Sciences, Beijing, China; DAMO Academy, Alibaba Group

[ICASSP-2024] Cross Pseudo-Labeling for Semi-Supervised Audio-Visual Source Localization
Authors: Yuxin Guo, Shijie Ma, Yuhao Zhao, Hu Su, Wei Zou
Institution: School of Artificial Intelligence, University of Chinese Academy of Sciences; State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS); Institute of Automation of Chinese Academy of Sciences

[CVPR-2024] Audio-Visual Segmentation via Unlabeled Frame Exploitation
Authors: Jinxiang Liu, Yikun Liu, Fei Zhang, Chen Ju, Ya Zhang, Yanfeng Wang
Institution: Cooperative Medianet Innovation Center, Shanghai Jiao Tong University; Shanghai AI Laboratory

[ACM MM-2024] CACE-Net: Co-guidance Attention and Contrastive Enhancement for Effective Audio-Visual Event Localization
Authors: Xiang He, Xiangxi Liu, Yang Li, Dongcheng Zhao, Guobin Shen, Qingqun Kong, Xin Yang, Yi Zeng
Institution: Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences, Beijing, China; Center for Long-term Artificial Intelligence, Beijing, China; Key Laboratory of Brain Cognition and Brain-inspired Intelligence Technology, CAS, Shanghai, China; Institute of Automation, Chinese Academy of Sciences, Beijing, China

[ACM MM-2024] Open-Vocabulary Audio-Visual Semantic Segmentation
Authors: Ruohao Guo, Liao Qu, Dantong Niu, Yanyu Qi, Wenzhen Yue, Ji Shi, Bowei Xing, Xianghua Ying
Institution: National Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University, Beijing, China; Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA; Berkeley AI Research, University of California, Berkeley, Berkeley, CA, USA; College of Information and Electrical Engineering, China Agricultural University, Beijing, China

[ECCV-2024] Spherical World-Locking for Audio-Visual Localization in Egocentric Videos
Authors: Heeseung Yun, Ruohan Gao, Ishwarya Ananthabhotla, Anurag Kumar, Jacob Donley, Chao Li, Gunhee Kim, Vamsi Krishna Ithapu, Calvin Murdock
Institution: Seoul National University; Reality Labs Research at Meta

Audio-visual Saliency Detection

[2019] DAVE: A Deep Audio-Visual Embedding for Dynamic Saliency Prediction
Authors: Hamed R. Tavakoli, Ali Borji, Esa Rahtu, Juho Kannala
Institution: Aalto University; Tampere University

[CVPR-2020] STAViS: Spatio-Temporal AudioVisual Saliency Network
Authors: Antigoni Tsiami, Petros Koutras, Petros Maragos
Institution: National Technical University of Athens

[IEEE TIP-2020] A Multimodal Saliency Model for Videos With High Audio-visual Correspondence
Authors: Xiongkuo Min, Guangtao Zhai, Jiantao Zhou, Xiao-Ping Zhang, Xiaokang Yang, Xinping Guan
Institution: Shanghai Jiao Tong University; University of Macau; Ryerson University

[IROS-2021] ViNet: Pushing the limits of Visual Modality for Audio-Visual Saliency Prediction
Authors: Samyak Jain, Pradeep Yarlagadda, Shreyank Jyoti, Shyamgopal Karthik, Ramanathan Subramanian, Vineet Gandhi
Institution: International Institute for Information Technology; University of Canberra

[CVPR-2021] From Semantic Categories to Fixations: A Novel Weakly-Supervised Visual-Auditory Saliency Detection Approach
Authors: Guotao Wang, Chenglizhao Chen, Deng-Ping Fan, Aimin Hao, Hong Qin
Institution: Beihang University; Qingdao University; Chinese Academy of Medical Sciences

[ICME-2021] Lavs: A Lightweight Audio-Visual Saliency Prediction Model
Authors: Dandan Zhu, Defang Zhao, Xiongkuo Min, Tian Han, Qiangqiang Zhou, Shaobo Yu, Yongqing Chen, Guangtao Zhai, Xiaokang Yang
Institution: Shanghai Jiao Tong University; Tongji University; Stevens Institute of Technology; Jiangxi Normal University; East China Normal University; Hainan University

[2022] A Comprehensive Survey on Video Saliency Detection with Auditory Information: the Audio-visual Consistency Perceptual is the Key!
Authors: Chenglizhao Chen, Mengke Song, Wenfeng Song, Li Guo, Muwei Jian
Institution: China University of Petroleum; Shandong University of Finance and Economics; Beijing Information Science and Technology University

[TOMCCAP-2022] PAV-SOD: A New Task Towards Panoramic Audiovisual Saliency Detection
Authors: Yi Zhang, Fang-Yi Chao, Wassim Hamidouche, Olivier Deforges
Institution: University Rennes; Institut National des Sciences Appliquées Rennes; Centre national de la recherche scientifique; Trinity College Dublin

[CVPR-2023] CASP-Net: Rethinking Video Saliency Prediction from an Audio-Visual Consistency Perceptual Perspective
Authors: Junwen Xiong, Ganglai Wang, Peng Zhang, Wei Huang, Yufei Zha, Guangtao Zhai
Institution: Northwestern Polytechnical University; Ningbo Institute of Northwestern Polytechnical University; Nanchang University; Shanghai Jiao Tong University

[CVPR-2023] Self-Supervised Video Forensics by Audio-Visual Anomaly Detection
Authors: Chao Feng, Ziyang Chen, Andrew Owens
Institution: University of Michigan

[IJCNN-2023] 3DSEAVNet: 3D-Squeeze-and-Excitation Networks for Audio-Visual Saliency Prediction
Authors: Silong Liang, Chunxiao Li, Naying Cui, Minghui Sun, Hao Xue
Institution: Jilin University

[IEEE TMM-2023] SVGC-AVA: 360-Degree Video Saliency Prediction with Spherical Vector-Based Graph Convolution and Audio-Visual Attention
Authors: Qin Yang, Yuqi Li, Chenglin Li, Hao Wang, Sa Yan, Li Wei, Wenrui Dai, Junni Zou, Hongkai Xiong, Pascal Frossard
Institution: Shanghai Jiao Tong University; École Polytechnique Fédérale de Lausanne

[IEEE TMM-2023] Unified Audio-Visual Saliency Model for Omnidirectional Videos With Spatial Audio
Authors: Dandan Zhu, Kaiwei Zhang, Nana Zhang, Qiangqiang Zhou, Xiongkuo Min, Guangtao Zhai, Xiaokang Yang
Institution: Institute of AI Education, Shanghai, East China Normal University; Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University; School of Computer Science and Technology, Donghua University; School of Software, Jiangxi Normal University; MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University

[CVPR-2024] DiffSal: Joint Audio and Video Learning for Diffusion Saliency Prediction
Authors: Junwen Xiong, Peng Zhang, Tao You, Chuanyue Li, Wei Huang, Yufei Zha
Institution: Northwestern Polytechnical University; Ningbo Institute of Northwestern Polytechnical University; Nanchang University

Audio-visual Navigation

[ECCV-2020] SoundSpaces: Audio-Visual Navigation in 3D Environments
Authors: Changan Chen, Unnat Jain, Carl Schissler, Sebastia Vicenc Amengual Gari, Ziad Al-Halah, Vamsi Krishna Ithapu, Philip Robinson, Kristen Grauman
Institution: The University of Texas at Austin; University of Illinois at Urbana-Champaign; Facebook Reality Labs; Facebook AI Research

[ICRA-2020] Look, Listen, and Act: Towards Audio-Visual Embodied Navigation
Authors: Chuang Gan, Yiwei Zhang, Jiajun Wu, Boqing Gong, Joshua B. Tenenbaum
Institution: MIT-IBM Watson AI Lab; Tsinghua University; Massachusetts Institute of Technology; Google Inc.

[ICLR-2021] Learning to Set Waypoints for Audio-Visual Navigation
Authors: Changan Chen, Sagnik Majumder, Ziad Al-Halah, Ruohan Gao, Santhosh Kumar Ramakrishnan, Kristen Grauman
Institution: The University of Texas at Austin; Facebook AI Research

[CVPR-2021] Semantic Audio-Visual Navigation
Authors: Changan Chen, Ziad Al-Halah, Kristen Grauman
Institution: The University of Texas at Austin; Facebook AI Research

[ICCV-2021] Move2Hear: Active Audio-Visual Source Separation
Authors: Sagnik Majumder, Ziad Al-Halah, Kristen Grauman
Institution: The University of Texas at Austin; Facebook AI Research

[2022] Sound Adversarial Audio-Visual Navigation
Authors: Yinfeng Yu, Wenbing Huang, Fuchun Sun, Changan Chen, Yikai Wang, Xiaohong Liu
Institution: Tsinghua University; Xinjiang University; The University of Texas at Austin; JD Explore Academy

[CVPR-2022] Towards Generalisable Audio Representations for Audio-Visual Navigation
Authors: Shunqi Mao, Chaoyi Zhang, Heng Wang, Weidong Cai
Institution: University of Sydney

[NeurIPS-2022] SoundSpaces 2.0: A Simulation Platform for Visual-Acoustic Learning
Authors: Changan Chen, Carl Schissler, Sanchit Garg, Philip Kobernik, Alexander Clegg, Paul
Institution: The University of Texas at Austin; Reality Labs at Meta; Georgia Tech; Meta AI

[NeurIPS-2022] AVLEN: Audio-Visual-Language Embodied Navigation in 3D Environments
Authors: Sudipta Paul, Amit K. Roy-Chowdhury, Anoop Cherian
Institution: University of California; Mitsubishi Electric Research Labs, Cambridge

[BMVC-2022] Pay Self-Attention to Audio-Visual Navigation
Authors: Yinfeng Yu, Lele Cao, Fuchun Sun, Xiaohong Liu, Liejun Wang
Institution: Tsinghua University; Motherbrain, EQT; Xinjiang University

[CVPR-2022] Finding Fallen Objects Via Asynchronous Audio-Visual Integration
Authors: Chuang Gan, Yi Gu, Siyuan Zhou, Jeremy Schwartz, Seth Alter, James Traer, Dan Gutfreund, Joshua B. Tenenbaum, Josh McDermott, Antonio Torralba
Institution: Massachusetts Institute of Technology; MIT-IBM Watson AI Lab

[CVPR-2022] ObjectFolder 2.0: A Multisensory Object Dataset for Sim2Real Transfer
Authors: Ruohan Gao, Zilin Si, Yen-Yu Chang, Samuel Clarke, Jeannette Bohg, Li Fei-Fei, Wenzhen Yuan, Jiajun Wu
Institution: Stanford University; Carnegie Mellon University

[IEEE RAL-2023] Catch Me If You Hear Me: Audio-Visual Navigation in Complex Unmapped Environments with Moving Sounds
Authors: Abdelrahman Younes, Daniel Honerkamp, Tim Welschehold, Abhinav Valada
Institution: University of Freiburg

[2023] Audio Visual Language Maps for Robot Navigation
Authors: Chenguang Huang, Oier Mees, Andy Zeng, Wolfram Burgard
Institution: University of Freiburg; Google Research; University of Technology Nuremberg

[ICCV-2023] Omnidirectional Information Gathering for Knowledge Transfer-based Audio-Visual Navigation
Authors: Jinyu Chen, Wenguan Wang, Si Liu, Hongsheng Li, Yi Yang
Institution: Beihang University; Zhejiang University; The Chinese University of Hong Kong

[IROS-2024] Audio-Visual Traffic Light State Detection for Urban Robots
Authors: Sagar Gupta, Akansel Cosgun
Institution: Deakin University, Australia

Audio-visual Event Localization and Parsing

Localization

[ECCV-2018] Audio-visual Event Localization in Unconstrained Videos
Authors: Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, Chenliang Xu
Institution: University of Rochester

[ICASSP-2019] Dual-modality Seq2Seq Network for Audio-visual Event Localization
Authors: Yan-Bo Lin, Yu-Jhe Li, Yu-Chiang Frank Wang
Institution: National Taiwan University

[ICCV-2019] Dual Attention Matching for Audio-Visual Event Localization
Authors: Yu Wu, Linchao Zhu, Yan Yan, Yi Yang
Institution: Baidu Research; University of Technology Sydney; Texas State University

[AAAI-2020] Cross-Modal Attention Network for Temporal Inconsistent Audio-Visual Event Localization
Authors: Hanyu Xuan, Zhenyu Zhang, Shuo Chen, Jian Yang, Yan Yan
Institution: Nanjing University of Science and Technology

[ACCV-2020] Audiovisual Transformer with Instance Attention for Audio-Visual Event Localization
Authors: Yan-Bo Lin, Yu-Chiang Frank Wang
Institution: National Taiwan University; ASUS Intelligent Cloud Services

[WACV-2021] Audio-Visual Event Localization via Recursive Fusion by Joint Co-Attention
Authors: Bin Duan, Hao Tang, Wei Wang, Ziliang Zong, Guowei Yang, Yan Yan
Institution: Illinois Institute of Technology; University of Trento; Texas State University

[CVPR-2021] Positive Sample Propagation along the Audio-Visual Event Line
Authors: Jinxing Zhou, Liang Zheng, Yiran Zhong, Shijie Hao, Meng Wang
Institution: Hefei University of Technology; Intelligent Interconnected Systems Laboratory of Anhui Province; Australian National University

[AIKE-2021] Audio-Visual Event Localization based on Cross-Modal Interacting Guidance
Authors: Qiurui Yue, Xiaoyu Wu, Jiayi Gao
Institution: Communication University of China

[IEEE TMM-2021] Audio-Visual Event Localization by Learning Spatial and Semantic Co-attention
Authors: Cheng Xue, Xionghu Zhong, Minjie Cai, Hao Chen, Wenwu Wang
Institution: Hunan University; United Kingdom of Great Britain and Northern Ireland

[CVPR-2022] Cross-Modal Background Suppression for Audio-Visual Event Localization
Authors: Yan Xia, Zhou Zhao
Institution: Zhejiang University

[ICASSP-2022] Bi-Directional Modality Fusion Network For Audio-Visual Event Localization
Authors: Shuo Liu, Weize Quan, Yuan Liu, Dong-Ming Yan
Institution: Chinese Academy of Sciences; Alibaba Group

[ICSIP-2022] Audio-Visual Event and Sound Source Localization Based on Spatial-Channel Feature Fusion
Authors: Xiaolong Zheng, Ying Wei
Institution: Shandong University

[IJCNN-2022] Look longer to see better: Audio-visual event localization by exploiting long-term correlation
Authors: Longyin Guo, Qijun Zhao, Hongmei Gao
Institution: Sichuan University; Tibet University

[EUSIPCO-2022] Audio Visual Graph Attention Networks for Event Detection in Sports Video
Authors: Taichi Ishiwatari, Makiko Azuma, Takuya Handa, Masaki Takahashi, Takahiro Mochizuki, Masanori Sano
Institution: Science and Technology Research Laboratories, NHK; Tokyo Institute of Technology

[IEEE TPAMI-2022] Contrastive Positive Sample Propagation along the Audio-Visual Event Line
Authors: Jinxing Zhou, Dan Guo, Meng Wang
Institution: Hefei University of Technology

[IEEE TPAMI-2022] Semantic and Relation Modulation for Audio-Visual Event Localization
Authors: Hao Wang, Zheng-Jun Zha, Liang Li, Xuejin Chen, Jiebo Luo
Institution: University of Science and Technology of China; Chinese Academy of Sciences; University of Rochester

[WACV-2023] AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization
Authors: Tanvir Mahmud, Diana Marculescu
Institution: The University of Texas at Austin

[WACV-2023] Event-Specific Audio-Visual Fusion Layers: A Simple and New Perspective on Video Understanding
Authors: Arda Senocak, Junsik Kim, Tae-Hyun Oh, Dingzeyu Li, In So Kweon
Institution: Korea Advanced Institute of Science and Technology; Harvard University; Pohang University of Science and Technology; Adobe Research

[ICASSP-2023] A dataset for Audio-Visual Sound Event Detection in Movies
Authors: Rajat Hebbar, Digbalay Bose, Krishna Somandepalli, Veena Vijai, Shrikanth Narayanan
Institution: University of Southern California

[CVPR-2023] Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline
Authors: Tiantian Geng, Teng Wang, Jinming Duan, Runmin Cong, Feng Zheng
Institution: Southern University of Science and Technology; University of Birmingham; The University of Hong Kong; Shandong University; Peng Cheng Laboratory

[CVPR-2023] Collaborative Noisy Label Cleaner: Learning Scene-aware Trailers for Multi-modal Highlight Detection in Movies
Authors: Bei Gan, Xiujun Shu, Ruizhi Qiao, Haoqian Wu, Keyu Chen, Hanjun Li, Bo Ren
Institution: Tencent YouTu Lab

[ICASSP-2023] Collaborative Audio-Visual Event Localization Based on Sequential Decision and Cross-Modal Consistency
Authors: Yuqian Kuang, Xiaopeng Fan
Institution: Harbin Institute of Technology; PengCheng Lab

[CVPR-2023] Collecting Cross-Modal Presence-Absence Evidence for Weakly-Supervised Audio-Visual Event Perception
Authors: Junyu Gao, Mengyuan Chen, Changsheng Xu
Institution: Chinese Academy of Sciences; University of Chinese Academy of Sciences; Peng Cheng Laboratory

[IJCNN-2023] Specialty may be better: A decoupling multi-modal fusion network for Audio-visual event localization
Authors: Jinqiao Dou, Xi Chen, Yuehai Wang
Institution: Zhejiang University

[AAAI-2023] Furnishing Sound Event Detection with Language Model Abilities
Authors: Hualei Wang, Jianguo Mao, Zhifang Guo, Jiarui Wan, Hong Liu, Xiangdong Wang
Institution: Chinese Academy of Sciences; University of Chinese Academy of Sciences; Beijing Jiaotong University

[IEEE TMM-2023] Leveraging the Video-level Semantic Consistency of Event for Audio-visual Event Localization
Authors: Yuanyuan Jiang, Jianqin Yin, Yonghao Dang
Institution: Beijing University of Posts and Telecommunications

[CVPR-2024] T-VSL: Text-Guided Visual Sound Source Localization in Mixtures
Authors: Tanvir Mahmud, Yapeng Tian, Diana Marculescu
Institution: University of Texas at Austin

[ICME-2024] Exploring Audio-Visual Information Fusion for Sound Event Localization and Detection In Low-Resource Realistic Scenarios
Authors: Ya Jiang, Qing Wang, Jun Du, Maocheng Hu, Pengfei Hu, Zeyan Liu, Shi Cheng, Zhaoxu Nian, Yuxuan Dong, Mingqi Cai, Xin Fang, Chin-Hui Lee
Institution: University of Science and Technology of China; iFlytek Research; Georgia Institute of Technology

[ECCV-2024] Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes
Authors: Yaoting Wang, Peiwen Sun, Dongzhan Zhou, Guangyao Li, Honggang Zhang, Di Hu
Institution: Gaoling School of Artificial Intelligence, Renmin University of China, China; Beijing University of Posts and Telecommunications, Beijing, China; Shanghai Artificial Intelligence Laboratory, Shanghai, China; Engineering Research Center of Next-Generation Search and Recommendation

[ECCV-2024] Stepping Stones: A Progressive Training Strategy for Audio-Visual Semantic Segmentation
Authors: Juncheng Ma, Peiwen Sun, Yaoting Wang, Di Hu
Institution: University of Chinese Academy of Sciences; Beijing University of Posts and Telecommunications; Gaoling School of Artificial Intelligence, Renmin University of China, China; Engineering Research Center of Next-Generation Search and Recommendation

[ACM MM-2024] Unveiling and Mitigating Bias in Audio Visual Segmentation
Authors: Peiwen Sun, Honggang Zhang, Di Hu
Institution: Beijing University of Posts and Telecommunications, Beijing, China; Renmin University of China, Beijing, China

[ICASSP-2025] A Critical Assessment of Visual Sound Source Localization Models Including Negative Audio
Authors: Xavier Juanola, Gloria Haro, Magdalena Fuentes
Institution: Universitat Pompeu Fabra, Barcelona, Spain; MARL-IDM, New York University, New York, USA

Parsing

[ECCV-2020] Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing
Authors: Yapeng Tian, Dingzeyu Li, Chenliang Xu
Institution: University of Rochester; Adobe Research

[CVPR-2021] Exploring Heterogeneous Clues for Weakly-Supervised Audio-Visual Video Parsing
Authors: Yu Wu, Yi Yang
Institution: Baidu Research; University of Technology Sydney

[NeurIPS-2021] Exploring Cross-Video and Cross-Modality Signals for Weakly-Supervised Audio-Visual Video Parsing
Authors: Yan-Bo Lin, Hung-Yu Tseng, Hsin-Ying Lee, Yen-Yu Lin, Ming-Hsuan Yang
Institution: National Yang Ming Chiao Tung University; UNC Chapel Hill; University of California, Merced; Snap Research; Google Research; Yonsei University

[2022] Investigating Modality Bias in Audio Visual Video Parsing
Authors: Piyush Singh Pasi, Shubham Nemani, Preethi Jyothi, Ganesh Ramakrishnan
Institution: Indian Institute of Technology

[ICASSP-2022] Distributed Audio-Visual Parsing Based On Multimodal Transformer and Deep Joint Source Channel Coding
Authors: Penghong Wang, Jiahui Li, Mengyao Ma, Xiaopeng Fan
Institution: Harbin Institute of Technology; Wireless Technology Lab

[ECCV-2022] Joint-Modal Label Denoising for Weakly-Supervised Audio-Visual Video Parsing
Authors: Haoyue Cheng, Zhaoyang Liu, Hang Zhou, Chen Qian, Wayne Wu, Limin Wang
Institution: Nanjing University; SenseTime Research; The Chinese University of Hong Kong; Shanghai AI Laboratory

[NeurIPS-2022] Multi-modal Grouping Network for Weakly-Supervised Audio-Visual Video Parsing
Authors: Shentong Mo, Yapeng Tian
Institution: Carnegie Mellon University; University of Texas at Dallas

[2023] Improving Audio-Visual Video Parsing with Pseudo Visual Labels
Authors: Jinxing Zhou, Dan Guo, Yiran Zhong, Meng Wang
Institution: Hefei University of Technology; Shanghai AI Lab

[ICASSP-2023] CM-CS: Cross-Modal Common-Specific Feature Learning For Audio-Visual Video Parsing
Authors: Hongbo Chen, Dongchen Zhu, Guanghui Zhang, Wenjun Shi, Xiaolin Zhang, Jiamao Li
Institution: Chinese Academy of Sciences; ShanghaiTech University; University of Chinese Academy of Sciences

[2023] Towards Long Form Audio-visual Video Understanding
Authors: Wenxuan Hou, Guangyao Li, Yapeng Tian, Di Hu
Institution: Renmin University of China; The University of Texas at Dallas

[CVPR-2023] Collecting Cross-Modal Presence-Absence Evidence for Weakly-Supervised Audio-Visual Event Perception
Authors: Junyu Gao, Mengyuan Chen, Changsheng Xu
Institution: Chinese Academy of Sciences; University of Chinese Academy of Sciences; Peng Cheng Laboratory

[ACM MM-2023] TMac: Temporal Multi-Modal Graph Learning for Acoustic Event Classification
Authors: Meng Liu, Ke Liang, Dayu Hu, Hao Yu, Yue Liu, Lingyuan Meng, Wenxuan Tu, Sihang Zhou, Xinwang Liu
Institution: National University of Defense Technology

[WACV-2024] Rethink Cross-Modal Fusion in Weakly-Supervised Audio-Visual Video Parsing
Authors: Yating Xu, Conghui Hu, Gim Hee Lee
Institution: National University of Singapore

[ECCV-2024] Label-anticipated Event Disentanglement for Audio-Visual Video Parsing
Authors: Jinxing Zhou, Dan Guo, Yuxin Mao, Yiran Zhong, Xiaojun Chang, Meng Wang
Institution: Hefei University of Technology; Anhui Zhonghuitong Technology Co., Ltd.; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center; Northwestern Polytechnical University; Shanghai AI Laboratory; University of Science and Technology of China; MBZUAI

Audio-visual Question Answering and Dialog

Question Answering

[ICCV-2021] Pano-AVQA: Grounded Audio-Visual Question Answering on 360° Videos
Authors: Heeseung Yun, Youngjae Yu, Wonsuk Yang, Kangil Lee, Gunhee Kim
Institution: Seoul National University; Allen Institute for AI; University of Oxford; Hyundai Motor Company

[CVPR-2022] Learning To Answer Questions in Dynamic Audio-Visual Scenarios
Authors: Guangyao Li, Yake Wei, Yapeng Tian, Chenliang Xu, Ji-Rong Wen, Di Hu
Institution: Renmin University of China; Beijing Key Laboratory of Big Data Management and Analysis Methods; University of Rochester

[NeurIPS-2022] Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners
Authors: Zhenhailong Wang, Manling Li, Ruochen Xu, Luowei Zhou, Jie Lei, Xudong Lin, Shuohang Wang, Ziyi Yang, Chenguang Zhu, Derek Hoiem, Shih-Fu Chang, Mohit Bansal, Heng Ji
Institution: University of Illinois at Urbana-Champaign; MSR; The University of North Carolina at Chapel Hill; Columbia University

[ACM MM-2023] Progressive Spatio-temporal Perception for Audio-Visual Question Answering
Authors: Guangyao Li, Wenxuan Hou, Di Hu
Institution: Renmin University of China

[WACV-2024] CAD – Contextual Multi-modal Alignment for Dynamic AVQA
Authors: Asmar Nadeem, Adrian Hilton, Robert Dawes, Graham Thomas, Armin Mustafa
Institution: University of Surrey; BBC Research and Development

[AAAI-2024] Object-aware Adaptive-Positivity Learning for Audio-Visual Question Answering
Authors: Zhangbin Li, Dan Guo, Jinxing Zhou, Jing Zhang, Meng Wang
Institution: School of Computer Science and Information Engineering, Hefei University of Technology; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center

[Interspeech-2024] Towards Multilingual Audio-Visual Question Answering
Authors: Orchid Chetia Phukan, Priyabrata Mallick, Swarup Ranjan Behera, Aalekhya Satya Narayani, Arun Balaji Buduru, Rajesh Sharma
Institution: IIIT-Delhi, India; Reliance Jio AICoE, Hyderabad, India; University of Tartu, Estonia

[ECCV-2024] Learning Trimodal Relation for Audio-Visual Question Answering with Missing Modality
Authors: Kyu Ri Park, Hong Joo Lee, Jung Uk Kim
Institution: Kyung Hee University, Yong-in, South Korea; Technical University of Munich, Munich, Germany; Munich Center for Machine Learning (MCML), Munich, Germany

[ACM MM-2024] Boosting Audio Visual Question Answering via Key Semantic-Aware Cues
Authors: Guangyao Li, Henghui Du, Di Hu
Institution: GSAI, Renmin University of China, Beijing, China

Dialog

[CVPR-2019] Audio Visual Scene-Aware Dialog
Authors: Huda Alamri, Vincent Cartillier, Abhishek Das, Jue Wang, Anoop Cherian, Irfan Essa, Dhruv Batra, Tim K. Marks, Chiori Hori, Peter Anderson, Stefan Lee, Devi Parikh
Institution: Georgia Institute of Technology; Mitsubishi Electric Research Laboratories

[Interspeech-2019] Joint Student-Teacher Learning for Audio-Visual Scene-Aware Dialog
Authors: Chiori Hori, Anoop Cherian, Tim K. Marks, Takaaki Hori
Institution: Mitsubishi Electric Research Laboratories, Inc.

[ICASSP-2019] End-to-end Audio Visual Scene-aware Dialog Using Multimodal Attention-based Video Features
Authors: Chiori Hori, Huda Alamri, Jue Wang, Gordon Wichern, Takaaki Hori, Anoop Cherian, Tim K. Marks, Vincent Cartillier, Raphael Gontijo Lopes, Abhishek Das, Irfan Essa, Dhruv Batra, Devi Parikh
Institution: Mitsubishi Electric Research Laboratories; Georgia Institute of Technology

[CVPR-2019] A Simple Baseline for Audio-Visual Scene-Aware Dialog
Authors: Idan Schwartz, Alexander G. Schwing, Tamir Hazan
Institution: Technion; University of Illinois at Urbana-Champaign

[CVPR-2019] Exploring Context, Attention and Audio Features for Audio Visual Scene-Aware Dialog
Authors: Shachi H Kumar, Eda Okur, Saurav Sahay, Jonathan Huang, Lama Nachman
Institution: Anticipatory Computing Lab, Intel Labs

[2020] TMT: A Transformer-based Modal Translator for Improving Multimodal Sequence Representations in Audio Visual Scene-aware Dialog
Authors: Wubo Li, Dongwei Jiang, Wei Zou, Xiangang Li
Institution: Didi Chuxing

[AAAI-2021] Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers
Authors: Shijie Geng, Peng Gao, Moitreya Chatterjee, Chiori Hori, Jonathan Le Roux, Yongfeng Zhang, Hongsheng Li, Anoop Cherian
Institution: Rutgers University; The Chinese University of Hong Kong; University of Illinois at Urbana Champaign; Mitsubishi Electric Research Laboratories

[2021] VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs
Authors: Xudong Lin, Gedas Bertasius, Jue Wang, Shih-Fu Chang, Devi Parikh, Lorenzo Torresani
Institution: Columbia University; Facebook AI; Georgia Tech; Dartmouth

[ICASSP-2022] Audio-Visual Scene-Aware Dialog and Reasoning Using Audio-Visual Transformers with Joint Student-Teacher Learning
Authors: Ankit Shah, Shijie Geng, Peng Gao, Anoop Cherian, Takaaki Hori, Tim K. Marks, Jonathan Le Roux, Chiori Hori
Institution: Mitsubishi Electric Research Laboratories; Carnegie Mellon University; Rutgers University; The Chinese University of Hong Kong

[WACV-2022] QUALIFIER: Question-Guided Self-Attentive Multimodal Fusion Network for Audio Visual Scene-Aware Dialog
Authors: Muchao Ye, Quanzeng You, Fenglong Ma
Institution: Pennsylvania State University, University Park; Microsoft Azure Computer Vision

[TACL-2022] Learning English with Peppa Pig
Authors: Mitja Nikolaus, Afra Alishahi, Grzegorz Chrupała
Institution: Aix-Marseille University; Tilburg University

[2022] End-to-End Multimodal Representation Learning for Video Dialog
Authors: Huda Alamri, Anthony Bilic, Michael Hu, Apoorva Beedu, Irfan Essa
Institution: Georgia Institute of Technology

[AAAI-2022] Audio Visual Scene-Aware Dialog Generation with Transformer-based Video Representations
Authors: Yoshihiro Yamazaki, Shota Orihashi, Ryo Masumura, Mihiro Uchida, Akihiko Takashima
Institution: Nippon Telegraph and Telephone Corporation

[IEEE/ACM TASLP-2023] DialogMCF: Multimodal Context Flow for Audio Visual Scene-Aware Dialog
Authors: Zhe Chen, Hongcheng Liu, Yu Wang
Institution: Cooperative Medianet Innovation Center, Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory

Datasets

| Dataset | Year | Videos | Length | Data form | Video source | Task |
|---|---|---|---|---|---|---|
| LRW, LRS2 and LRS3 | 2016, 2018, 2018 | - | 800h+ | video | in the wild | Speech-related, speaker-related, face generation-related tasks |
| VoxCeleb, VoxCeleb2 | 2017, 2018 | - | 2,000h+ | video | YouTube | Speech-related, speaker-related, face generation-related tasks |
| AVA-ActiveSpeaker | 2019 | - | 38.5h | video | YouTube | Speech-related task, speaker-related task |
| Kinetics-400 | 2017 | 306,245 | 850h+ | video | YouTube | Action recognition |
| EPIC-KITCHENS | 2018 | 39,594 | 55h | video | Recorded videos | Action recognition |
| CMU-MOSI | 2016 | 2,199 | 2h+ | video | YouTube | Emotion recognition |
| CMU-MOSEI | 2018 | 23,453 | 65h+ | video | YouTube | Emotion recognition |
| VGGSound | 2020 | 200k+ | 550h+ | video | YouTube | Action recognition, sound localization |
| AudioSet | 2017 | 2M+ | 5,800h+ | video | YouTube | Action recognition, sound separation |
| Greatest Hits | 2016 | 977 | 9h+ | video | Recorded videos | Sound generation |
| MUSIC | 2018 | 714 | 23h+ | video | YouTube | Sound separation, sound localization |
| FAIR-Play | 2019 | 1,871 | 5.2h | video with binaural sound | Recorded videos | Spatial sound generation |
| YT-ALL | 2018 | 1,146 | 113.1h | 360 video | YouTube | Spatial sound generation |
| Replica | 2019 | - | - | 3D environment | 3D simulator | Depth estimation |
| AIST++ | 2021 | - | 5.2h | 3D video | Recorded videos | Dance generation |
| TED | 2019 | - | 52h | video | TED talks | Gesture generation |
| SumMe | 2014 | 25 | 1h+ | video with eye-tracking | User videos | Saliency detection |
| AVE | 2018 | 4,143 | 11h+ | video | YouTube | Event localization |
| LLP | 2020 | 11,849 | 32.9h | video | YouTube | Event parsing |
| SoundSpaces | 2020 | - | - | 3D environment | 3D simulator | Audio-visual navigation |
| AVSD | 2019 | 11,816 | 98h+ | video with dialog | Crowd-sourced | Audio-visual dialog |
| Pano-AVQA | 2021 | 5.4k | 7.7h | 360 video with QA | Video-sharing platforms | Audio-visual question answering |
| MUSIC-AVQA | 2022 | 9,288 | 150h+ | video with QA | YouTube | Audio-visual question answering |
| AVSBench | 2022 | 5,356 | 14.8h+ | video | YouTube | Audio-visual segmentation, sound localization |
| RAF | 2024 | - | 95h+ | 3D environment | Recorded videos | Spatial sound generation |
| SPD | 2024 | - | 3.0h | Multi-view video | Recorded videos | Action recognition |
| VoxBlink2 | 2024 | 2,097,062 | 16,672h | video | YouTube | Speaker identification |