Overview
This is a curated list of audio-visual learning methods and datasets, based on our survey: "Learning in Audio-visual Context: A Review, Analysis, and New Perspective". The list is continually updated; please feel free to nominate related works via pull requests!
[Website of Our Survey], [arXiv]
Table of contents
- Overview
- Table of contents
- Audio-visual Boosting
- Cross-modal Perception
Audio-visual Boosting
Audio-visual Recognition
Speech Recognition
[Applied Intelligence-2015]
Audio-visual Speech Recognition Using Deep Learning
Authors: Kuniaki Noda, Yuki Yamaguchi, Kazuhiro Nakadai, Hiroshi G. Okuno, Tetsuya Ogata
Institution: Waseda University; Kyoto University; Honda Research Institute Japan Co., Ltd.
[CVPR-2016]
Temporal Multimodal Learning in Audiovisual Speech Recognition
Authors: Di Hu, Xuelong Li, Xiaoqiang Lu
Institution: Northwestern Polytechnical University; Chinese Academy of Sciences
[AVSP-2017]
End-To-End Audiovisual Fusion With LSTMs
Authors: Stavros Petridis, Yujiang Wang, Zuwei Li, Maja Pantic
Institution: Imperial College London; University of Twente
[IEEE TPAMI-2018]
Deep Audio-visual Speech Recognition
Authors: Triantafyllos Afouras, Joon Son Chung, Andrew Senior, Oriol Vinyals, Andrew Zisserman
Institution: University of Oxford; Google Inc.
[2019]
Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection
Authors: Guangxiang Zhao, Junyang Lin, Zhiyuan Zhang, Xuancheng Ren, Qi Su, Xu Sun
Institution: Peking University
[IEEE TNNLS-2022]
Multimodal Sparse Transformer Network for Audio-visual Speech Recognition
Authors: Qiya Song, Bin Sun, Shutao Li
Institution: Hunan University
[Interspeech-2022]
Robust Self-Supervised Audio-visual Speech Recognition
Authors: Bowen Shi, Wei-Ning Hsu, Abdelrahman Mohamed
Institution: Toyota Technological Institute at Chicago; Meta AI
[2022]
Bayesian Neural Network Language Modeling for Speech Recognition
Authors: Boyang Xue, Shoukang Hu, Junhao Xu, Mengzhe Geng, Xunying Liu, Helen Meng
Institution: The Chinese University of Hong Kong
[Interspeech-2022]
Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition
Authors: Joanna Hong, Minsu Kim, Daehun Yoo, Yong Man Ro
Institution: Korea Advanced Institute of Science and Technology; Genesis Lab Inc.
[MLSP-2022]
Rethinking Audio-visual Synchronization for Active Speaker Detection
Authors: Abudukelimu Wuerkaixi, You Zhang, Zhiyao Duan, Changshui Zhang
Institution: Tsinghua University; Beijing National Research Center for Information Science and Technology; University of Rochester
[NeurIPS-2022]
A Single Self-Supervised Model for Many Speech Modalities Enables Zero-Shot Modality Transfer
Authors: Wei-Ning Hsu, Bowen Shi
Institution: Toyota Technological Institute at Chicago
[ITOEC-2022]
FSMS: An Enhanced Polynomial Sampling Fusion Method for Audio-Visual Speech Recognition
Authors: Chenghan Li, Yuxin Zhang, Huaichang Du
Institution: Communication University of China
[IJCNN-2022]
Continuous Phoneme Recognition based on Audio-Visual Modality Fusion
Authors: Julius Richter, Jeanine Liebold, Timo Gerkmann
Institution: Universität Hamburg
[ICIP-2022]
Learning Contextually Fused Audio-Visual Representations For Audio-Visual Speech Recognition
Authors: Zi-Qiang Zhang, Jie Zhang, Jian-Shu Zhang, Ming-Hui Wu, Xin Fang, Li-Rong Dai
Institution: University of Science and Technology of China; Chinese Academy of Sciences; iFLYTEK Co., Ltd.
[ICASSP-2023]
Self-Supervised Audio-Visual Speech Representations Learning By Multimodal Self-Distillation
Authors: Jing-Xuan Zhang, Genshun Wan, Zhen-Hua Ling, Jia Pan, Jianqing Gao, Cong Liu
Institution: University of Science and Technology of China; iFLYTEK Co. Ltd.
[CVPR-2022]
Improving Multimodal Speech Recognition by Data Augmentation and Speech Representations
Authors: Dan Oneaţă, Horia Cucu
Institution: University POLITEHNICA of Bucharest
[AAAI-2022]
Distinguishing Homophenes Using Multi-Head Visual-Audio Memory for Lip Reading
Authors: Minsu Kim, Jeong Hun Yeo, Yong Man Ro
Institution: Korea Advanced Institute of Science and Technology
[AAAI-2023]
Leveraging Modality-specific Representations for Audio-visual Speech Recognition via Reinforcement Learning
Authors: Chen Chen, Yuchen Hu, Qiang Zhang, Heqing Zou, Beier Zhu, Eng Siong Chng
Institution: Nanyang Technological University; ZJU-Hangzhou Global Scientific and Technological Innovation Center; Zhejiang University
[WACV-2023]
Audio-Visual Efficient Conformer for Robust Speech Recognition
Authors: Maxime Burchi, Radu Timofte
Institution: University of Würzburg
[2023]
Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition
Authors: Minsu Kim, Hyung-Il Kim, Yong Man Ro
Institution: Korea Advanced Institute of Science and Technology; Electronics and Telecommunications Research Institute
[2023]
Multimodal Speech Recognition for Language-Guided Embodied Agents
Authors: Allen Chang, Xiaoyuan Zhu, Aarav Monga, Seoho Ahn, Tejas Srinivasan, Jesse Thomason
Institution: University of Southern California
[2023]
MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation
Authors: Mohamed Anwar, Bowen Shi, Vedanuj Goswami, Wei-Ning Hsu, Juan Pino, Changhan Wang
Institution: Meta AI
[ICASSP-2023]
The NPU-ASLP System for Audio-Visual Speech Recognition in MISP 2022 Challenge
Authors: Pengcheng Guo, He Wang, Bingshen Mu, Ao Zhang, Peikun Chen
Institution: Northwestern Polytechnical University
[CVPR-2023]
Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring
Authors: Joanna Hong, Minsu Kim, Jeongsoo Choi, Yong Man Ro
Institution: Korea Advanced Institute of Science and Technology
[ICASSP-2023]
Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels
Authors: Pingchuan Ma, Alexandros Haliassos, Adriana Fernandez-Lopez, Honglie Chen, Stavros Petridis, Maja Pantic
Institution: Imperial College London; Meta AI
[CVPR-2023]
AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR
Authors: Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid
Institution: Google Research
[CVPR-2023]
SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision
Authors: Xubo Liu, Egor Lakomkin, Konstantinos Vougioukas, Pingchuan Ma, Honglie Chen, Ruiming Xie, Morrie Doulaty, Niko Moritz, Jáchym Kolář, Stavros Petridis, Maja Pantic, Christian Fuegen
Institution: University of Surrey; Meta AI
[ICASSP-2023]
Multi-Temporal Lip-Audio Memory for Visual Speech Recognition
Authors: Jeong Hun Yeo, Minsu Kim, Yong Man Ro
Institution: Korea Advanced Institute of Science and Technology
[ICASSP-2023]
On the Role of Lip Articulation in Visual Speech Perception
Authors: Zakaria Aldeneh, Masha Fedzechkina, Skyler Seto, Katherine Metcalf, Miguel Sarabia, Nicholas Apostoloff, Barry-John Theobald
Institution: Apple Inc.
[ICASSP-2023]
Practice of the Conformer Enhanced Audio-Visual Hubert on Mandarin and English
Authors: Xiaoming Ren, Chao Li, Shenjian Wang, Biao Li
Institution: Beijing OPPO Telecommunications Corp., Ltd.
[ICASSP-2023]
Robust Audio-Visual ASR with Unified Cross-Modal Attention
Authors: Jiahong Li, Chenda Li, Yifei Wu, Yanmin Qian
Institution: Shanghai Jiao Tong University
[IJCAI-2023]
Cross-Modal Global Interaction and Local Alignment for Audio-Visual Speech Recognition
Authors: Yuchen Hu, Ruizhe Li, Chen Chen, Heqing Zou, Qiushi Zhu, Eng Siong Chng
Institution: Nanyang Technological University; University of Aberdeen; University of Science and Technology of China
[Interspeech-2023]
Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization
Authors: Puyuan Peng, Brian Yan, Shinji Watanabe, David Harwath
Institution: The University of Texas at Austin; Carnegie Mellon University
[Interspeech-2023]
Improving the Gap in Visual Speech Recognition Between Normal and Silent Speech Based on Metric Learning
Authors: Sara Kashiwagi, Keitaro Tanaka, Qi Feng, Shigeo Morishima
Institution: Waseda University; Waseda Research Institute for Science and Engineering
[ACL-2023]
AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation
Authors: Rongjie Huang, Huadai Liu, Xize Cheng, Yi Ren, Linjun Li, Zhenhui Ye, Jinzheng He, Lichao Zhang, Jinglin Liu, Xiang Yin, Zhou Zhao
Institution: Zhejiang University; ByteDance
[ACL-2023]
Hearing Lips in Noise: Universal Viseme-Phoneme Mapping and Transfer for Robust Audio-Visual Speech Recognition
Authors: Yuchen Hu, Ruizhe Li, Chen Chen, Chengwei Qin, Qiushi Zhu, Eng Siong Chng
Institution: Nanyang Technological University; University of Aberdeen; University of Science and Technology of China
[ACL-2023]
MIR-GAN: Refining Frame-Level Modality-Invariant Representations with Adversarial Network for Audio-Visual Speech Recognition
Authors: Yuchen Hu, Chen Chen, Ruizhe Li, Heqing Zou, Eng Siong Chng
Institution: Nanyang Technological University; University of Aberdeen
[IJCNN-2023]
Exploiting Deep Learning for Sentence-Level Lipreading
Authors: Isabella Wu, Xin Wang
Institution: Choate Rosemary Hall; Stony Brook University
[IJCNN-2023]
GLSI Texture Descriptor Based on Complex Networks for Music Genre Classification
Authors: Andrés Eduardo Coca Salazar
Institution: Federal University of Technology - Paraná
[ICME-2023]
Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder
Authors: Yusheng Dai, Hang Chen, Jun Du, Xiaofei Ding, Ning Ding, Feijun Jiang, Chin-Hui Lee
Institution: University of Science and Technology of China; Alibaba Group; Georgia Institute of Technology
[ICME-2023]
Multi-Scale Hybrid Fusion Network for Mandarin Audio-Visual Speech Recognition
Authors: Jinxin Wang, Zhongwen Guo, Chao Yang, Xiaomei Li, Ziyuan Cui
Institution: Ocean University of China; University of Technology Sydney
[AAAI-2024]
Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation
Authors: Qiushi Zhu, Jie Zhang, Yu Gu, Yuchen Hu, Lirong Dai
Institution: NERC-SLIP, University of Science and Technology of China (USTC), Hefei, China; Tencent AI Lab; Nanyang Technological University, Singapore
[CVPR-2024]
A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition
Authors: Yusheng Dai, Hang Chen, Jun Du, Ruoyu Wang, Shihao Chen, Jiefeng Ma, Haotian Wang, Chin-Hui Lee
Institution: University of Science and Technology of China, Hefei, China; Georgia Institute of Technology, Atlanta, USA
[IJCNN-2024]
Target Speech Extraction with Pre-trained AV-HuBERT and Mask-And-Recover Strategy
Authors: Wenxuan Wu, Xueyuan Chen, Xixin Wu, Haizhou Li, Helen Meng
Institution: The Chinese University of Hong Kong
[Interspeech-2024]
LipGER: Visually-Conditioned Generative Error Correction for Robust Automatic Speech Recognition
Authors: Sreyan Ghosh, Sonal Kumar, Ashish Seth, Purva Chiniya, Utkarsh Tyagi, Ramani Duraiswami, Dinesh Manocha
Institution: University of Maryland, College Park, USA
[Interspeech-2024]
Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation
Authors: Andrew Rouditchenko, Yuan Gong, Samuel Thomas, Leonid Karlinsky, Hilde Kuehne, Rogerio Feris, James Glass
Institution: MIT, USA; IBM Research AI, USA; MIT-IBM Watson AI Lab, USA; University of Bonn, Germany
Speaker Recognition
[MTA-2016]
Audio-visual Speaker Diarization Using Fisher Linear Semi-discriminant Analysis
Authors: Nikolaos Sarafianos, Theodoros Giannakopoulos, Sergios Petridis
Institution: National Center for Scientific Research “Demokritos”
[ICASSP-2018]
Audio-visual Person Recognition in Multimedia Data From the Iarpa Janus Program
Authors: Gregory Sell, Kevin Duh, David Snyder, Dave Etter, Daniel Garcia-Romero
Institution: The Johns Hopkins University
[ICASSP-2019]
Noise-tolerant Audio-visual Online Person Verification Using an Attention-based Neural Network Fusion
Authors: Suwon Shon, Tae-Hyun Oh, James Glass
Institution: MIT Computer Science and Artificial Intelligence Laboratory, Cambridge
[Interspeech-2019]
Who Said That?: Audio-visual Speaker Diarisation Of Real-World Meetings
Authors: Joon Son Chung, Bong-Jin Lee, Icksang Han
Institution: Naver Corporation
[ICASSP-2020]
Self-Supervised Learning for Audio-visual Speaker Diarization
Authors: Yifan Ding, Yong Xu, Shi-Xiong Zhang, Yahuan Cong, Liqiang Wang
Institution: University of Central Florida; Tencent AI Lab; Beijing University of Posts and Telecommunications
[ICASSP-2021]
A Multi-View Approach to Audio-visual Speaker Verification
Authors: Leda Sari, Kritika Singh, Jiatong Zhou, Lorenzo Torresani, Nayan Singhal, Yatharth Saraf
Institution: University of Illinois at Urbana-Champaign; Facebook AI Research
[IEEE/ACM TASLP-2021]
Audio-visual Deep Neural Network for Robust Person Verification
Authors: Yanmin Qian, Zhengyang Chen, Shuai Wang
Institution: Shanghai Jiao Tong University
[ICDIP-2022]
End-To-End Audiovisual Feature Fusion for Active Speaker Detection
Authors: Fiseha B. Tesema, Zheyuan Lin, Shiqiang Zhu, Wei Song, Jason Gu, Hong Wu
Institution: Interdisciplinary Innovation Research Institute, Zhejiang Lab; Dalhousie University; University of Electronic Science and Technology of China; Zhejiang University
[EUVIP-2022]
Active Speaker Recognition using Cross Attention Audio-Video Fusion
Authors: Bogdan Mocanu, Ruxandra Tapu
Institution: University “Politehnica” of Bucharest; Télécom SudParis
[2022]
Audio-Visual Activity Guided Cross-Modal Identity Association for Active Speaker Detection
Authors: Rahul Sharma, Shrikanth Narayanan
Institution: University of Southern California
[SLT-2023]
Push-Pull: Characterizing the Adversarial Robustness for Audio-Visual Active Speaker Detection
Authors: Xuanjun Chen, Haibin Wu, Helen Meng, Hung-yi Lee, Jyh-Shing Roger Jang
Institution: National Taiwan University; The Chinese University of Hong Kong
[ICAI-2023]
Speaker Recognition in Realistic Scenario Using Multimodal Data
Authors: Saqlain Hussain Shah, Muhammad Saad Saeed, Shah Nawaz, Muhammad Haroon Yousaf
Institution: University of Engineering and Technology Taxila; Swarm Robotics Lab NCRA; Deutsches Elektronen-Synchrotron DESY
[CVPR-2023]
A Light Weight Model for Active Speaker Detection
Authors: Junhua Liao, Haihan Duan, Kanghui Feng, Wanbing Zhao, Yanbing Yang, Liangyin Chen
Institution: Sichuan University; The Chinese University of Hong Kong
[ICASSP-2023]
The Multimodal Information based Speech Processing (MISP) 2022 Challenge: Audio-Visual Diarization and Recognition
Authors: Zhe Wang, Shilong Wu, Hang Chen, Mao-Kui He, Jun Du, Chin-Hui Lee, Jingdong Chen, Shinji Watanabe, Sabato Siniscalchi, Odette Scharenborg, Diyuan Liu, Baocai Yin, Jia Pan, Jianqing Gao, Cong Liu
Institution: University of Science and Technology of China; Georgia Institute of Technology; Carnegie Mellon University; Kore University of Enna; iFlytek; Northwestern Polytechnical University; Delft University of Technology
[ICASSP-2023]
ImagineNet: Target Speaker Extraction with Intermittent Visual Cue Through Embedding Inpainting
Authors: Zexu Pan, Wupeng Wang, Marvin Borsdorf, Haizhou Li
Institution: National University of Singapore; University of Bremen; The Chinese University of Hong Kong
[ICASSP-2023]
Speaker Recognition with Two-Step Multi-Modal Deep Cleansing
Authors: Ruijie Tao, Kong Aik Lee, Zhan Shi, Haizhou Li
Institution: National University of Singapore; A*STAR; The Chinese University of Hong Kong; University of Bremen; Shenzhen Research Institute of Big Data
[ICASSP-2023]
Audio-Visual Speaker Diarization in the Framework of Multi-User Human-Robot Interaction
Authors: Timothée Dhaussy, Bassam Jabaian, Fabrice Lefèvre, Radu Horaud
Institution: Avignon University; Université Grenoble Alpes
[ICASSP-2023]
Cross-Modal Audio-Visual Co-Learning for Text-Independent Speaker Verification
Authors: Meng Liu, Kong Aik Lee, Longbiao Wang, Hanyi Zhang, Chang Zeng, Jianwu Dang
Institution: Tianjin University; A*STAR; Singapore Institute of Technology; National Institute of Informatics
[ICASSP-2023]
Multi-Speaker End-to-End Multi-Modal Speaker Diarization System for the MISP 2022 Challenge
Authors: Tao Liu, Zhengyang Chen, Yanmin Qian, Kai Yu
Institution: Shanghai Jiao Tong University
[ICASSP-2023]
AV-SepFormer: Cross-Attention SepFormer for Audio-Visual Target Speaker Extraction
Authors: Jiuxin Lin, Xinyu Cai, Heinrich Dinkel, Jun Chen, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Zhiyong Wu, Yujun Wang, Helen Meng
Institution: Tsinghua University; Xiaomi Inc.; The Chinese University of Hong Kong
[ICASSP-2023]
The WHU-Alibaba Audio-Visual Speaker Diarization System for the MISP 2022 Challenge
Authors: Ming Cheng, Haoxu Wang, Ziteng Wang, Qiang Fu, Ming Li
Institution: Wuhan University; Duke Kunshan University; Alibaba Group
[ICASSP-2023]
Self-Supervised Audio-Visual Speaker Representation with Co-Meta Learning
Authors: Hui Chen, Hanyi Zhang, Longbiao Wang, Kong Aik Lee, Meng Liu, Jianwu Dang
Institution: Tianjin University; A*STAR
[Interspeech-2023]
Target Active Speaker Detection with Audio-visual Cues
Authors: Yidi Jiang, Ruijie Tao, Zexu Pan, Haizhou Li
Institution: National University of Singapore; The Chinese University of Hong Kong
[Interspeech-2023]
CN-Celeb-AV: A Multi-Genre Audio-Visual Dataset for Person Recognition
Authors: Lantian Li, Xiaolou Li, Haoyu Jiang, Chen Chen, Ruihai Hou, Dong Wang
Institution: Tsinghua University; Beijing University of Posts and Telecommunications
[Interspeech-2023]
Rethinking the visual cues in audio-visual speaker extraction
Authors: Junjie Li, Meng Ge, Zexu Pan, Rui Cao, Longbiao Wang, Jianwu Dang, Shiliang Zhang
Institution: Tianjin University; National University of Singapore; Shenzhen Research Institute of Big Data
[ACL-2023]
OpenSR: Open-Modality Speech Recognition via Maintaining Multi-Modality Alignment
Authors: Xize Cheng, Tao Jin, Linjun Li, Wang Lin, Xinyu Duan, Zhou Zhao
Institution: Zhejiang University; Huawei Cloud
[Interspeech-2023]
PIAVE: A Pose-Invariant Audio-Visual Speaker Extraction Network
Authors: Qinghua Liu, Meng Ge, Zhizheng Wu, Haizhou Li
Institution: Shenzhen Research Institute of Big Data; The Chinese University of Hong Kong; National University of Singapore
[IEEE/ACM TASLP-2023]
A Dynamic Convolution Framework for Session-Independent Speaker Embedding Learning
Authors: Bin Gu, Jie Zhang, Wu Guo
Institution: University of Science and Technology of China
[IEEE/ACM TASLP-2024]
Self-Supervised Learning With Cluster-Aware-DINO for High-Performance Robust Speaker Verification
Authors: Bing Han, Zhengyang Chen, Yanmin Qian
Institution: Shanghai Jiao Tong University
[ICASSP-2024]
Look, Listen and Recognise: Character-Aware Audio-Visual Subtitling
Authors: Bruno Korbar, Jaesung Huh, Andrew Zisserman
Institution: Visual Geometry Group, Department of Engineering Science, University of Oxford, UK
[FG-2024]
Dynamic Cross Attention for Audio-Visual Person Verification
Authors: R. Gnana Praveen, Jahangir Alam
Institution: Computer Research Institute of Montreal (CRIM), Montreal, Canada
Action Recognition
[IJCNN-2016]
Exploring Multimodal Video Representation For Action Recognition
Authors: Cheng Wang, Haojin Yang, Christoph Meinel
Institution: University of Potsdam
[CVPR-2018]
The ActivityNet Large-Scale Activity Recognition Challenge 2018 Summary
Authors: Bernard Ghanem, Juan Carlos Niebles, Cees Snoek, Fabian Caba Heilbron, Humam Alwassel, Victor Escorcia, Ranjay Krishna, Shyamal Buch, Cuong Duc Dao
Institution: King Abdullah University of Science and Technology; Stanford University; Universidad del Norte; Universiteit van Amsterdam
[ICCV-2019]
EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition
Authors: Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, Dima Damen
Institution: University of Bristol; University of Oxford
[ICCV-2019]
SCSampler: Sampling Salient Clips From Video for Efficient Action Recognition
Authors: Bruno Korbar, Du Tran, Lorenzo Torresani
Institution: Facebook AI Research
[ICCV-2019]
Uncertainty-Aware Audiovisual Activity Recognition Using Deep Bayesian Variational Inference
Authors: Mahesh Subedar, Ranganath Krishnan, Paulo Lopez Meyer, Omesh Tickoo, Jonathan Huang
Institution: Intel Labs
[CVPR-2020]
Listen to Look: Action Recognition by Previewing Audio
Authors: Ruohan Gao, Tae-Hyun Oh, Kristen Grauman, Lorenzo Torresani
Institution: The University of Texas at Austin; Facebook AI Research
[2020]
Audiovisual SlowFast Networks for Video Recognition
Authors: Fanyi Xiao, Yong Jae Lee, Kristen Grauman, Jitendra Malik, Christoph Feichtenhofer
Institution: University of California, Davis; Facebook AI Research
[ICCV-2021]
AdaMML: Adaptive Multi-Modal Learning for Efficient Video Recognition
Authors: Rameswar Panda, Chun-Fu (Richard) Chen, Quanfu Fan, Ximeng Sun, Kate Saenko, Aude Oliva, Rogerio Feris
Institution: MIT-IBM Watson AI Lab; Boston University; Massachusetts Institute of Technology
[2021]
Cross-Domain First Person Audio-Visual Action Recognition through Relative Norm Alignment
Authors: Mirco Planamente, Chiara Plizzari, Emanuele Alberti, Barbara Caputo
Institution: Politecnico di Torino; Istituto Italiano di Tecnologia
[WACV-2022]
Domain Generalization Through Audio-Visual Relative Norm Alignment in First Person Action Recognition
Authors: Mirco Planamente, Chiara Plizzari, Emanuele Alberti, Barbara Caputo
Institution: Politecnico di Torino; Istituto Italiano di Tecnologia; CINI Consortium
[CVPR-2022]
Audio-Adaptive Activity Recognition Across Video Domains
Authors: Yunhua Zhang, Hazel Doughty, Ling Shao, Cees G. M. Snoek
Institution: University of Amsterdam; Inception Institute of Artificial Intelligence
[WACV-2022]
MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition
Authors: Jiawei Chen, Chiu Man Ho
Institution: OPPO US Research Center
[CVPR-2022]
Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos
Authors: Saghir Alfasly, Jian Lu, Chen Xu, Yuru Zou
Institution: Shenzhen University; Guangdong Key Laboratory of Intelligent Information Processing; Pazhou Lab
[2022]
Noise-Tolerant Learning for Audio-Visual Action Recognition
Authors: Haochen Han, Qinghua Zheng, Minnan Luo, Kaiyao Miao, Feng Tian, Yan Chen
Institution: Xi’an Jiaotong University; Shanxi Provincial Key Laboratory of Institute of Multimedia Knowledge Fusion and Engineering; Ministry of Education Key Laboratory for Intelligent Networks and Network Security
[ICLR-2023]
Exploring Temporally Dynamic Data Augmentation for Video Recognition
Authors: Taeoh Kim, Jinhyung Kim, Minho Shim, Sangdoo Yun, Myunggu Kang, Dongyoon Wee, Sangyoun Lee
Institution: NAVER Clova; Korea Advanced Institute of Science and Technology; NAVER AI Lab; Yonsei University
[ICASSP-2023]
Epic-Sounds: A Large-scale Dataset of Actions That Sound
Authors: Jaesung Huh, Jacob Chalk, Evangelos Kazakos, Dima Damen, Andrew Zisserman
Institution: University of Oxford; University of Bristol
[ICASSP-2023]
AV-TAD: Audio-Visual Temporal Action Detection With Transformer
Authors: Yangcheng Li, Zefang Yu, Suncheng Xiang, Ting Liu, Yuzhuo Fu
Institution: Shanghai Jiao Tong University
[ICCV-2023]
Audio-Visual Glance Network for Efficient Video Recognition
Authors: Muhammad Adi Nugroho, Sangmin Woo, Sumin Lee, Changick Kim
Institution: Korea Advanced Institute of Science and Technology
[IEEE TMM-2023]
Audio-Visual Contrastive and Consistency Learning for Semi-Supervised Action Recognition
Authors: Maregu Assefa, Wei Jiang, Jinyu Zhan, Kumie Gedamu, Getinet Yilma, Melese Ayalew, Deepak Adhikari
Institution: University of Electronic Science and Technology of China; Sichuan Artificial Intelligence Research Institute; Adama Science and Technology University
[CVPR-2024]
TIM: A Time Interval Machine for Audio-Visual Action Recognition
Authors: Jacob Chalk, Jaesung Huh, Evangelos Kazakos, Andrew Zisserman, Dima Damen
Institution: University of Bristol; VGG, University of Oxford; Czech Technical University in Prague
Emotion Recognition
[EMNLP-2017]
Tensor Fusion Network for Multimodal Sentiment Analysis
Authors: Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, Louis-Philippe Morency
Institution: Carnegie Mellon University; Nanyang Technological University
[AAAI-2018]
Multi-attention Recurrent Network for Human Communication Comprehension
Authors: Amir Zadeh, Paul Pu Liang, Soujanya Poria, Prateek Vij, Erik Cambria, Louis-Philippe Morency
Institution: Carnegie Mellon University; Nanyang Technological University
[AAAI-2018]
Memory Fusion Network for Multi-view Sequential Learning
Authors: Amir Zadeh, Paul Pu Liang, Navonil Mazumder, Soujanya Poria, Erik Cambria, Louis-Philippe Morency
Institution: Carnegie Mellon University; Instituto Politécnico Nacional; Nanyang Technological University
[NAACL-2018]
Conversational Memory Network for Emotion Recognition in Dyadic Dialogue Videos
Authors: Devamanyu Hazarika, Soujanya Poria, Amir Zadeh, Erik Cambria, Louis-Philippe Morency, Roger Zimmermann
Institution: National University of Singapore
[EMNLP-2018]
Contextual Inter-modal Attention for Multi-modal Sentiment Analysis
Authors: Deepanway Ghosal, Md Shad Akhtar, Dushyant Chauhan, Soujanya Poria, Asif Ekbal, Pushpak Bhattacharyya
Institution: Indian Institute of Technology Patna; Nanyang Technological University
[ACL-2019]
Multi-Modal Sarcasm Detection in Twitter with Hierarchical Fusion Model
Authors: Yitao Cai, Huiyu Cai, Xiaojun Wan
Institution: Peking University
[ACL-2020]
Sentiment and Emotion help Sarcasm? A Multi-task Learning Framework for Multi-Modal Sarcasm, Sentiment and Emotion Analysis
Authors: Dushyant Singh Chauhan, Dhanush S R, Asif Ekbal, Pushpak Bhattacharyya
Institution: Indian Institute of Technology Patna
[ACL-2020]
A Transformer-based joint-encoding for Emotion Recognition and Sentiment Analysis
Authors: Jean-Benoit Delbrouck, Noé Tits, Mathilde Brousmiche, Stéphane Dupont
Institution: University of Mons
[ACL-2020]
Multilogue-Net: A Context Aware RNN for Multi-modal Emotion Detection and Sentiment Analysis in Conversation
Authors: Aman Shenoy, Ashish Sardana
Institution: Birla Institute of Technology and Science, Pilani; NVIDIA Graphics
[CVPR-2021]
Progressive Modality Reinforcement for Human Multimodal Emotion Recognition From Unaligned Multimodal Sequences
Authors: Fengmao Lv, Xiang Chen, Yanyong Huang, Lixin Duan, Guosheng Lin
Institution: Southwest Jiaotong University; Southwestern University of Finance and Economics; Tencent; University of Electronic Science and Technology of China; Nanyang Technological University
[IEEE TAFFC-2021]
Multi-modal Sarcasm Detection and Humor Classification in Code-mixed Conversations
Authors: Manjot Bedi, Shivani Kumar, Md Shad Akhtar, Tanmoy Chakraborty
Institution: Indraprastha Institute of Information Technology, Delhi
[IEEE SLT-2021]
Detecting expressions with multimodal transformers
Authors: Srinivas Parthasarathy, Shiva Sundaram
Institution: Amazon
[CVPR-2022]
M2FNet: Multi-modal Fusion Network for Emotion Recognition in Conversation
Authors: Vishal Chudasama, Purbayan Kar, Ashish Gudmalwar, Nirmesh Shah, Pankaj Wasnik, Naoyuki Onoe
Institution: Sony Research India
[CCC-2022]
A Multimodal Emotion Perception Model based on Context-Aware Decision-Level Fusion
Authors: Yishan Chen, Zhiyang Jia, Kaoru Hirota, Yaping Dai
Institution: Beijing Institute of Technology; State Key Laboratory of Intelligent Control and Decision of Complex Systems
[IJCNN-2022]
Sense-aware BERT and Multi-task Fine-tuning for Multimodal Sentiment Analysis
Authors: Lingyong Fang, Gongshen Liu, Ru Zhang
Institution: Shanghai Jiao Tong University; Beijing University of Posts and Telecommunications
[IEEE/ACM TASLP-2022]
EmoInt-Trans: A Multimodal Transformer for Identifying Emotions and Intents in Social Conversations
Authors: Gopendra Vikram Singh, Mauajama Firdaus, Asif Ekbal, Pushpak Bhattacharyya
Institution: Indian Institute of Technology
[ICPR-2022]
Self-attention fusion for audiovisual emotion recognition with incomplete data
Authors: Kateryna Chumachenko, Alexandros Iosifidis, Moncef Gabbouj
Institution: Tampere University; Aarhus University
[IEEE TAFFC-2023]
Audio-Visual Emotion Recognition With Preference Learning Based on Intended and Multi-Modal Perceived Labels
Authors: Yuanyuan Lei, Houwei Cao
Institution: Texas A&M University; New York Institute of Technology
[IEEE T-BIOM-2023]
Audio-Visual Fusion for Emotion Recognition in the Valence-Arousal Space Using Joint Cross-Attention
Authors: R Gnana Praveen, Patrick Cardinal, Eric Granger
Institution: École de technologie supérieure
[ICASSP-2023]
Adapted Multimodal Bert with Layer-Wise Fusion for Sentiment Analysis
Authors: Odysseas S. Chlapanis, Georgios Paraskevopoulos, Alexandros Potamianos
Institution: National Technical University of Athens; Institute for Language and Speech Processing
[ICASSP-2023]
Recursive Joint Attention for Audio-Visual Fusion in Regression Based Emotion Recognition
Authors: R Gnana Praveen, Eric Granger, Patrick Cardinal
Institution: École de Technologie supérieure
[IEEE/ACM TASLP-2023]
Exploring Semantic Relations for Social Media Sentiment Analysis
Authors: Jiandian Zeng, Jiantao Zhou, Caishi Huang
Institution: Beijing Normal University; University of Macau
[CVPR-2023]
Weakly Supervised Video Emotion Detection and Prediction via Cross-Modal Temporal Erasing Network
Authors: Zhicheng Zhang, Lijuan Wang, Jufeng Yang
Institution: Nankai University
[ACM MM-2023]
Hierarchical Audio-Visual Information Fusion with Multi-label Joint Decoding for MER 2023
Authors: Haotian Wang, Yuxuan Xi, Hang Chen, Jun Du, Yan Song, Qing Wang, Hengshun Zhou, Chenxi Wang, Jiefeng Ma, Pengfei Hu, Ya Jiang, Shi Cheng, Jie Zhang, Yuzhe Weng
Institution: University of Science and Technology of China; Northwestern Polytechnical University
[2024]
HiCMAE: Hierarchical Contrastive Masked Autoencoder for Self-Supervised Audio-Visual Emotion Recognition
Authors: Licai Sun, Zheng Lian, Bin Liu, Jianhua Tao
Institution: School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; Institute of Automation, Chinese Academy of Sciences, Beijing, China; Department of Automation, Tsinghua University, Beijing, China; Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing, China
[IJCAI-2024]
HyDiscGAN: A Hybrid Distributed cGAN for Audio-Visual Privacy Preservation in Multimodal Sentiment Analysis
Authors: Zhuojia Wu, Qi Zhang, Duoqian Miao, Kun Yi, Wei Fan, Liang Hu
Institution: Tongji University; Beijing Institute of Technology; University of Oxford; DeepBlue Academy of Sciences
[ICPR-2024]
Detail-Enhanced Intra- and Inter-modal Interaction for Audio-Visual Emotion Recognition
Authors: Tong Shi, Xuri Ge, Joemon M. Jose, Nicolas Pugeault, Paul Henderson
Institution: School of Computing Science, University of Glasgow
[Interspeech-2024]
AVR: Synergizing Foundation Models for Audio-Visual Humor Detection
Authors: Sarthak Sharma, Orchid Chetia Phukan, Drishti Singh, Arun Balaji Buduru, Rajesh Sharma
Institution: IIIT-Delhi, India; University of Tartu, Estonia
Uni-modal Enhancement
Speech Enhancement and Separation
[Interspeech-2018]
Visual Speech Enhancement
Authors: Aviv Gabbay, Asaph Shamir, Shmuel Peleg
Institution: The Hebrew University of Jerusalem
[Interspeech-2018]
The Conversation: Deep Audio-Visual Speech Enhancement
Authors: Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman
Institution: University of Oxford
[IEEE TETCI-2018]
Audio-Visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks
Authors: Jen-Cheng Hou, Syu-Siang Wang, Ying-Hui Lai, Yu Tsao, Hsiu-Wen Chang, Hsin-Min Wang
Institution: Research Center for Information Technology Innovation; National Taiwan University; National Yang-Ming University; Mackay Medical College; Academia Sinica
[ICASSP-2018]
Seeing Through Noise: Visually Driven Speaker Separation And Enhancement
Authors: Aviv Gabbay, Ariel Ephrat, Tavi Halperin, Shmuel Peleg
Institution: The Hebrew University of Jerusalem
[GlobalSIP-2019]
Visually Assisted Time-Domain Speech Enhancement
Authors: Elham Ideli, Bruce Sharpe, Ivan V. Bajić, Rodney G. Vaughan
Institution: Simon Fraser University; SingSoftNext
[ICASSP-2019]
On Training Targets and Objective Functions for Deep-learning-based Audio-visual Speech Enhancement
Authors: Daniel Michelsanti, Zheng-Hua Tan, Sigurdur Sigurdsson, Jesper Jensen
Institution: Aalborg University; Oticon A/S
[Interspeech-2019]
Multimodal SpeakerBeam: Single Channel Target Speech Extraction with Audio-Visual Speaker Clues
Authors: Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Atsunori Ogawa, Tomohiro Nakatani
Institution: Nippon Telegraph & Telephone Corporation
[Interspeech-2019]
My Lips Are Concealed: Audio-Visual Speech Enhancement Through Obstructions
Authors: Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman
Institution: University of Oxford; Naver Corporation
[2020]
Facefilter: Audio-Visual Speech Separation Using Still Images
Authors: Soo-Whan Chung, Soyeon Choe, Joon Son Chung, Hong-Goo Kang
Institution: Yonsei University; Naver Corporation
[ICASSP-2020]
Robust Unsupervised Audio-Visual Speech Enhancement Using a Mixture of Variational Autoencoders
Authors: Mostafa Sadeghi, Xavier Alameda-Pineda
Institution: Inria Grenoble Rhône-Alpes
[CVPR-2021]
Looking Into Your Speech: Learning Cross-Modal Affinity for Audio-Visual Speech Separation
Authors: Jiyoung Lee, Soo-Whan Chung, Sunok Kim, Hong-Goo Kang, Kwanghoon Sohn
Institution: Yonsei University; Naver Corporation; Korea Aerospace University
[ISCAS-2021]
Audio-Visual Target Speaker Enhancement on Multi-Talker Environment using Event-Driven Cameras
Authors: Ander Arriandiaga, Giovanni Morrone, Luca Pasa, Leonardo Badino, Chiara Bartolozzi
Institution: Istituto Italiano di Tecnologia; University of Modena and Reggio Emilia
[ICASSP-2022]
The Impact of Removing Head Movements on Audio-Visual Speech Enhancement
Authors: Zhiqi Kang, Mostafa Sadeghi, Radu Horaud, Xavier Alameda-Pineda, Jacob Donley, Anurag Kumar
Institution: Inria Grenoble; Université Grenoble Alpes; Inria Nancy Grand-Est; Reality Labs Research
[2022]
Dual-path Attention is All You Need for Audio-Visual Speech Extraction
Authors: Zhongweiyang Xu, Xulin Fan, Mark Hasegawa-Johnson
Institution: University of Illinois at Urbana-Champaign
[ICASSP-2022]
Audio-visual multi-channel speech separation, dereverberation and recognition
Authors: Guinan Li, Jianwei Yu, Jiajun Deng, Xunying Liu, Helen Meng
Institution: The Chinese University of Hong Kong; Tencent AI lab
[2022]
Audio-visual speech separation based on joint feature representation with cross-modal attention
Authors: Junwen Xiong, Peng Zhang, Lei Xie, Wei Huang, Yufei Zha, Yanning Zhang
Institution: Northwestern Polytechnical University; Nanchang University
[CVPR-2022]
Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis
Authors: Karren Yang, Dejan Marković, Steven Krenn, Vasu Agrawal, Alexander Richard
Institution: Massachusetts Institute of Technology; Meta Reality Labs Research
[IEEE MMSP-2022]
As We Speak: Real-Time Visually Guided Speaker Separation and Localization
Authors: Piotr Czarnecki, Jakub Tkaczuk
Institution: Warsaw University of Technology
[IEEE HEALTHCOM-2022]
A Novel Frame Structure for Cloud-Based Audio-Visual Speech Enhancement in Multimodal Hearing-aids
Authors: Abhijeet Bishnu, Ankit Gupta, Mandar Gogate, Kia Dashtipour, Ahsan Adeel, Amir Hussain, Mathini Sellathurai, Tharmalingam Ratnarajah
Institution: University of Edinburgh; Heriot-Watt University; Edinburgh Napier University; University of Wolverhampton
[CVPR-2022]
Reading to Listen at the Cocktail Party: Multi-Modal Speech Separation
Authors: Akam Rahimi, Triantafyllos Afouras, Andrew Zisserman
Institution: University of Oxford
[WACV-2023]
BirdSoundsDenoising: Deep Visual Audio Denoising for Bird Sounds
Authors: Youshan Zhang, Jialu Li
Institution: Yeshiva University; Cornell University
[SLT-2023]
AVSE Challenge: Audio-Visual Speech Enhancement Challenge
Authors: Andrea Lorena Aldana Blanco, Cassia Valentini-Botinhao, Ondrej Klejch, Mandar Gogate, Kia Dashtipour, Amir Hussain, Peter Bell
Institution: University of Edinburgh
[ICLR-2023]
Filter-Recovery Network for Multi-Speaker Audio-Visual Speech Separation
Authors: Haoyue Cheng, Zhaoyang Liu, Wayne Wu, Limin Wang
Institution: Nanjing University; SenseTime
[WACV-2023]
Unsupervised Audio-Visual Lecture Segmentation
Authors: Darshan Singh S, Anchit Gupta, C. V. Jawahar, Makarand Tapaswi
Institution: International Institute of Information Technology, Hyderabad
[ISCSLP-2022]
Multi-Task Joint Learning for Embedding Aware Audio-Visual Speech Enhancement
Authors: Chenxi Wang, Hang Chen, Jun Du, Baocai Yin, Jia Pan
Institution: University of Science and Technology of China; iFlytek
[ICASSP-2023]
Real-Time Audio-Visual End-to-End Speech Enhancement
Authors: Zirun Zhu, Hemin Yang, Min Tang, Ziyi Yang, Sefik Emre Eskimez, Huaming Wang
Institution: Microsoft
[ICASSP-2023]
Efficient Intelligibility Evaluation Using Keyword Spotting: A Study on Audio-Visual Speech Enhancement
Authors: Cassia Valentini-Botinhao, Andrea Lorena Aldana Blanco, Ondrej Klejch, Peter Bell
Institution: University of Edinburgh
[ICASSP-2023]
Incorporating Visual Information Reconstruction into Progressive Learning for Optimizing Audio-Visual Speech Enhancement
Authors: Chenyue Zhang, Hang Chen, Jun Du, Baocai Yin, Jia Pan, Chin-Hui Lee
Institution: University of Science and Technology of China; iFlytek Co., Ltd.; Georgia Institute of Technology
[ICASSP-2023]
Audio-Visual Speech Enhancement with a Deep Kalman Filter Generative Model
Authors: Ali Golmakani, Mostafa Sadeghi, Romain Serizel
Institution: Université de Lorraine
[ICASSP-2023]
A Multi-Scale Feature Aggregation Based Lightweight Network for Audio-Visual Speech Enhancement
Authors: Haitao Xu, Liangfa Wei, Jie Zhang, Jianming Yang, Yannan Wang, Tian Gao, Xin Fang, Lirong Dai
Institution: University of Science and Technology of China; Ethereal Audio Lab; Tsinghua Shenzhen International Graduate School
[ICASSP-2023]
Egocentric Audio-Visual Noise Suppression
Authors: Roshan Sharma, Weipeng He, Ju Lin, Egor Lakomkin, Yang Liu, Kaustubh Kalgaonkar
Institution: Carnegie Mellon University; Meta
[ICASSP-2023]
Dual-Path Cross-Modal Attention for Better Audio-Visual Speech Extraction
Authors: Zhongweiyang Xu, Xulin Fan, Mark Hasegawa-Johnson
Institution: University of Illinois at Urbana-Champaign
[ICASSP-2023]
On the Role of Visual Context in Enriching Music Representations
Authors: Kleanthis Avramidis, Shanti Stewart, Shrikanth Narayanan
Institution: University of Southern California
[ICASSP-2023]
LA-VocE: Low-SNR Audio-Visual Speech Enhancement Using Neural Vocoders
Authors: Rodrigo Mira, Buye Xu, Jacob Donley, Anurag Kumar, Stavros Petridis, Vamsi Krishna Ithapu, Maja Pantic
Institution: Imperial College London; Meta
[ICASSP-2023]
Learning Audio-Visual Dereverberation
Authors: Changan Chen, Wei Sun, David Harwath, Kristen Grauman
Institution: The University of Texas at Austin; Meta AI
[Interspeech-2023]
Incorporating Ultrasound Tongue Images for Audio-Visual Speech Enhancement through Knowledge Distillation
Authors: Rui-Chen Zheng, Yang Ai, Zhen-Hua Ling
Institution: University of Science and Technology of China
[Interspeech-2023]
Audio-Visual Speech Separation in Noisy Environments with a Lightweight Iterative Model
Authors: Héctor Martel, Julius Richter, Kai Li, Xiaolin Hu, Timo Gerkmann
Institution: Tsinghua University; Universität Hamburg; Chinese Institute for Brain Research
[ITG-2023]
Audio-Visual Speech Enhancement with Score-Based Generative Models
Authors: Julius Richter, Simone Frintrop, Timo Gerkmann
Institution: Universität Hamburg
[Interspeech-2023]
Speech inpainting: Context-based speech synthesis guided by video
Authors: Juan F. Montesinos, Daniel Michelsanti, Gloria Haro, Zheng-Hua Tan, Jesper Jensen
Institution: Universitat Pompeu Fabra; Aalborg University; Oticon A/S
[EUSIPCO-2023]
Audio-Visual Speech Enhancement With Selective Off-Screen Speech Extraction
Authors: Tomoya Yoshinaga, Keitaro Tanaka, Shigeo Morishima
Institution: Waseda University; Waseda Research Institute for Science and Engineering
[IEEE/ACM TASLP-2023]
Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition
Authors: Guinan Li, Jiajun Deng, Mengzhe Geng, Zengrui Jin, Tianzi Wang, Shujie Hu, Mingyu Cui, Helen Meng, Xunying Liu
Institution: The Chinese University of Hong Kong
[ICCV-2023]
AdVerb: Visually Guided Audio Dereverberation
Authors: Sanjoy Chowdhury, Sreyan Ghosh, Subhrajyoti Dasgupta, Anton Ratnarajah, Utkarsh Tyagi, Dinesh Manocha
Institution: University of Maryland; University of Montreal
[IEEE/ACM TASLP-2023]
Multi-Cue Guided Semi-Supervised Learning Toward Target Speaker Separation in Real Environments
Authors: Jiaming Xu, Jian Cui, Yunzhe Hao, Bo Xu
Institution: Xiaomi Corporation; University of Chinese Academy of Sciences
[ICASSP-2024]
Consistent and Relevant: Rethink the Query Embedding in General Sound Separation
Authors: Yuanyuan Wang, Hangting Chen, Dongchao Yang, Jianwei Yu, Chao Weng, Zhiyong Wu, Helen Meng
Institution: Shenzhen International Graduate School, Tsinghua University, Shenzhen, China; Tencent AI Lab, Audio and Speech Signal Processing Oteam, China; The Chinese University of Hong Kong, Hong Kong SAR, China
[ICASSP-2024]
SECP: A Speech Enhancement-Based Curation Pipeline For Scalable Acquisition Of Clean Speech
Authors: Adam Sabra, Cyprian Wronka, Michelle Mao, Samer Hijazi
Institution: Cisco Systems, Inc.
[IJCAI-2024]
Separate in the Speech Chain: Cross-Modal Conditional Audio-Visual Target Speech Extraction
Authors: Zhaoxi Mu, Xinyu Yang
Institution: Xi’an Jiaotong University
[Interspeech-2024]
FlowAVSE: Efficient Audio-Visual Speech Enhancement with Conditional Flow Matching
Authors: Chaeyoung Jung, Suyeon Lee, Ji-Hoon Kim, Joon Son Chung
Institution: Korea Advanced Institute of Science and Technology, South Korea
[Interspeech-2024]
RT-LA-VocE: Real-Time Low-SNR Audio-Visual Speech Enhancement
Authors: Honglie Chen, Rodrigo Mira, Stavros Petridis, Maja Pantic
Institution: Meta AI, UK; Imperial College London, UK
[ACM MM-2024]
RAVSS: Robust Audio-Visual Speech Separation in Multi-Speaker Scenarios with Missing Visual Cues
Authors: Tianrui Pan, Jie Liu, Bohan Wang, Jie Tang, Gangshan Wu
Institution: State Key Laboratory for Novel Software Technology, Nanjing University, China
[Interspeech-2024]
LSTMSE-Net: Long Short Term Speech Enhancement Network for Audio-visual Speech Enhancement
Authors: Arnav Jain, Jasmer Singh Sanjotra, Harshvardhan Choudhary, Krish Agrawal, Rupal Shah, Rohan Jha, M. Sajid, Amir Hussain, M. Tanveer
Institution: Indian Institute of Technology Indore, India; School of Computing, Edinburgh Napier University, United Kingdom
Object Sound Separation
[ECCV-2018]
Learning to Separate Object Sounds by Watching Unlabeled Video
Authors: Ruohan Gao, Rogerio Feris, Kristen Grauman
Institution: The University of Texas at Austin; IBM Research; Facebook AI Research
[ECCV-2018]
The Sound of Pixels
Authors: Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, Antonio Torralba
Institution: Massachusetts Institute of Technology; MIT-IBM Watson AI Lab; Columbia University
[ICASSP-2019]
Self-supervised Audio-visual Co-segmentation
Authors: Andrew Rouditchenko, Hang Zhao, Chuang Gan, Josh McDermott, Antonio Torralba
Institution: Massachusetts Institute of Technology; MIT-IBM Watson AI Lab
[ICCV-2019]
The Sound of Motions
Authors: Hang Zhao, Chuang Gan, Wei-Chiu Ma, Antonio Torralba
Institution: Massachusetts Institute of Technology; MIT-IBM Watson AI Lab
[ICCV-2019]
Recursive Visual Sound Separation Using Minus-Plus Net
Authors: Xudong Xu, Bo Dai, Dahua Lin
Institution: The Chinese University of Hong Kong
[ICCV-2019]
Co-Separating Sounds of Visual Objects
Authors: Ruohan Gao, Kristen Grauman
Institution: The University of Texas at Austin; Facebook AI Research
[ACCV-2020]
Visually Guided Sound Source Separation using Cascaded Opponent Filter Network
Authors: Lingyu Zhu, Esa Rahtu
Institution: Tampere University
[CVPR-2020]
Music Gesture for Visual Sound Separation
Authors: Chuang Gan, Deng Huang, Hang Zhao, Joshua B. Tenenbaum, Antonio Torralba
Institution: Massachusetts Institute of Technology; MIT-IBM Watson AI Lab
[ICCV-2021]
Visual Scene Graphs for Audio Source Separation
Authors: Moitreya Chatterjee, Jonathan Le Roux, Narendra Ahuja, Anoop Cherian
Institution: University of Illinois at Urbana-Champaign; Mitsubishi Electric Research Laboratories
[CVPR-2021]
Cyclic Co-Learning of Sounding Object Visual Grounding and Sound Separation
Authors: Yapeng Tian, Di Hu, Chenliang Xu
Institution: University of Rochester; Renmin University of China; Beijing Key Laboratory of Big Data Management and Analysis Methods
[ECCV-2022]
AudioScopeV2: Audio-Visual Attention Architectures for Calibrated Open-Domain On-Screen Sound Separation
Authors: Efthymios Tzinis, Scott Wisdom, Tal Remez, John R. Hershey
Institution: Google Research; University of Illinois Urbana-Champaign
[ICIP-2022]
Visual Sound Source Separation with Partial Supervision Learning
Authors: Huasen Wang, Lingling Gao, Qianchao Tan, Luping Ji
Institution: University of Electronic Science and Technology of China
[NeurIPS-2022]
Learning Audio-Visual Dynamics Using Scene Graphs for Audio Source Separation
Authors: Moitreya Chatterjee, Narendra Ahuja, Anoop Cherian
Institution: University of Illinois; Mitsubishi Electric Research Labs
[ICLR-2023]
CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos
Authors: Hao-Wen Dong, Naoya Takahashi, Yuki Mitsufuji, Julian McAuley, Taylor Berg-Kirkpatrick
Institution: Sony Group Corporation; University of California San Diego
[CVPR-2023]
Language-Guided Audio-Visual Source Separation via Trimodal Consistency
Authors: Reuben Tan, Arijit Ray, Andrea Burns, Bryan A. Plummer, Justin Salamon, Oriol Nieto, Bryan Russell, Kate Saenko
Institution: Boston University; Adobe Research; MIT-IBM Watson AI Lab, IBM Research
[CVPR-2023]
iQuery: Instruments As Queries for Audio-Visual Sound Separation
Authors: Jiaben Chen, Renrui Zhang, Dongze Lian, Jiaqi Yang, Ziyao Zeng, Jianbo Shi
Institution: University of California San Diego; Shanghai AI Laboratory; The Chinese University of Hong Kong; National University of Singapore; ShanghaiTech University; University of Pennsylvania
[ICCV-2023]
Separating Invisible Sounds Toward Universal Audiovisual Scene-Aware Sound Separation
Authors: Yiyang Su, Ali Vosoughi, Shijian Deng, Yapeng Tian, Chenliang Xu
Institution: Michigan State University; University of Rochester; University of Texas at Dallas
[WACV-2024]
LAVSS: Location-Guided Audio-Visual Spatial Audio Separation
Authors: Yuxin Ye, Wenming Yang, Yapeng Tian
Institution: Tsinghua University; The University of Texas at Dallas
[NeurIPS-2024]
Continual Audio-Visual Sound Separation
Authors: Weiguo Pian, Yiyang Nan, Shijian Deng, Shentong Mo, Yunhui Guo, Yapeng Tian
Institution: The University of Texas at Dallas; Brown University; Carnegie Mellon University
Face Super-resolution and Reconstruction
[CVPR-2020]
Learning to Have an Ear for Face Super-Resolution
Authors: Givi Meishvili, Simon Jenni, Paolo Favaro
Institution: University of Bern
[IEEE TCSVT-2021]
Appearance Matters, So Does Audio: Revealing the Hidden Face via Cross-Modality Transfer
Authors: Chenqi Kong, Baoliang Chen, Wenhan Yang, Haoliang Li, Peilin Chen, Shiqi Wang
Institution: City University of Hong Kong; Nanyang Technological University
[ICASSP-2022]
Deep Video Inpainting Guided by Audio-Visual Self-Supervision
Authors: Kyuyeon Kim, Junsik Jung, Woo Jae Kim, Sung-Eui Yoon
Institution: Korea Advanced Institute of Science and Technology
[CVPR-2022]
Cross-Modal Perceptionist: Can Face Geometry be Gleaned from Voices?
Authors: Cho-Ying Wu, Chin-Cheng Hsu, Ulrich Neumann
Institution: University of Southern California
[WACV-2023]
Audio-Visual Face Reenactment
Authors: Madhav Agarwal, Rudrabha Mukhopadhyay, Vinay Namboodiri, C V Jawahar
Institution: International Institute of Information Technology, Hyderabad; University of Bath
[ICASSP-2023]
Hearing and Seeing Abnormality: Self-Supervised Audio-Visual Mutual Learning for Deepfake Detection
Authors: Chang-Sung Sung, Jun-Cheng Chen, Chu-Song Chen
Institution: National Taiwan University; Academia Sinica
[CVPR-2023]
AVFace: Towards Detailed Audio-Visual 4D Face Reconstruction
Authors: Aggelina Chatziagapi, Dimitris Samaras
Institution: Stony Brook University
[CVPR-2023]
Parametric Implicit Face Representation for Audio-Driven Facial Reenactment
Authors: Ricong Huang, Peiwen Lai, Yipeng Qin, Guanbin Li
Institution: Sun Yat-sen University; Cardiff University
[CVPR-2023]
CodeTalker: Speech-Driven 3D Facial Animation with Discrete Motion Prior
Authors: Jinbo Xing, Menghan Xia, Yuechen Zhang, Xiaodong Cun, Jue Wang, Tien-Tsin Wong
Institution: The Chinese University of Hong Kong; Tencent AI Lab
[ICASSP-2024]
Fusion of Audio and Visual Embeddings for Sound Event Localization and Detection
Authors: Davide Berghi, Peipei Wu, Jinzheng Zhao, Wenwu Wang, Philip J. B. Jackson
Institution: Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, U.K.
[CVPR-2024]
AVFF: Audio-Visual Feature Fusion for Video Deepfake Detection
Authors: Trevine Oorloff, Surya Koppisetti, Nicolò Bonettini, Divyaraj Solanki, Ben Colman, Yaser Yacoob, Ali Shahriyari, Gaurav Bharaj
Institution: University of Maryland - College Park; Reality Defender Inc.
[BMVC-2024]
Content and Style Aware Audio-Driven Facial Animation
Authors: Qingju Liu, Hyeongwoo Kim, Gaurav Bharaj
Institution: Flawless AI, UK; Imperial College London, UK
[BMVC-2024]
Detecting Audio-Visual Deepfakes with Fine-Grained Inconsistencies
Authors: Marcella Astrid, Enjie Ghorbel, Djamila Aouada
Institution: Computer Vision, Imaging & Machine Intelligence Research Group (CVI2), Interdisciplinary Centre for Security, Reliability and Trust (SnT), University of Luxembourg, Luxembourg; Cristal Laboratory, National School of Computer Sciences, Manouba University, Tunisia
[SIGGRAPH-2024]
PersonaTalk: Bring Attention to Your Persona in Visual Dubbing
Authors: Longhao Zhang, Shuang Liang, Zhipeng Ge, Tianshu Hu
Institution: ByteDance, China
Cross-modal Perception
Cross-modal Generation
Mono Sound Generation
Speech
[ICASSP-2017]
Vid2speech: Speech Reconstruction From Silent Video
Authors: Ariel Ephrat, Shmuel Peleg
Institution: The Hebrew University of Jerusalem
[ICCV-2017]
Improved Speech Reconstruction From Silent Video
Authors: Ariel Ephrat, Tavi Halperin, Shmuel Peleg
Institution: The Hebrew University of Jerusalem
[ICASSP-2018]
Lip2Audspec: Speech Reconstruction from Silent Lip Movements Video
Authors: Hassan Akbari, Himani Arora, Liangliang Cao, Nima Mesgarani
Institution: Columbia University
[ACM MM-2018]
Harnessing AI for Speech Reconstruction using Multi-view Silent Video Feed
Authors: Yaman Kumar, Mayank Aggarwal, Pratham Nawal, Shin’ichi Satoh, Rajiv Ratn Shah, Roger Zimmermann
Institution: Netaji Subhas Institute of Technology; National Institute of Informatics; Indraprastha Institute of Information Technology; National University of Singapore
[2019]
Video-Driven Speech Reconstruction using Generative Adversarial Networks
Authors: Konstantinos Vougioukas, Pingchuan Ma, Stavros Petridis, Maja Pantic
Institution: Imperial College London; Samsung AI Centre
[Interspeech-2019]
Hush-Hush Speak: Speech Reconstruction Using Silent Videos
Authors: Shashwat Uttam, Yaman Kumar Singla, Dhruva Sahrawat, Mansi Agarwal
Institution: Netaji Subhas Institute of Technology; Adobe Research; National University of Singapore; Delhi Technological University
[ICASSP-2021]
Learning Audio-Visual Correlations From Variational Cross-Modal Generation
Authors: Ye Zhu, Yu Wu, Hugo Latapie, Yi Yang, Yan Yan
Institution: Illinois Institute of Technology; University of Technology Sydney; Cisco
[IEEE TCYB-2022]
End-to-End Video-to-Speech Synthesis Using Generative Adversarial Networks
Authors: Rodrigo Mira, Konstantinos Vougioukas, Pingchuan Ma, Stavros Petridis, Björn W. Schuller, Maja Pantic
Institution: Imperial College London; University of Augsburg; Meta AI
[ICPR-2022]
Learning Speaker-specific Lip-to-Speech Generation
Authors: Munender Varshney, Ravindra Yadav, Vinay P. Namboodiri, Rajesh M Hegde
Institution: Indian Institute of Technology; University of Bath
[ICASSP-2023]
Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech
Authors: Jiyoung Lee, Joon Son Chung, Soo-Whan Chung
Institution: NAVER AI Lab; Korea Advanced Institute of Science and Technology; NAVER Cloud
[CVPR-2023]
ReVISE: Self-Supervised Speech Resynthesis With Visual Input for Universal and Generalized Speech Regeneration
Authors: Wei-Ning Hsu, Tal Remez, Bowen Shi, Jacob Donley, Yossi Adi
Institution: Meta AI; Meta Reality Labs Research; Toyota Technological Institute at Chicago; The Hebrew University of Jerusalem
[ICCV-2023]
DiffV2S: Diffusion-based Video-to-Speech Synthesis with Vision-guided Speaker Embedding
Authors: Jeongsoo Choi, Joanna Hong, Yong Man Ro
Institution: Korea Advanced Institute of Science and Technology
[CVPR-2024]
Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners
Authors: Yazhou Xing, Yingqing He, Zeyue Tian, Xintao Wang, Qifeng Chen
Institution: HKUST; ARC Lab, Tencent PCG
Music
[IEEE TMM-2015]
Real-Time Piano Music Transcription Based on Computer Vision
Authors: Mohammad Akbari, Howard Cheng
Institution: Simon Fraser University; University of Lethbridge
[ACM MM-2017]
Deep Cross-Modal Audio-Visual Generation
Authors: Lele Chen, Sudhanshu Srivastava, Zhiyao Duan, Chenliang Xu
Institution: University of Rochester
[NeurIPS-2020]
Audeo: Audio Generation for a Silent Performance Video
Authors: Kun Su, Xiulong Liu, Eli Shlizerman
Institution: University of Washington
[ECCV-2020]
Foley Music: Learning to Generate Music from Videos
Authors: Chuang Gan, Deng Huang, Peihao Chen, Joshua B. Tenenbaum, Antonio Torralba
Institution: MIT-IBM Watson AI Lab; South China University of Technology; Massachusetts Institute of Technology
[ICASSP-2020]
Sight to Sound: An End-to-End Approach for Visual Piano Transcription
Authors: A. Sophia Koepke, Olivia Wiles, Yael Moses, Andrew Zisserman
Institution: University of Oxford; The Interdisciplinary Center
[2020]
Multi-Instrumentalist Net: Unsupervised Generation of Music from Body Movements
Authors: Kun Su, Xiulong Liu, Eli Shlizerman
Institution: University of Washington
[ICASSP-2021]
Collaborative Learning to Generate Audio-Video Jointly
Authors: Vinod K Kurmi, Vipul Bajaj, Badri N Patro, K S Venkatesh, Vinay P Namboodiri, Preethi Jyothi
Institution: Indian Institute of Technology Kanpur; University of Bath; Indian Institute of Technology Bombay
[ACM MM-2021]
Video Background Music Generation with Controllable Music Transformer
Authors: Shangzhe Di, Zeren Jiang, Si Liu, Zhaokai Wang, Leyan Zhu, Zexin He, Hongming Liu, Shuicheng Yan
Institution: Beihang University; Charterhouse School, Godalming, Surrey; Sea AI Lab
[2022]
Vis2Mus: Exploring Multimodal Representation Mapping for Controllable Music Generation
Authors: Runbang Zhang, Yixiao Zhang, Kai Shao, Ying Shan, Gus Xia
Institution: New York University, Shanghai; Queen Mary University of London; Tencent Inc.; Mohamed bin Zayed University of Artificial Intelligence
[CVPR-2023]
Conditional Generation of Audio from Video via Foley Analogies
Authors: Yuexi Du, Ziyang Chen, Justin Salamon, Bryan Russell, Andrew Owens
Institution: University of Michigan; Yale University; Adobe Research
[ICML-2023]
Long-Term Rhythmic Video Soundtracker
Authors: Jiashuo Yu, Yaohui Wang, Xinyuan Chen, Xiao Sun, Yu Qiao
Institution: Shanghai Artificial Intelligence Laboratory
Natural Sound
[CVPR-2016]
Visually Indicated Sounds
Authors: Andrew Owens, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, William T. Freeman
Institution: Massachusetts Institute of Technology; U.C. Berkeley; Google Research
[CVPR-2018]
Visual to Sound: Generating Natural Sound for Videos in the Wild
Authors: Yipin Zhou, Zhaowen Wang, Chen Fang, Trung Bui, Tamara L. Berg
Institution: University of North Carolina at Chapel Hill; Adobe Research
[IEEE TIP-2020]
Generating Visually Aligned Sound From Videos
Authors: Peihao Chen, Yang Zhang, Mingkui Tan, Hongdong Xiao, Deng Huang, Chuang Gan
Institution: South China University of Technology; China Pazhou Laboratory; MIT-IBM Watson AI Lab
[BMVC-2021]
Taming Visually Guided Sound Generation
Authors: Vladimir Iashin, Esa Rahtu
Institution: Tampere University
[IEEE TCSVT-2022]
Towards an End-to-End Visual-to-Raw-Audio Generation With GAN
Authors: Shiguang Liu, Sijia Li, Haonan Cheng
Institution: Tianjin University
[ICASSP-2023]
I Hear Your True Colors: Image Guided Audio Generation
Authors: Roy Sheffer, Yossi Adi
Institution: The Hebrew University of Jerusalem
[CVPR-2023]
Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos
Authors: Kun Su, Kaizhi Qian, Eli Shlizerman, Antonio Torralba, Chuang Gan
Institution: University of Washington; MIT-IBM Watson AI Lab; MIT; UMass Amherst
Spatial Sound Generation
[ACM TOG-2018]
Scene-aware audio for 360° videos
Authors: Dingzeyu Li, Timothy R. Langlois, Changxi Zheng
Institution: Columbia University; Adobe Research
[NeurIPS-2018]
Self-Supervised Generation of Spatial Audio for 360° Video
Authors: Pedro Morgado, Nuno Vasconcelos, Timothy Langlois, Oliver Wang
Institution: University of California San Diego; Adobe Research
[CVPR-2019]
2.5D Visual Sound
Authors: Ruohan Gao, Kristen Grauman
Institution: The University of Texas at Austin; Facebook AI Research
[ICIP-2019]
Self-Supervised Audio Spatialization with Correspondence Classifier
Authors: Yu-Ding Lu, Hsin-Ying Lee, Hung-Yu Tseng, Ming-Hsuan Yang
Institution: University of California at Merced
[ECCV-2020]
Sep-Stereo: Visually Guided Stereophonic Audio Generation by Associating Source Separation
Authors: Hang Zhou, Xudong Xu, Dahua Lin, Xiaogang Wang, Ziwei Liu
Institution: The Chinese University of Hong Kong
[CVPR-2021]
Visually Informed Binaural Audio Generation without Binaural Audios
Authors: Xudong Xu, Hang Zhou, Ziwei Liu, Bo Dai, Xiaogang Wang, Dahua Lin
Institution: The Chinese University of Hong Kong; Nanyang Technological University
[AAAI-2021]
Exploiting Audio-Visual Consistency with Partial Supervision for Spatial Audio Generation
Authors: Yan-Bo Lin, Yu-Chiang Frank Wang
Institution: National Taiwan University; ASUS Intelligent Cloud Services
[TOG-2021]
Binaural Audio Generation via Multi-task Learning
Authors: Sijia Li, Shiguang Liu, Dinesh Manocha
Institution: Tianjin University; University of Maryland at College Park
[WACV-2022]
Beyond Mono to Binaural: Generating Binaural Audio From Mono Audio With Depth and Cross Modal Attention
Authors: Kranti Kumar Parida, Siddharth Srivastava, Gaurav Sharma
Institution: Indian Institute of Technology Kanpur; CDAC Noida; TensorTour Inc.
[CVPR-2023]
Novel-View Acoustic Synthesis
Authors: Changan Chen, Alexander Richard, Roman Shapovalov, Vamsi Krishna Ithapu, Natalia Neverova, Kristen Grauman, Andrea Vedaldi
Institution: University of Texas at Austin; Meta AI
Video Generation
Talking Face
[ACM TOG-2017]
Synthesizing Obama: learning lip sync from audio
Authors: Supasorn Suwajanakorn, Steven Maxwell Seitz, Ira Kemelmacher-Shlizerman
Institution: University of Washington
[ECCV-2018]
Lip Movements Generation at a Glance
Authors: Lele Chen, Zhiheng Li, Ross K Maddox, Zhiyao Duan, Chenliang Xu
Institution: University of Rochester
[IJCV-2019]
You Said That?: Synthesising Talking Faces from Audio
Authors: Amir Jamaludin, Joon Son Chung, Andrew Zisserman
Institution: University of Oxford
[ICCV-2019]
Few-Shot Adversarial Learning of Realistic Neural Talking Head Models
Authors: Egor Zakharov, Aliaksandra Shysheya, Egor Burkov, Victor Lempitsky
Institution: Samsung AI Center; Skolkovo Institute of Science and Technology
[IJCV-2020]
Realistic Speech-Driven Facial Animation with GANs
Authors: Konstantinos Vougioukas, Stavros Petridis, Maja Pantic
Institution: Imperial College London; Samsung AI Research Centre Cambridge
[IJCV-2020]
GANimation: One-Shot Anatomically Consistent Facial Animation
Authors: Albert Pumarola, Antonio Agudo, Aleix M. Martinez, Alberto Sanfeliu, Francesc Moreno-Noguer
Institution: Institut de Robòtica i Informàtica Industrial; The Ohio State University
[ACM TOG-2020]
MakeItTalk: Speaker-Aware Talking-Head Animation
Authors: Yang Zhou, Xintong Han, Eli Shechtman, Jose Echevarria, Evangelos Kalogerakis, Dingzeyu Li
Institution: University of Massachusetts Amherst; Huya Inc.; Adobe Research
[CVPR-2020]
FReeNet: Multi-Identity Face Reenactment
Authors: Jiangning Zhang, Xianfang Zeng, Mengmeng Wang, Yusu Pan, Liang Liu, Yong Liu, Yu Ding, Changjie Fan
Institution: Zhejiang University; Fuxi AI Lab
[ECCV-2020]
Neural Voice Puppetry: Audio-driven Facial Reenactment
Authors: Justus Thies, Mohamed Elgharib, Ayush Tewari, Christian Theobalt, Matthias Nießner
Institution: Technical University of Munich; Saarland Informatics Campus
[CVPR-2020]
Rotate-and-Render: Unsupervised Photorealistic Face Rotation from Single-View Images
Authors: Hang Zhou, Jihao Liu, Ziwei Liu, Yu Liu, Xiaogang Wang
Institution: The Chinese University of Hong Kong; SenseTime Research
[ECCV-2020]
MEAD: A Large-scale Audio-visual Dataset for Emotional Talking-face Generation
Authors: Kaisiyuan Wang, Qianyi Wu, Linsen Song, Zhuoqian Yang, Wayne Wu, Chen Qian, Ran He, Yu Qiao, Chen Change Loy
Institution: SenseTime Research; Carnegie Mellon University; Center for Research on Intelligent Perception and Computing, CASIA; University of Chinese Academy of Sciences; Shenzhen Institutes of Advanced Technology, Chinese Academy of Science; Nanyang Technological University
[AAAI-2021]
Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation
Authors: Lincheng Li, Suzhen Wang, Zhimeng Zhang, Yu Ding, Yixing Zheng, Xin Yu, Changjie Fan
Institution: Netease Fuxi AI Lab; University of Technology Sydney
[CVPR-2021]
Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation
Authors: Hang Zhou, Yasheng Sun, Wayne Wu, Chen Change Loy, Xiaogang Wang, Ziwei Liu
Institution: The Chinese University of Hong Kong; SenseTime Research; Tokyo Institute of Technology; Nanyang Technological University
[CVPR-2021]
Audio-Driven Emotional Video Portraits
Authors: Xinya Ji, Hang Zhou, Kaisiyuan Wang, Wayne Wu, Chen Change Loy, Xun Cao, Feng Xu
Institution: Nanjing University; The Chinese University of Hong Kong; The University of Sydney; SenseTime Research; Nanyang Technological University; Tsinghua University
[AAAI-2022]
One-shot Talking Face Generation from Single-speaker Audio-Visual Correlation Learning
Authors: Suzhen Wang, Lincheng Li, Yu Ding, Xin Yu
Institution: Netease Fuxi AI Lab; University of Technology Sydney
[TVCG-2022]
Generating talking face with controllable eye movements by disentangled blinking feature
Authors: Shiguang Liu, Jiaqi Hao
Institution: Tianjin University
[AAAI-2022]
SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory
Authors: Se Jin Park, Minsu Kim, Joanna Hong, Jeongsoo Choi, Yong Man Ro
Institution: Korea Advanced Institute of Science and Technology
[CVPR-2022]
FaceFormer: Speech-Driven 3D Facial Animation with Transformers
Authors: Yingruo Fan, Zhaojiang Lin, Jun Saito, Wenping Wang, Taku Komura
Institution: The University of Hong Kong; The Hong Kong University of Science and Technology; Adobe Research; Texas A&M University
[CVPR-2023]
Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert
Authors: Jiadong Wang, Xinyuan Qian, Malu Zhang, Robby T. Tan, Haizhou Li
Institution: National University of Singapore; University of Science and Technology Beijing; University of Electronic Science and Technology of China; The Chinese University of Hong Kong
[ICASSP-2023]
Free-View Expressive Talking Head Video Editing
Authors: Yuantian Huang, Satoshi Iizuka, Kazuhiro Fukui
Institution: University of Tsukuba
[ICASSP-2023]
Audio-Driven Facial Landmark Generation in Violin Performance using 3DCNN Network with Self Attention Model
Authors: Tingwei Lin, Chaolin Liu, Li Su
Institution: Taiwan International Graduate Program; Academia Sinica; National Chengchi University
[ICASSP-2023]
Naturalistic Head Motion Generation from Speech
Authors: Trisha Mittal, Zakaria Aldeneh, Masha Fedzechkina, Anurag Ranjan, Barry-John Theobald
Institution: University of Maryland; Apple Inc.
[ICASSP-2023]
Audio-Visual Inpainting: Reconstructing Missing Visual Information with Sound
Authors: Valentina Sanguineti, Sanket Thakur, Pietro Morerio, Alessio Del Bue, Vittorio Murino
Institution: Istituto Italiano di Tecnologia; University of Genova
[CVPR-2023]
Identity-Preserving Talking Face Generation with Landmark and Appearance Priors
Authors: Weizhi Zhong, Chaowei Fang, Yinqi Cai, Pengxu Wei, Gangming Zhao, Liang Lin, Guanbin Li
Institution: Sun Yat-sen University; Xidian University; The University of Hong Kong
[CVPR-2023]
SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation
Authors: Wenxuan Zhang, Xiaodong Cun, Xuan Wang, Yong Zhang, Xi Shen, Yu Guo, Ying Shan, Fei Wang
Institution: Xi’an Jiaotong University; National Key Laboratory of Human-Machine Hybrid Augmented Intelligence; Tencent AI Lab; Ant Group
[ACM MM-2023]
Hierarchical Semantic Perceptual Listener Head Video Generation: A High-performance Pipeline
Authors: Zhigang Chang, Weitai Hu, Qing Yang, Shibao Zheng
Institution: Du Xiaoman Financial; Shanghai Jiao Tong University
[CVPR-2023]
LipFormer: High-fidelity and Generalizable Talking Face Generation with A Pre-learned Facial Codebook
Authors: Jiayu Wang, Kang Zhao, Shiwei Zhang, Yingya Zhang, Yujun Shen, Deli Zhao, Jingren Zhou
Institution: Alibaba Group
[CVPR-2024]
Audio-Visual Speech Representation Expert for Enhanced Talking Face Video Generation and Evaluation
Authors: Dogucan Yaman, Fevziye Irem Eyiokur, Leonard Bärmann, Seymanur Aktı, Hazım Kemal Ekenel, Alexander Waibel
Institution: Karlsruhe Institute of Technology; Istanbul Technical University; Carnegie Mellon University
[InterSpeech-2024]
Enhancing Speech-Driven 3D Facial Animation with Audio-Visual Guidance from Lip Reading Expert
Authors: Han EunGi, Oh Hyun-Bin, Kim Sung-Bin, Corentin Nivelet Etcheberry, Suekyeong Nam, Janghoon Joo, Tae-Hyun Oh
Institution: Grad. School of Artificial Intelligence and Dept. of Electrical Engineering, POSTECH, Korea; ENSC, Bordeaux INP, France; KRAFTON, Korea; Inst. for Convergence Research and Education in Advanced Technology, Yonsei University, Korea
[ACM MM-2024]
ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer
Authors: Jiazhi Guan, Zhiliang Xu, Hang Zhou, Kaisiyuan Wang, Shengyi He, Zhanwang Zhang, Borong Liang, Haocheng Feng, Errui Ding, Jingtuo Liu, Jingdong Wang, Youjian Zhao, Ziwei Liu
Institution: BNRist, DCST, Tsinghua University; Baidu Inc.; Zhongguancun Laboratory; S-Lab, Nanyang Technological University
[ECCV-2024]
KMTalk: Speech-Driven 3D Facial Animation with Key Motion Embedding
Authors: Zhihao Xu, Shengjie Gong, Jiapeng Tang, Lingyu Liang, Yining Huang, Haojie Li, Shuangping Huang
Institution: South China University of Technology; Technical University of Munich; Pazhou Laboratory
Gesture
[IVA-2018]
Evaluation of Speech-to-Gesture Generation Using Bi-Directional LSTM Network
Authors: Dai Hasegawa, Naoshi Kaneko, Shinichi Shirakawa, Hiroshi Sakuta, Kazuhiko Sumi
Institution: Hokkai Gakuen University Sapporo; Aoyama Gakuin University; Yokohama National University
[IVA-2019]
Analyzing Input and Output Representations for Speech-Driven Gesture Generation
Authors: Taras Kucherenko, Dai Hasegawa, Gustav Eje Henter, Naoshi Kaneko, Hedvig Kjellström
Institution: KTH Royal Institute of Technology in Stockholm; Hokkai Gakuen University; Aoyama Gakuin University
[CVPR-2019]
Learning Individual Styles of Conversational Gesture
Authors: Shiry Ginosar, Amir Bar, Gefen Kohavi, Caroline Chan, Andrew Owens, Jitendra Malik
Institution: University of California, Berkeley; Zebra Medical Vision; Massachusetts Institute of Technology
[ICMI-2019]
To React or not to React: End-to-End Visual Pose Forecasting for Personalized Avatar during Dyadic Conversations
Authors: Chaitanya Ahuja, Shugao Ma, Louis-Philippe Morency, Yaser Sheikh
Institution: Carnegie Mellon University; Facebook Reality Labs
[EUROGRAPHICS-2020]
Style-Controllable Speech-Driven Gesture Synthesis Using Normalising Flows
Authors: Simon Alexanderson, Gustav Eje Henter, Taras Kucherenko, Jonas Beskow
Institution: KTH Royal Institute of Technology
[ICMI-2020]
Gesticulator: A Framework For Semantically-Aware Speech-Driven Gesture Generation
Authors: Taras Kucherenko, Patrik Jonell, Sanne van Waveren, Gustav Eje Henter, Simon Alexanderson, Iolanda Leite, Hedvig Kjellström
Institution: KTH Royal Institute of Technology
[ECCV-2020]
Style Transfer for Co-Speech Gesture Animation: A Multi-Speaker Conditional-Mixture Approach
Authors: Chaitanya Ahuja, Dong Won Lee, Yukiko I. Nakano, Louis-Philippe Morency
Institution: Carnegie Mellon University; Seikei University
[ACM TOG-2020]
Speech Gesture Generation from the Trimodal Context of Text, Audio, and Speaker Identity
Authors: Youngwoo Yoon, Bok Cha, Joo-Haeng Lee, Minsu Jang, Jaeyeon Lee, Jaehong Kim, Geehyuk Lee
Institution: Korea Advanced Institute of Science and Technology; University of Science and Technology; Electronics and Telecommunications Research Institute
[CVPR-2022]
SEEG: Semantic Energized Co-Speech Gesture Generation
Authors: Yuanzhi Liang, Qianyu Feng, Linchao Zhu, Li Hu, Pan Pan, Yi Yang
Institution: Alibaba; University of Technology Sydney; Zhejiang University
[IEEE TNNLS-2022]
VAG: A Uniform Model for Cross-Modal Visual-Audio Mutual Generation
Authors: Wangli Hao, He Guan, Zhaoxiang Zhang
Institution: Chinese Academy of Sciences; University of Chinese Academy of Sciences
[CVPR-2023]
Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation
Authors: Lingting Zhu, Xian Liu, Xuanyu Liu, Rui Qian, Ziwei Liu, Lequan Yu
Institution: The University of Hong Kong; The Chinese University of Hong Kong; Nanyang Technological University
[IJCAI-2024]
Bridge to Non-Barrier Communication: Gloss-Prompted Fine-grained Cued Speech Gesture Generation with Diffusion Model
Authors: Wentao Lei, Li Liu, Jun Wang
Institution: The Hong Kong University of Science and Technology (Guangzhou); Tencent AI Lab; The Hong Kong University of Science and Technology
Dance
[ACM MM-2018]
Dance with Melody: An LSTM-autoencoder Approach to Music-oriented Dance Synthesis
Authors: Taoran Tang, Jia Jia, Hanyang Mao
Institution: Tsinghua University
[CVPR-2018]
Audio to Body Dynamics
Authors: Eli Shlizerman, Lucio Dery, Hayden Schoen, Ira Kemelmacher-Shlizerman
Institution: Facebook Inc.; Stanford University; University of Washington
[NeurIPS-2019]
Dancing to Music
Authors: Hsin-Ying Lee, Xiaodong Yang, Ming-Yu Liu, Ting-Chun Wang, Yu-Ding Lu, Ming-Hsuan Yang, Jan Kautz
Institution: University of California; NVIDIA
[ICLR-2021]
Dance Revolution: Long-Term Dance Generation with Music via Curriculum Learning
Authors: Ruozi Huang, Huang Hu, Wei Wu, Kei Sawada, Mi Zhang, Daxin Jiang
Institution: Fudan University; Microsoft STCA; Meituan; Rinna AI
[ICCV-2021]
AI Choreographer: Music Conditioned 3D Dance Generation With AIST++
Authors: Ruilong Li, Shan Yang, David A. Ross, Angjoo Kanazawa
Institution: University of Southern California; Google Research; University of California, Berkeley
[ICASSP-2022]
Genre-Conditioned Long-Term 3D Dance Generation Driven by Music
Authors: Yuhang Huang, Junjie Zhang, Shuyan Liu, Qian Bao, Dan Zeng, Zhineng Chen, Wu Liu
Institution: Shanghai University; University of Chinese Academy of Sciences; JD AI Research; Fudan University
[CVPR-2022]
Bailando: 3D Dance Generation by Actor-Critic GPT with Choreographic Memory
Authors: Li Siyao, Weijiang Yu, Tianpei Gu, Chunze Lin, Quan Wang, Chen Qian, Chen Change Loy, Ziwei Liu
Institution: Nanyang Technological University; Sun Yat-Sen University; University of California, Los Angeles; SenseTime Research
[CVPR-2023]
MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation
Authors: Ludan Ruan, Yiyang Ma, Huan Yang, Huiguo He, Bei Liu, Jianlong Fu, Nicholas Jing Yuan, Qin Jin, Baining Guo
Institution: Renmin University of China; Peking University; Microsoft Research
[IEEE TMM-2023]
Learning Music-Dance Representations through Explicit-Implicit Rhythm Synchronization
Authors: Jiashuo Yu, Junfu Pu, Ying Cheng, Rui Feng, Ying Shan
Institution: Shanghai Key Lab of Intelligent Information Processing; Fudan University; Tencent
[VCJ-2024]
QEAN: Quaternion-Enhanced Attention Network for Visual Dance Generation
Authors: Zhizhen Zhou, Yejing Huo, Guoheng Huang, An Zeng, Xuhang Chen, Lian Huang, Zinuo Li
Institution: Guangdong University of Technology, Guangdong, China; Huizhou University, Guangdong, China; Guangdong Mechanical and Electrical College, Guangdong, China; University of Western Australia, WA, Australia
Image Manipulation
[2021]
Sound-Guided Semantic Image Manipulation
Authors: Seung Hyun Lee, Wonseok Roh, Wonmin Byeon, Sang Ho Yoon, Chan Young Kim, Jinkyu Kim, Sangpil Kim
Institution: Korea University; Korea Advanced Institute of Science and Technology; NVIDIA Corp.
[2022]
Learning Visual Styles from Audio-Visual Associations
Authors: Tingle Li, Yichen Liu, Andrew Owens, Hang Zhao
Institution: Tsinghua University; University of Michigan; Shanghai Qi Zhi Institute
[CVPR-2023]
Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment
Authors: Kim Sung-Bin, Arda Senocak, Hyunwoo Ha, Andrew Owens, Tae-Hyun Oh
Institution: Pohang University of Science and Technology; Korea Advanced Institute of Science and Technology; University of Michigan; Yonsei University
[ACM MM-2024]
An Inverse Partial Optimal Transport Framework for Music-guided Movie Trailer Generation
Authors: Yutong Wang, Sidan Zhu, Hongteng Xu, Dixin Luo
Institution: Beijing Institute of Technology, Beijing, China; Key Laboratory of Artificial Intelligence, Ministry of Education, Shanghai, China; Renmin University of China, Beijing, China
Depth Estimation
[ICRA-2020]
BatVision: Learning to See 3D Spatial Layout with Two Ears
Authors: Jesper Haahr Christensen, Sascha Hornauer, Stella X. Yu
Institution: Technical University of Denmark; University of California
[ECCV-2020]
VISUALECHOES: Spatial Image Representation Learning Through Echolocation
Authors: Ruohan Gao, Changan Chen, Ziad Al-Halah, Carl Schissler, Kristen Grauman
Institution: The University of Texas at Austin; Facebook Reality Lab; Facebook AI Research
[CVPR-2021]
Beyond Image to Depth: Improving Depth Prediction Using Echoes
Authors: Kranti Kumar Parida, Siddharth Srivastava, Gaurav Sharma
Institution: Indian Institute of Technology Kanpur; Centre for Development of Advanced Computing Noida; TensorTour Inc.
[ICASSP-2022]
Co-Attention-Guided Bilinear Model for Echo-Based Depth Estimation
Authors: Go Irie, Takashi Shibata, Akisato Kimura
Institution: Nippon Telegraph & Telephone Corporation
[NeurIPS-2022]
Learning Neural Acoustic Fields
Authors: Andrew Luo, Yilun Du, Michael Tarr, Josh Tenenbaum, Antonio Torralba, Chuang Gan
Institution: Carnegie Mellon University; Massachusetts Institute of Technology; MIT-IBM Watson AI Lab
[NeurIPS-2022]
Few-Shot Audio-Visual Learning of Environment Acoustics
Authors: Sagnik Majumder, Changan Chen, Ziad Al-Halah, Kristen Grauman
Institution: The University of Texas at Austin; Facebook AI Research
Audio-visual Transfer Learning
[NeurIPS-2016]
SoundNet: Learning Sound Representations from Unlabeled Video
Authors: Yusuf Aytar, Carl Vondrick, Antonio Torralba
Institution: Massachusetts Institute of Technology
[ICCV-2019]
Self-Supervised Moving Vehicle Tracking With Stereo Sound
Authors: Chuang Gan, Hang Zhao, Peihao Chen, David Cox, Antonio Torralba
Institution: Massachusetts Institute of Technology; MIT-IBM Watson AI Lab; IBM Research AI
[CVPR-2021]
There Is More Than Meets the Eye: Self-Supervised Multi-Object Detection and Tracking With Sound by Distilling Multimodal Knowledge
Authors: Francisco Rivera Valverde, Juana Valeria Hurtado, Abhinav Valada
Institution: University of Freiburg
[AAAI-2021]
Enhanced Audio Tagging via Multi- to Single-Modal Teacher-Student Mutual Learning
Authors: Yifang Yin, Harsh Shrivastava, Ying Zhang, Zhenguang Liu, Rajiv Ratn Shah, Roger Zimmermann
Institution: National University of Singapore; Northwestern Polytechnical University; Zhejiang Gongshang University; Indraprastha Institute of Information Technology, Delhi
[Interspeech-2021]
Knowledge Distillation from Multi-Modality to Single-Modality for Person Verification
Authors: Leying Zhang, Zhengyang Chen, Yanmin Qian
Institution: Shanghai Jiao Tong University
[ICCV-2021]
Multimodal Knowledge Expansion
Authors: Zihui Xue, Sucheng Ren, Zhengqi Gao, Hang Zhao
Institution: Shanghai Qi Zhi Institute; UT Austin; South China University of Technology; Massachusetts Institute of Technology; Tsinghua University
[CVPR-2021]
Distilling Audio-visual Knowledge by Compositional Contrastive Learning
Authors: Yanbei Chen, Yongqin Xian, A. Sophia Koepke, Ying Shan, Zeynep Akata
Institution: University of Tübingen; MPI for Informatics; Tencent; Max Planck Institute for Intelligent Systems
[2022]
Estimating Visual Information From Audio Through Manifold Learning
Authors: Fabrizio Pedersoli, Dryden Wiebe, Amin Banitalebi, Yong Zhang, George Tzanetakis, Kwang Moo Yi
Institution: University of British Columbia; Huawei Technologies Canada Co., Ltd; University of Victoria
[DCASE-2021]
Audio-Visual Scene Classification Using A Transfer Learning Based Joint Optimization Strategy
Authors: Chengxin Chen, Meng Wang, Pengyuan Zhang
Institution: Institute of Acoustics, CAS; University of Chinese Academy of Sciences
[Interspeech-2021]
Audiovisual transfer learning for audio tagging and sound event detection
Authors: Wim Boes, Hugo Van hamme
Institution: ESAT, KU Leuven
[2023]
Revisiting Pre-training in Audio-Visual Learning
Authors: Ruoxuan Feng, Wenke Xia, Di Hu
Institution: Hunan University; Renmin University of China
[IJCNN-2023]
A Generative Approach to Audio-Visual Generalized Zero-Shot Learning: Combining Contrastive and Discriminative Techniques
Authors: Qichen Zheng, Jie Hong, Moshiur Farazi
Institution: Australian National University; CSIRO Data61
[ICCV-2023]
Audio-Visual Class-Incremental Learning
Authors: Weiguo Pian, Shentong Mo, Yunhui Guo, Yapeng Tian
Institution: The University of Texas at Dallas; Carnegie Mellon University
[ICCV-2023]
Hyperbolic Audio-visual Zero-shot Learning
Authors: Jie Hong, Zeeshan Hayder, Junlin Han, Pengfei Fang, Mehrtash Harandi, Lars Petersson
Institution: Australian National University; CSIRO Data61
[CVPR-2023]
Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models
Authors: Zhiqiu Lin, Samuel Yu, Zhiyi Kuang, Deepak Pathak, Deva Ramanan
Institution: Carnegie Mellon University
[ICCV-2023]
Class-Incremental Grouping Network for Continual Audio-Visual Learning
Authors: Shentong Mo, Weiguo Pian, Yapeng Tian
Institution: Carnegie Mellon University; University of Texas at Dallas
[ICCV-2023]
Dense 2D-3D Indoor Prediction with Sound via Aligned Cross-Modal Distillation
Authors: Heeseung Yun, Joonil Na, Gunhee Kim
Institution: Seoul National University
Cross-modal Retrieval
[2017]
Content-Based Video-Music Retrieval Using Soft Intra-Modal Structure Constraint
Authors: Sungeun Hong, Woobin Im, Hyun S. Yang
[ICCV-2017]
Image2song: Song Retrieval via Bridging Image Content and Lyric Words
Authors: Xuelong Li, Di Hu, Xiaoqiang Lu
Institution: Chinese Academy of Sciences; Northwestern Polytechnical University
[CVPR-2018]
Seeing voices and hearing faces: Cross-modal biometric matching
Authors: Arsha Nagrani, Samuel Albanie, Andrew Zisserman
Institution: University of Oxford
[ECCV-2018]
Cross-modal Embeddings for Video and Audio Retrieval
Authors: Didac Suris, Amanda Duarte, Amaia Salvador, Jordi Torres, Xavier Giro-i-Nieto
Institution: Universitat Politecnica de Catalunya; Barcelona Supercomputing Center
[ISM-2018]
Audio-Visual Embedding for Cross-Modal Music Video Retrieval through Supervised Deep CCA
Authors: Donghuo Zeng, Yi Yu, Keizo Oyama
Institution: National Institute of Informatics
[TOMCCAP-2020]
Deep Triplet Neural Networks with Cluster-CCA for Audio-Visual Cross-Modal Retrieval
Authors: Donghuo Zeng, Yi Yu, Keizo Oyama
Institution: National Institute of Informatics
[IEEE TGRS-2020]
Deep Cross-Modal Image–Voice Retrieval in Remote Sensing
Authors: Yaxiong Chen, Xiaoqiang Lu, Shuai Wang
Institution: University of Chinese Academy of Sciences; Chinese Academy of Sciences
[2021]
Learning Explicit and Implicit Latent Common Spaces for Audio-Visual Cross-Modal Retrieval
Authors: Donghuo Zeng, Jianming Wu, Gen Hattori, Yi Yu, Rong Xu
Institution: KDDI Research, Inc.; National Institute of Informatics, SOKENDAI
[ICCV-2021]
Temporal Cue Guided Video Highlight Detection With Low-Rank Audio-Visual Fusion
Authors: Qinghao Ye, Xiyue Shen, Yuan Gao, Zirui Wang, Qi Bi, Ping Li, Guang Yang
Institution: Hangzhou Dianzi University; University of California; East China Normal University; University of Oxford; Wuhan University; Imperial College London
[ICMR-2024]
Anchor-aware Deep Metric Learning for Audio-visual Retrieval
Authors: Donghuo Zeng, Yanan Wang, Kazushi Ikeda, Yi Yu
Institution: KDDI Research, Inc.; Hiroshima University
[IJCAI-2022]
Unsupervised Voice-Face Representation Learning by Cross-Modal Prototype Contrast
Authors: Boqing Zhu, Kele Xu, Changjian Wang, Zheng Qin, Tao Sun, Huaimin Wang, Yuxing Peng
Institution: National University of Defense Technology
[IEEE ISM-2022]
Complete Cross-triplet Loss in Label Space for Audio-visual Cross-modal Retrieval
Authors: Donghuo Zeng, Yanan Wang, Jianming Wu, Kazushi Ikeda
Institution: KDDI Research, Inc.
[IEEE SMC-2022]
Graph Network based Approaches for Multi-modal Movie Recommendation System
Authors: Daipayan Chakder, Prabir Mondal, Subham Raj, Sriparna Saha, Angshuman Ghosh, Naoyuki Onoe
Institution: Indian Institute of Technology; Sony Research
[CVPR-2022]
Visual Acoustic Matching
Authors: Changan Chen, Ruohan Gao, Paul Calamia, Kristen Grauman
Institution: University of Texas at Austin; Stanford University; Meta AI
[IEEE TMM-2023]
Deep Cross-Modal Retrieval Between Spatial Image and Acoustic Speech
Authors: Xinyuan Qian, Wei Xue, Qiquan Zhang, Ruijie Tao, Haizhou Li
Institution: University of Science and Technology Beijing; Hong Kong University of Science and Technology; University of New South Wales; National University of Singapore; The Chinese University of Hong Kong-Shenzhen
[ICASSP-2024]
Audio Match Cutting: Finding and Creating Matching Audio Transitions in Movies and Videos
Authors: Dennis Fedorishin, Lie Lu, Srirangaraj Setlur, Venu Govindaraju
Institution: Dolby Laboratories; University at Buffalo
Audio-visual Collaboration
Audio-visual Representation Learning
[ICCV-2017]
Look, Listen and Learn
Authors: Relja Arandjelovic, Andrew Zisserman
Institution: Google Inc.; University of Oxford
[NeurIPS-2018]
Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization
Authors: Bruno Korbar, Du Tran, Lorenzo Torresani
Institution: Dartmouth College; Facebook Research
[NeurIPS-2020]
Learning Representations from Audio-Visual Spatial Alignment
Authors: Pedro Morgado, Yi Li, Nuno Vasconcelos
Institution: University of California San Diego
[NeurIPS-2020]
Self-Supervised Learning by Cross-Modal Audio-Video Clustering
Authors: Humam Alwassel, Dhruv Mahajan, Bruno Korbar, Lorenzo Torresani, Bernard Ghanem, Du Tran
Institution: King Abdullah University of Science and Technology; Facebook AI Research
[NeurIPS-2020]
Labelling Unlabelled Videos From Scratch With Multi-Modal Self-Supervision
Authors: Yuki Asano, Mandela Patrick, Christian Rupprecht, Andrea Vedaldi
Institution: University of Oxford; Facebook AI Research
[CVPR-2021]
Audio-Visual Instance Discrimination with Cross-Modal Agreement
Authors: Pedro Morgado, Nuno Vasconcelos, Ishan Misra
Institution: University of California San Diego; Facebook AI Research
[CVPR-2021]
Robust Audio-Visual Instance Discrimination
Authors: Pedro Morgado, Ishan Misra, Nuno Vasconcelos
Institution: University of California San Diego; Facebook AI Research
[2021]
Unsupervised Sound Localization via Iterative Contrastive Learning
Authors: Yan-Bo Lin, Hung-Yu Tseng, Hsin-Ying Lee, Yen-Yu Lin, Ming-Hsuan Yang
Institution: National Yang Ming Chiao Tung University; University of California; Snap Inc.; Google Research
[ICCV-2021]
Multimodal Clustering Networks for Self-Supervised Learning From Unlabeled Videos
Authors: Brian Chen, Andrew Rouditchenko, Kevin Duarte, Hilde Kuehne, Samuel Thomas, Angie Boggust, Rameswar Panda, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Michael Picheny, Shih-Fu Chang
Institution: Columbia University; Massachusetts Institute of Technology; University of Central Florida; Goethe University Frankfurt; IBM Research AI; MIT-IBM Watson AI Lab; The University of Texas at Austin; NYU-Courant CS & CDS
[2021]
OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation
Authors: Jing Liu, Xinxin Zhu, Fei Liu, Longteng Guo, Zijia Zhao, Mingzhen Sun, Weining Wang, Hanqing Lu, Shiyu Zhou, Jiajun Zhang, Jinqiao Wang
Institution: Chinese Academy of Sciences
[NeurIPS-2021]
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
Authors: Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, Boqing Gong
Institution: Columbia University; Google Inc.; Cornell University
[2021]
Audio-visual Representation Learning for Anomaly Events Detection in Crowds
Authors: Junyu Gao, Maoguo Gong, Xuelong Li
Institution: Xidian University; Northwestern Polytechnical University
[ICASSP-2022]
Audioclip: Extending Clip to Image, Text and Audio
Authors: Andrey Guzhov, Federico Raue, Jorn Hees, Andreas Dengel
Institution: TU Kaiserslautern; Deutsches Forschungszentrum für Künstliche Intelligenz GmbH
[CVPR-2022]
MERLOT Reserve: Neural Script Knowledge Through Vision and Language and Sound
Authors: Rowan Zellers, Jiasen Lu, Ximing Lu, Youngjae Yu, Yanpeng Zhao, Mohammadreza Salehi, Aditya Kusupati, Jack Hessel, Ali Farhadi, Yejin Choi
Institution: University of Washington; Allen Institute for Artificial Intelligence; University of Edinburgh
[2022]
Probing Visual-Audio Representation for Video Highlight Detection via Hard-Pairs Guided Contrastive Learning
Authors: Shuaicheng Li, Feng Zhang, Kunlin Yang, Lingbo Liu, Shinan Liu, Jun Hou, Shuai Yi
Institution: SenseTime Research; The Hong Kong Polytechnic University
[NeurIPS-2022]
Non-Linguistic Supervision for Contrastive Learning of Sentence Embeddings
Authors: Yiren Jian, Chongyang Gao, Soroush Vosoughi
Institution: Dartmouth College; Northwestern University
[IEEE TMM-2022]
Multimodal Information Bottleneck: Learning Minimal Sufficient Unimodal and Multimodal Representations
Authors: Sijie Mai, Ying Zeng, Haifeng Hu
Institution: Sun Yat-sen University; National Natural Science Foundation of China
[CVPR-2022]
Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language
Authors: Otniel-Bogdan Mercea, Lukas Riesch, A. Sophia Koepke, Zeynep Akata
Institution: University of Tübingen; Robert Bosch GmbH; Max Planck Institute
[CVPRW-2022]
Multi-task Learning for Human Affect Prediction with Auditory–Visual Synchronized Representation
Authors: Euiseok Jeong, Geesung Oh, Sejoon Lim
Institution: Kookmin University
[CVPR-2023]
Vision Transformers are Parameter-Efficient Audio-Visual Learners
Authors: Yan-Bo Lin, Yi-Lin Sung, Jie Lei, Mohit Bansal, Gedas Bertasius
Institution: The University of North Carolina at Chapel Hill
[ECCV-2022]
Temporal and cross-modal attention for audio-visual zero-shot learning
Authors: Otniel-Bogdan Mercea, Thomas Hummel, A. Sophia Koepke, Zeynep Akata
Institution: University of Tübingen; Max Planck Institute
[NeurIPS-2022]
u-HuBERT: Unified Mixed-Modal Speech Pretraining And Zero-Shot Transfer to Unlabeled Modality
Authors: Wei-Ning Hsu, Bowen Shi
Institution: Meta AI
[NeurIPS-2022]
Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization
Authors: Junru Wu, Yi Liang, Feng Han, Hassan Akbari, Zhangyang Wang, Cong Yu
Institution: Texas A&M University; Google Research; University of Texas at Austin; Celonis Inc.
[AAAI-2023]
Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity
Authors: Pritam Sarkar, Ali Etemad
Institution: Queen’s University; Vector Institute
[ICLR-2023]
Contrastive Audio-Visual Masked Autoencoder
Authors: Yuan Gong, Andrew Rouditchenko, Alexander H. Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, James R. Glass
Institution: Massachusetts Institute of Technology; The University of Texas at Austin; MIT-IBM Watson AI Lab; Goethe University Frankfurt
[ICLR-2023]
Jointly Learning Visual and Auditory Speech Representations from Raw Data
Authors: Alexandros Haliassos, Pingchuan Ma, Rodrigo Mira, Stavros Petridis, Maja Pantic
Institution: Imperial College London; Meta AI
[WACV-2023]
Audio Representation Learning by Distilling Video as Privileged Information
Authors: Amirhossein Hajavi, Ali Etemad
Institution: Queen’s University, Canada
[2023]
AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations
Authors: Jiachen Lian, Alexei Baevski, Wei-Ning Hsu, Michael Auli
Institution: University of California, Berkeley; Meta AI
[AAAI-2023]
Audio-Visual Contrastive Learning with Temporal Self-Supervision
Authors: Simon Jenni, Alexander Black, John Collomosse
Institution: Adobe Research; University of Surrey
[CVPR-2023]
ImageBind One Embedding Space to Bind Them All
Authors: Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra
Institution: Meta AI
[NeurIPS-2023]
Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks
Authors: Haoyi Duan, Yan Xia, Mingze Zhou, Li Tang, Jieming Zhu, Zhou Zhao
Institution: Zhejiang University; Shanghai Artificial Intelligence Laboratory; Huawei Noah’s Ark Lab
[WACV-2024]
OmniVec: Learning robust representations with cross modal sharing
Authors: Siddharth Srivastava, Gaurav Sharma
Institution: TensorTour Inc.
[InterSpeech-2024]
Zero-Shot Fake Video Detection by Audio-Visual Consistency
Authors: Xiaolou Li, Zehua Liu, Chen Chen, Lantian Li, Li Guo, Dong Wang
Institution: School of Artificial Intelligence, Beijing University of Posts and Telecommunications, China; Center for Speech and Language Technologies, BNRist, Tsinghua University, China
[ICML-2024]
From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation
Authors: Kun Su, Xiulong Liu, Eli Shlizerman
Institution: Department of ECE, University of Washington, Seattle, United States; Department of Applied Math, University of Washington, Seattle, United States
Audio-visual Localization
Sound Localization in Videos
[ECCV-2018]
Objects that Sound
Authors: Relja Arandjelovic, Andrew Zisserman
Institution: Google Inc.; University of Oxford
[ECCV-2018]
Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
Authors: Andrew Owens, Alexei A. Efros
Institution: University of California, Berkeley
[ECCV-2018]
The Sound of Pixels
Authors: Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, Antonio Torralba
Institution: Massachusetts Institute of Technology; MIT-IBM Watson AI Lab; Columbia University
[ICASSP-2019]
Self-supervised Audio-visual Co-segmentation
Authors: Andrew Rouditchenko, Hang Zhao, Chuang Gan, Josh McDermott, Antonio Torralba
Institution: Massachusetts Institute of Technology; MIT-IBM Watson AI Lab
[ICCV-2019]
The Sound of Motions
Authors: Hang Zhao, Chuang Gan, Wei-Chiu Ma, Antonio Torralba
Institution: Massachusetts Institute of Technology; MIT-IBM Watson AI Lab
[CVPR-2019]
Deep Multimodal Clustering for Unsupervised Audiovisual Learning
Authors: Di Hu, Feiping Nie, Xuelong Li
Institution: Northwestern Polytechnical University
[CVPR-2021]
Localizing Visual Sounds the Hard Way
Authors: Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman
Institution: University of Oxford
[IEEE TPAMI-2021]
Class-aware Sounding Objects Localization via Audiovisual Correspondence
Authors: Di Hu, Yake Wei, Rui Qian, Weiyao Lin, Ruihua Song, Ji-Rong Wen
Institution: Renmin University of China; Shanghai Jiao Tong University
[IEEE TPAMI-2021]
Learning to Localize Sound Sources in Visual Scenes: Analysis and Applications
Authors: Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, In So Kweon
Institution: Korea Advanced Institute of Science and Technology; Pohang University of Science and Technology; University of California
[CVPR-2022]
Mix and Localize: Localizing Sound Sources in Mixtures
Authors: Xixi Hu, Ziyang Chen, Andrew Owens
Institution: University of Michigan; The University of Texas at Austin
[ECCV-2022]
Audio-Visual Segmentation
Authors: Jinxing Zhou, Jianyuan Wang, Jiayi Zhang, Weixuan Sun, Jing Zhang, Stan Birchfield, Dan Guo, Lingpeng Kong, Meng Wang, Yiran Zhong
Institution: Hefei University of Technology; SenseTime Research; Australian National University; Beihang University; NVIDIA; The University of Hong Kong; Shanghai Artificial Intelligence Laboratory
[2022]
Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization
Authors: Hao Jiang, Calvin Murdock, Vamsi Krishna Ithapu
Institution: Meta Reality Labs
[ACM MM-2022]
Exploiting Transformation Invariance and Equivariance for Self-supervised Sound Localisation
Authors: Jinxiang Liu, Chen Ju, Weidi Xie, Ya Zhang
Institution: Shanghai Jiao Tong University; Shanghai AI Laboratory
[CVPR-2022]
Self-Supervised Predictive Learning: A Negative-Free Method for Sound Source Localization in Visual Scenes
Authors: Zengjie Song, Yuxi Wang, Junsong Fan, Tieniu Tan, Zhaoxiang Zhang
Institution: Chinese Academy of Sciences; University of Chinese Academy of Sciences
[CVPR-2022]
Self-supervised object detection from audio-visual correspondence
Authors: Triantafyllos Afouras, Yuki M. Asano, Francois Fagan, Andrea Vedaldi, Florian Metze
Institution: University of Oxford; University of Amsterdam; Meta AI
[EUSIPCO-2022]
Visually Assisted Self-supervised Audio Speaker Localization and Tracking
Authors: Jinzheng Zhao, Peipei Wu, Shidrokh Goudarzi, Xubo Liu, Jianyuan Sun, Yong Xu, Wenwu Wang
Institution: University of Surrey; Tencent AI Lab, Bellevue
[ICASSP-2023]
MarginNCE: Robust Sound Localization with a Negative Margin
Authors: Sooyoung Park, Arda Senocak, Joon Son Chung
Institution: Korea Advanced Institute of Science and Technology; Electronics and Telecommunications Research Institute, South Korea
[IEEE TMM-2022]
Cross modal video representations for weakly supervised active speaker localization
Authors: Rahul Sharma, Krishna Somandepalli, Shrikanth Narayanan
Institution: University of Southern California; Google Inc.
[NeurIPS-2022]
A Closer Look at Weakly-Supervised Audio-Visual Source Localization
Authors: Shentong Mo, Pedro Morgado
Institution: Carnegie Mellon University; University of Wisconsin-Madison
[AAAI-2022]
Visual Sound Localization in the Wild by Cross-Modal Interference Erasing
Authors: Xian Liu, Rui Qian, Hang Zhou, Di Hu, Weiyao Lin, Ziwei Liu, Bolei Zhou, Xiaowei Zhou
Institution: The Chinese University of Hong Kong; Zhejiang University; Shanghai Jiao Tong University; Renmin University of China; Nanyang Technological University
[ECCV-2022]
Sound Localization by Self-Supervised Time Delay Estimation
Authors: Ziyang Chen, David F. Fouhey, Andrew Owens
Institution: University of Michigan
[IEEE/ACM TASLP-2023]
Audio-Visual Cross-Attention Network for Robotic Speaker Tracking
Authors: Xinyuan Qian, Zhengdong Wang, Jiadong Wang, Guohui Guan, Haizhou Li
Institution: University of Science and Technology Beijing; Chinese University of Hong Kong; Shenzhen Research Institute of Big Data; National University of Singapore; University of California at Berkeley; University of Bremen
[WACV-2023]
Hear The Flow: Optical Flow-Based Self-Supervised Visual Sound Source Localization
Authors: Dennis Fedorishin, Deen Dayal Mohan, Bhavin Jawade, Srirangaraj Setlur, Venu Govindaraju
Institution: University at Buffalo
[WACV-2023]
Exploiting Visual Context Semantics for Sound Source Localization
Authors: Xinchi Zhou, Dongzhan Zhou, Di Hu, Hang Zhou, Wanli Ouyang
Institution: The University of Sydney; Renmin University of China; Baidu Inc.
[2023]
Audio-Visual Segmentation with Semantics
Authors: Jinxing Zhou, Xuyang Shen, Jianyuan Wang, Jiayi Zhang, Weixuan Sun, Jing Zhang, Stan Birchfield, Dan Guo, Lingpeng Kong, Meng Wang, Yiran Zhong
Institution: Hefei University of Technology; SenseTime Research; University of Oxford; Australian National University; Beihang University; NVIDIA; The University of Hong Kong; Shanghai Artificial Intelligence Laboratory
[CVPR-2023]
Learning Audio-Visual Source Localization via False Negative Aware Contrastive Learning
Authors: Weixuan Sun, Jiayi Zhang, Jianyuan Wang, Zheyuan Liu, Yiran Zhong, Tianpeng Feng, Yandong Guo, Yanhao Zhang, Nick Barnes
Institution: Australian National University; Beihang University; The University of Oxford; Shanghai AI Lab; OPPO Research Institute
[CVPR-2023]
Egocentric Audio-Visual Object Localization
Authors: Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu
Institution: University of Rochester; Meta Reality Labs Research
[CVPR-2023]
Audio-Visual Grouping Network for Sound Localization from Mixtures
Authors: Shentong Mo, Yapeng Tian
Institution: Carnegie Mellon University; University of Texas at Dallas
[ICASSP-2023]
FlowGrad: Using Motion for Visual Sound Source Localization
Authors: Rajsuryan Singh, Pablo Zinemanas, Xavier Serra, Juan Pablo Bello, Magdalena Fuentes
Institution: Universitat Pompeu Fabra; New York University
[ACM MM-2023]
Audio-Visual Segmentation by Exploring Cross-Modal Mutual Semantics
Authors: Chen Liu, Peike Li, Xingqun Qi, Hu Zhang, Lincheng Li, Dadong Wang, Xin Yu
Institution: University of Technology Sydney; The University of Queensland; Futureverse; The Hong Kong University of Science and Technology; CSIRO Data61; Netease Fuxi AI Lab
[ACM MM-2023]
Induction Network: Audio-Visual Modality Gap-Bridging for Self-Supervised Sound Source Localization
Authors: Tianyu Liu, Peng Zhang, Wei Huang, Yufei Zha, Tao You, Yanning Zhang
Institution: Northwestern Polytechnical University; Nanchang University
[ACM MM-2023]
Audio-Visual Spatial Integration and Recursive Attention for Robust Sound Source Localization
Authors: Sung Jin Um, Dongjin Kim, Jung Uk Kim
Institution: Kyung Hee University
[IJCAI-2023]
Discovering Sounding Objects by Audio Queries for Audio Visual Segmentation
Authors: Shaofei Huang, Han Li, Yuqing Wang, Hongji Zhu, Jiao Dai, Jizhong Han, Wenge Rong, Si Liu
Institution: Chinese Academy of Sciences; University of Chinese Academy of Sciences; Beihang University; Alibaba Group
[ICCV-2023]
Sound Source Localization is All about Cross-Modal Alignment
Authors: Arda Senocak, Hyeonggon Ryu, Junsik Kim, Tae-Hyun Oh, Hanspeter Pfister, Joon Son Chung
Institution: Korea Advanced Institute of Science and Technology; Harvard University; Pohang University of Science and Technology; Yonsei University
[ICCV-2023]
Multimodal Variational Auto-encoder based Audio-Visual Segmentation
Authors: Yuxin Mao, Jing Zhang, Mochu Xiang, Yiran Zhong, Yuchao Dai
Institution: Northwestern Polytechnical University; Shaanxi Key Laboratory of Information Acquisition and Processing; Australian National University; Shanghai AI Laboratory
[WACV-2024]
Can CLIP Help Sound Source Localization?
Authors: Sooyoung Park, Arda Senocak, Joon Son Chung
Institution: Korea Advanced Institute of Science and Technology; Electronics and Telecommunications Research Institute
[NeurIPS-2023]
Dual Mean-Teacher: An Unbiased Semi-Supervised Framework for Audio-Visual Source Localization
Authors: Yuxin Guo, Shijie Ma, Hu Su, Zhiqing Wang, Yuhao Zhao, Wei Zou, Siyang Sun, Yun Zheng
Institution: School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences, Beijing, China; DAMO Academy, Alibaba Group
[ICASSP-2024]
Cross Pseudo-Labeling for Semi-Supervised Audio-Visual Source Localization
Authors: Yuxin Guo, Shijie Ma, Yuhao Zhao, Hu Su, Wei Zou
Institution: School of Artificial Intelligence, University of Chinese Academy of Sciences; State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS); Institute of Automation of Chinese Academy of Sciences
[CVPR-2024]
Audio-Visual Segmentation via Unlabeled Frame Exploitation
Authors: Jinxiang Liu, Yikun Liu, Fei Zhang, Chen Ju, Ya Zhang, Yanfeng Wang
Institution: Cooperative Medianet Innovation Center, Shanghai Jiao Tong University; Shanghai AI Laboratory
[ACM MM-2024]
CACE-Net: Co-guidance Attention and Contrastive Enhancement for Effective Audio-Visual Event Localization
Authors: Xiang He, Xiangxi Liu, Yang Li, Dongcheng Zhao, Guobin Shen, Qingqun Kong, Xin Yang, Yi Zeng
Institution: Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences, Beijing, China; Center for Long-term Artificial Intelligence, Beijing, China; Key Laboratory of Brain Cognition and Brain-inspired Intelligence Technology, CAS, Shanghai, China
[ACM MM-2024]
Open-Vocabulary Audio-Visual Semantic Segmentation
Authors: Ruohao Guo, Liao Qu, Dantong Niu, Yanyu Qi, Wenzhen Yue, Ji Shi, Bowei Xing, Xianghua Ying
Institution: National Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University, Beijing, China; Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA; Berkeley AI Research, University of California, Berkeley, CA, USA; College of Information and Electrical Engineering, China Agricultural University, Beijing, China
[ECCV-2024]
Spherical World-Locking for Audio-Visual Localization in Egocentric Videos
Authors: Heeseung Yun, Ruohan Gao, Ishwarya Ananthabhotla, Anurag Kumar, Jacob Donley, Chao Li, Gunhee Kim, Vamsi Krishna Ithapu, Calvin Murdock
Institution: Seoul National University; Reality Labs Research at Meta
Audio-visual Saliency Detection
[2019]
DAVE: A Deep Audio-Visual Embedding for Dynamic Saliency Prediction
Authors: Hamed R. Tavakoli, Ali Borji, Esa Rahtu, Juho Kannala
Institution: Aalto University; Tampere University
[CVPR-2020]
STAViS: Spatio-Temporal AudioVisual Saliency Network
Authors: Antigoni Tsiami, Petros Koutras, Petros Maragos
Institution: National Technical University of Athens
[IEEE TIP-2020]
A Multimodal Saliency Model for Videos With High Audio-visual Correspondence
Authors: Xiongkuo Min, Guangtao Zhai, Jiantao Zhou, Xiao-Ping Zhang, Xiaokang Yang, Xinping Guan
Institution: Shanghai Jiao Tong University; University of Macau; Ryerson University
[IROS-2021]
ViNet: Pushing the limits of Visual Modality for Audio-Visual Saliency Prediction
Authors: Samyak Jain, Pradeep Yarlagadda, Shreyank Jyoti, Shyamgopal Karthik, Ramanathan Subramanian, Vineet Gandhi
Institution: International Institute for Information Technology; University of Canberra
[CVPR-2021]
From Semantic Categories to Fixations: A Novel Weakly-Supervised Visual-Auditory Saliency Detection Approach
Authors: Guotao Wang, Chenglizhao Chen, Deng-Ping Fan, Aimin Hao, Hong Qin
Institution: Beihang University; Qingdao University; Chinese Academy of Medical Sciences
[ICME-2021]
LAVS: A Lightweight Audio-Visual Saliency Prediction Model
Authors: Dandan Zhu, Defang Zhao, Xiongkuo Min, Tian Han, Qiangqiang Zhou, Shaobo Yu, Yongqing Chen, Guangtao Zhai, Xiaokang Yang
Institution: Shanghai Jiao Tong University; Tongji University; Stevens Institute of Technology; Jiangxi Normal University; East China Normal University; Hainan University
[2022]
A Comprehensive Survey on Video Saliency Detection with Auditory Information: the Audio-visual Consistency Perceptual is the Key!
Authors: Chenglizhao Chen, Mengke Song, Wenfeng Song, Li Guo, Muwei Jian
Institution: China University of Petroleum; Shandong University of Finance and Economics; Beijing Information Science and Technology University
[TOMCCAP-2022]
PAV-SOD: A New Task Towards Panoramic Audiovisual Saliency Detection
Authors: Yi Zhang, Fang-Yi Chao, Wassim Hamidouche, Olivier Deforges
Institution: University Rennes; Institut National des Sciences Appliquées Rennes; Centre national de la recherche scientifique; Trinity College Dublin
[CVPR-2023]
CASP-Net: Rethinking Video Saliency Prediction from an Audio-Visual Consistency Perceptual Perspective
Authors: Junwen Xiong, Ganglai Wang, Peng Zhang, Wei Huang, Yufei Zha, Guangtao Zhai
Institution: Northwestern Polytechnical University; Ningbo Institute of Northwestern Polytechnical University; Nanchang University; Shanghai Jiao Tong University
[CVPR-2023]
Self-Supervised Video Forensics by Audio-Visual Anomaly Detection
Authors: Chao Feng, Ziyang Chen, Andrew Owens
Institution: University of Michigan
[IJCNN-2023]
3DSEAVNet: 3D-Squeeze-and-Excitation Networks for Audio-Visual Saliency Prediction
Authors: Silong Liang, Chunxiao Li, Naying Cui, Minghui Sun, Hao Xue
Institution: Jilin University
[IEEE TMM-2023]
SVGC-AVA: 360-Degree Video Saliency Prediction with Spherical Vector-Based Graph Convolution and Audio-Visual Attention
Authors: Qin Yang, Yuqi Li, Chenglin Li, Hao Wang, Sa Yan, Li Wei, Wenrui Dai, Junni Zou, Hongkai Xiong, Pascal Frossard
Institution: Shanghai Jiao Tong University; École Polytechnique Fédérale de Lausanne
[TMM-2023]
Unified Audio-Visual Saliency Model for Omnidirectional Videos With Spatial Audio
Authors: Dandan Zhu, Kaiwei Zhang, Nana Zhang, Qiangqiang Zhou, Xiongkuo Min, Guangtao Zhai, Xiaokang Yang
Institution: Institute of AI Education, East China Normal University, Shanghai; Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University; School of Computer Science and Technology, Donghua University; School of Software, Jiangxi Normal University; MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
[CVPR-2024]
DiffSal: Joint Audio and Video Learning for Diffusion Saliency Prediction
Authors: Junwen Xiong, Peng Zhang, Tao You, Chuanyue Li, Wei Huang, Yufei Zha
Institution: Northwestern Polytechnical University; Ningbo Institute of Northwestern Polytechnical University; Nanchang University
Audio-visual Navigation
[ECCV-2020]
SoundSpaces: Audio-Visual Navigation in 3D Environments
Authors: Changan Chen, Unnat Jain, Carl Schissler, Sebastia Vicenc Amengual Gari, Ziad Al-Halah, Vamsi Krishna Ithapu, Philip Robinson, Kristen Grauman
Institution: The University of Texas at Austin; University of Illinois at Urbana-Champaign; Facebook Reality Labs; Facebook AI Research
[ICRA-2020]
Look, Listen, and Act: Towards Audio-Visual Embodied Navigation
Authors: Chuang Gan, Yiwei Zhang, Jiajun Wu, Boqing Gong, Joshua B. Tenenbaum
Institution: MIT-IBM Watson AI Lab; Tsinghua University; Massachusetts Institute of Technology; Google Inc.
[ICLR-2021]
Learning to Set Waypoints for Audio-Visual Navigation
Authors: Changan Chen, Sagnik Majumder, Ziad Al-Halah, Ruohan Gao, Santhosh Kumar Ramakrishnan, Kristen Grauman
Institution: The University of Texas at Austin; Facebook AI Research
[CVPR-2021]
Semantic Audio-Visual Navigation
Authors: Changan Chen, Ziad Al-Halah, Kristen Grauman
Institution: The University of Texas at Austin; Facebook AI Research
[ICCV-2021]
Move2Hear: Active Audio-Visual Source Separation
Authors: Sagnik Majumder, Ziad Al-Halah, Kristen Grauman
Institution: The University of Texas at Austin; Facebook AI Research
[2022]
Sound Adversarial Audio-Visual Navigation
Authors: Yinfeng Yu, Wenbing Huang, Fuchun Sun, Changan Chen, Yikai Wang, Xiaohong Liu
Institution: Tsinghua University; Xinjiang University; The University of Texas at Austin; JD Explore Academy
[CVPR-2022]
Towards Generalisable Audio Representations for Audio-Visual Navigation
Authors: Shunqi Mao, Chaoyi Zhang, Heng Wang, Weidong Cai
Institution: University of Sydney
[NeurIPS-2022]
SoundSpaces 2.0: A Simulation Platform for Visual-Acoustic Learning
Authors: Changan Chen, Carl Schissler, Sanchit Garg, Philip Kobernik, Alexander Clegg, Paul Calamia, Dhruv Batra, Philip W. Robinson, Kristen Grauman
Institution: The University of Texas at Austin; Reality Labs at Meta; Georgia Tech; Meta AI
[NeurIPS-2022]
AVLEN: Audio-Visual-Language Embodied Navigation in 3D Environments
Authors: Sudipta Paul, Amit K. Roy-Chowdhury, Anoop Cherian
Institution: University of California, Riverside; Mitsubishi Electric Research Labs, Cambridge
[BMVC-2022]
Pay Self-Attention to Audio-Visual Navigation
Authors: Yinfeng Yu, Lele Cao, Fuchun Sun, Xiaohong Liu, Liejun Wang
Institution: Tsinghua University; Motherbrain, EQT; Xinjiang University
[CVPR-2022]
Finding Fallen Objects Via Asynchronous Audio-Visual Integration
Authors: Chuang Gan, Yi Gu, Siyuan Zhou, Jeremy Schwartz, Seth Alter, James Traer, Dan Gutfreund, Joshua B. Tenenbaum, Josh McDermott, Antonio Torralba
Institution: Massachusetts Institute of Technology; MIT-IBM Watson AI Lab
[CVPR-2022]
ObjectFolder 2.0: A Multisensory Object Dataset for Sim2Real Transfer
Authors: Ruohan Gao, Zilin Si, Yen-Yu Chang, Samuel Clarke, Jeannette Bohg, Li Fei-Fei, Wenzhen Yuan, Jiajun Wu
Institution: Stanford University; Carnegie Mellon University
[IEEE RAL-2023]
Catch Me If You Hear Me: Audio-Visual Navigation in Complex Unmapped Environments with Moving Sounds
Authors: Abdelrahman Younes, Daniel Honerkamp, Tim Welschehold, Abhinav Valada
Institution: University of Freiburg
[2023]
Audio Visual Language Maps for Robot Navigation
Authors: Chenguang Huang, Oier Mees, Andy Zeng, Wolfram Burgard
Institution: University of Freiburg; Google Research; University of Technology Nuremberg
[ICCV-2023]
Omnidirectional Information Gathering for Knowledge Transfer-based Audio-Visual Navigation
Authors: Jinyu Chen, Wenguan Wang, Si Liu, Hongsheng Li, Yi Yang
Institution: Beihang University; Zhejiang University; The Chinese University of Hong Kong
[IROS-2024]
Audio-Visual Traffic Light State Detection for Urban Robots
Authors: Sagar Gupta, Akansel Cosgun
Institution: Deakin University, Australia
Audio-visual Event Localization and Parsing
Localization
[ECCV-2018]
Audio-visual Event Localization in Unconstrained Videos
Authors: Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, Chenliang Xu
Institution: University of Rochester
[ICASSP-2019]
Dual-modality Seq2Seq Network for Audio-visual Event Localization
Authors: Yan-Bo Lin, Yu-Jhe Li, Yu-Chiang Frank Wang
Institution: National Taiwan University
[ICCV-2019]
Dual Attention Matching for Audio-Visual Event Localization
Authors: Yu Wu, Linchao Zhu, Yan Yan, Yi Yang
Institution: Baidu Research; University of Technology Sydney; Texas State University
[AAAI-2020]
Cross-Modal Attention Network for Temporal Inconsistent Audio-Visual Event Localization
Authors: Hanyu Xuan, Zhenyu Zhang, Shuo Chen, Jian Yang, Yan Yan
Institution: Nanjing University of Science and Technology
[ACCV-2020]
Audiovisual Transformer with Instance Attention for Audio-Visual Event Localization
Authors: Yan-Bo Lin, Yu-Chiang Frank Wang
Institution: National Taiwan University; ASUS Intelligent Cloud Services
[WACV-2021]
Audio-Visual Event Localization via Recursive Fusion by Joint Co-Attention
Authors: Bin Duan, Hao Tang, Wei Wang, Ziliang Zong, Guowei Yang, Yan Yan
Institution: Illinois Institute of Technology; University of Trento; Texas State University
[CVPR-2021]
Positive Sample Propagation along the Audio-Visual Event Line
Authors: Jinxing Zhou, Liang Zheng, Yiran Zhong, Shijie Hao, Meng Wang
Institution: Hefei University of Technology; Intelligent Interconnected Systems Laboratory of Anhui Province; Australian National University
[AIKE-2021]
Audio-Visual Event Localization based on Cross-Modal Interacting Guidance
Authors: Qiurui Yue, Xiaoyu Wu, Jiayi Gao
Institution: Communication University of China
[IEEE TMM-2021]
Audio-Visual Event Localization by Learning Spatial and Semantic Co-attention
Authors: Cheng Xue, Xionghu Zhong, Minjie Cai, Hao Chen, Wenwu Wang
Institution: Hunan University; University of Surrey
[CVPR-2022]
Cross-Modal Background Suppression for Audio-Visual Event Localization
Authors: Yan Xia, Zhou Zhao
Institution: Zhejiang University
[ICASSP-2022]
Bi-Directional Modality Fusion Network For Audio-Visual Event Localization
Authors: Shuo Liu, Weize Quan, Yuan Liu, Dong-Ming Yan
Institution: Chinese Academy of Sciences; Alibaba Group
[ICSIP-2022]
Audio-Visual Event and Sound Source Localization Based on Spatial-Channel Feature Fusion
Authors: Xiaolong Zheng, Ying Wei
Institution: Shandong University
[IJCNN-2022]
Look longer to see better: Audio-visual event localization by exploiting long-term correlation
Authors: Longyin Guo, Qijun Zhao, Hongmei Gao
Institution: Sichuan University; Tibet University
[EUSIPCO-2022]
Audio Visual Graph Attention Networks for Event Detection in Sports Video
Authors: Taichi Ishiwatari, Makiko Azuma, Takuya Handa, Masaki Takahashi, Takahiro Mochizuki, Masanori Sano
Institution: Science and Technology Research Laboratories, NHK; Tokyo Institute of Technology
[IEEE TPAMI-2022]
Contrastive Positive Sample Propagation along the Audio-Visual Event Line
Authors: Jinxing Zhou, Dan Guo, Meng Wang
Institution: Hefei University of Technology
[IEEE TPAMI-2022]
Semantic and Relation Modulation for Audio-Visual Event Localization
Authors: Hao Wang, Zheng-Jun Zha, Liang Li, Xuejin Chen, Jiebo Luo
Institution: University of Science and Technology of China; Chinese Academy of Sciences; University of Rochester
[WACV-2023]
AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization
Authors: Tanvir Mahmud, Diana Marculescu
Institution: The University of Texas at Austin
[WACV-2023]
Event-Specific Audio-Visual Fusion Layers: A Simple and New Perspective on Video Understanding
Authors: Arda Senocak, Junsik Kim, Tae-Hyun Oh, Dingzeyu Li, In So Kweon
Institution: Korea Advanced Institute of Science & Technology; Harvard University; Pohang University of Science and Technology; Adobe Research
[ICASSP-2023]
A Dataset for Audio-Visual Sound Event Detection in Movies
Authors: Rajat Hebbar, Digbalay Bose, Krishna Somandepalli, Veena Vijai, Shrikanth Narayanan
Institution: University of Southern California
[CVPR-2023]
Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline
Authors: Tiantian Geng, Teng Wang, Jinming Duan, Runmin Cong, Feng Zheng
Institution: Southern University of Science and Technology; University of Birmingham; The University of Hong Kong; Shandong University; Peng Cheng Laboratory
[CVPR-2023]
Collaborative Noisy Label Cleaner: Learning Scene-aware Trailers for Multi-modal Highlight Detection in Movies
Authors: Bei Gan, Xiujun Shu, Ruizhi Qiao, Haoqian Wu, Keyu Chen, Hanjun Li, Bo Ren
Institution: Tencent YouTu Lab
[ICASSP-2023]
Collaborative Audio-Visual Event Localization Based on Sequential Decision and Cross-Modal Consistency
Authors: Yuqian Kuang, Xiaopeng Fan
Institution: Harbin Institute of Technology; Peng Cheng Laboratory
[CVPR-2023]
Collecting Cross-Modal Presence-Absence Evidence for Weakly-Supervised Audio-Visual Event Perception
Authors: Junyu Gao, Mengyuan Chen, Changsheng Xu
Institution: Chinese Academy of Sciences; University of Chinese Academy of Sciences; Peng Cheng Laboratory
[IJCNN-2023]
Specialty may be better: A decoupling multi-modal fusion network for Audio-visual event localization
Authors: Jinqiao Dou, Xi Chen, Yuehai Wang
Institution: Zhejiang University
[AAAI-2023]
Furnishing Sound Event Detection with Language Model Abilities
Authors: Hualei Wang, Jianguo Mao, Zhifang Guo, Jiarui Wan, Hong Liu, Xiangdong Wang
Institution: Chinese Academy of Sciences; University of Chinese Academy of Sciences; Beijing Jiaotong University
[IEEE TMM-2023]
Leveraging the Video-level Semantic Consistency of Event for Audio-visual Event Localization
Authors: Yuanyuan Jiang, Jianqin Yin, Yonghao Dang
Institution: Beijing University of Posts and Telecommunications
[CVPR-2024]
T-VSL: Text-Guided Visual Sound Source Localization in Mixtures
Authors: Tanvir Mahmud, Yapeng Tian, Diana Marculescu
Institution: University of Texas at Austin
[ICME-2024]
Exploring Audio-Visual Information Fusion for Sound Event Localization and Detection In Low-Resource Realistic Scenarios
Authors: Ya Jiang, Qing Wang, Jun Du, Maocheng Hu, Pengfei Hu, Zeyan Liu, Shi Cheng, Zhaoxu Nian, Yuxuan Dong, Mingqi Cai, Xin Fang, Chin-Hui Lee
Institution: University of Science and Technology of China; iFlytek Research; Georgia Institute of Technology
[ECCV-2024]
Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes
Authors: Yaoting Wang, Peiwen Sun, Dongzhan Zhou, Guangyao Li, Honggang Zhang, Di Hu
Institution: Gaoling School of Artificial Intelligence, Renmin University of China; Beijing University of Posts and Telecommunications; Shanghai Artificial Intelligence Laboratory; Engineering Research Center of Next-Generation Search and Recommendation
[ECCV-2024]
Stepping Stones: A Progressive Training Strategy for Audio-Visual Semantic Segmentation
Authors: Juncheng Ma, Peiwen Sun, Yaoting Wang, Di Hu
Institution: University of Chinese Academy of Sciences; Beijing University of Posts and Telecommunications; Gaoling School of Artificial Intelligence, Renmin University of China; Engineering Research Center of Next-Generation Search and Recommendation
[ACM MM-2024]
Unveiling and Mitigating Bias in Audio Visual Segmentation
Authors: Peiwen Sun, Honggang Zhang, Di Hu
Institution: Beijing University of Posts and Telecommunications, Beijing, China; Renmin University of China, Beijing, China
[ICASSP-2025]
A Critical Assessment of Visual Sound Source Localization Models Including Negative Audio
Authors: Xavier Juanola, Gloria Haro, Magdalena Fuentes
Institution: Universitat Pompeu Fabra, Barcelona, Spain; MARL-IDM, New York University, New York, USA
Parsing
[ECCV-2020]
Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing
Authors: Yapeng Tian, Dingzeyu Li, Chenliang Xu
Institution: University of Rochester; Adobe Research
[CVPR-2021]
Exploring Heterogeneous Clues for Weakly-Supervised Audio-Visual Video Parsing
Authors: Yu Wu, Yi Yang
Institution: Baidu Research; University of Technology Sydney
[NeurIPS-2021]
Exploring Cross-Video and Cross-Modality Signals for Weakly-Supervised Audio-Visual Video Parsing
Authors: Yan-Bo Lin, Hung-Yu Tseng, Hsin-Ying Lee, Yen-Yu Lin, Ming-Hsuan Yang
Institution: National Yang Ming Chiao Tung University; UNC Chapel Hill; University of California, Merced; Snap Research; Google Research; Yonsei University
[2022]
Investigating Modality Bias in Audio Visual Video Parsing
Authors: Piyush Singh Pasi, Shubham Nemani, Preethi Jyothi, Ganesh Ramakrishnan
Institution: Indian Institute of Technology Bombay
[ICASSP-2022]
Distributed Audio-Visual Parsing Based On Multimodal Transformer and Deep Joint Source Channel Coding
Authors: Penghong Wang, Jiahui Li, Mengyao Ma, Xiaopeng Fan
Institution: Harbin Institute of Technology; Wireless Technology Lab
[ECCV-2022]
Joint-Modal Label Denoising for Weakly-Supervised Audio-Visual Video Parsing
Authors: Haoyue Cheng, Zhaoyang Liu, Hang Zhou, Chen Qian, Wayne Wu, Limin Wang
Institution: Nanjing University; SenseTime Research; The Chinese University of Hong Kong; Shanghai AI Laboratory
[NeurIPS-2022]
Multi-modal Grouping Network for Weakly-Supervised Audio-Visual Video Parsing
Authors: Shentong Mo, Yapeng Tian
Institution: Carnegie Mellon University; University of Texas at Dallas
[2023]
Improving Audio-Visual Video Parsing with Pseudo Visual Labels
Authors: Jinxing Zhou, Dan Guo, Yiran Zhong, Meng Wang
Institution: Hefei University of Technology; Shanghai AI Lab
[ICASSP-2023]
CM-CS: Cross-Modal Common-Specific Feature Learning For Audio-Visual Video Parsing
Authors: Hongbo Chen, Dongchen Zhu, Guanghui Zhang, Wenjun Shi, Xiaolin Zhang, Jiamao Li
Institution: Chinese Academy of Sciences; ShanghaiTech University; University of Chinese Academy of Sciences
[2023]
Towards Long Form Audio-visual Video Understanding
Authors: Wenxuan Hou, Guangyao Li, Yapeng Tian, Di Hu
Institution: Renmin University of China; The University of Texas at Dallas
[CVPR-2023]
Collecting Cross-Modal Presence-Absence Evidence for Weakly-Supervised Audio-Visual Event Perception
Authors: Junyu Gao, Mengyuan Chen, Changsheng Xu
Institution: Chinese Academy of Sciences; University of Chinese Academy of Sciences; Peng Cheng Laboratory
[ACM MM-2023]
TMac: Temporal Multi-Modal Graph Learning for Acoustic Event Classification
Authors: Meng Liu, Ke Liang, Dayu Hu, Hao Yu, Yue Liu, Lingyuan Meng, Wenxuan Tu, Sihang Zhou, Xinwang Liu
Institution: National University of Defense Technology
[WACV-2024]
Rethink Cross-Modal Fusion in Weakly-Supervised Audio-Visual Video Parsing
Authors: Yating Xu, Conghui Hu, Gim Hee Lee
Institution: National University of Singapore
[ECCV-2024]
Label-anticipated Event Disentanglement for Audio-Visual Video Parsing
Authors: Jinxing Zhou, Dan Guo, Yuxin Mao, Yiran Zhong, Xiaojun Chang, Meng Wang
Institution: Hefei University of Technology; Anhui Zhonghuitong Technology Co., Ltd.; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center; Northwestern Polytechnical University; Shanghai AI Laboratory; University of Science and Technology of China; MBZUAI
Audio-visual Question Answering and Dialog
Question Answering
[ICCV-2021]
Pano-AVQA: Grounded Audio-Visual Question Answering on 360° Videos
Authors: Heeseung Yun, Youngjae Yu, Wonsuk Yang, Kangil Lee, Gunhee Kim
Institution: Seoul National University; Allen Institute for AI; University of Oxford; Hyundai Motor Company
[CVPR-2022]
Learning To Answer Questions in Dynamic Audio-Visual Scenarios
Authors: Guangyao Li, Yake Wei, Yapeng Tian, Chenliang Xu, Ji-Rong Wen, Di Hu
Institution: Renmin University of China; Beijing Key Laboratory of Big Data Management and Analysis Methods; University of Rochester
[NeurIPS-2022]
Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners
Authors: Zhenhailong Wang, Manling Li, Ruochen Xu, Luowei Zhou, Jie Lei, Xudong Lin, Shuohang Wang, Ziyi Yang, Chenguang Zhu, Derek Hoiem, Shih-Fu Chang, Mohit Bansal, Heng Ji
Institution: University of Illinois at Urbana-Champaign; Microsoft Research; The University of North Carolina at Chapel Hill; Columbia University
[ACM MM-2023]
Progressive Spatio-temporal Perception for Audio-Visual Question Answering
Authors: Guangyao Li, Wenxuan Hou, Di Hu
Institution: Renmin University of China
[WACV-2024]
CAD – Contextual Multi-modal Alignment for Dynamic AVQA
Authors: Asmar Nadeem, Adrian Hilton, Robert Dawes, Graham Thomas, Armin Mustafa
Institution: University of Surrey; BBC Research and Development
[AAAI-2024]
Object-aware Adaptive-Positivity Learning for Audio-Visual Question Answering
Authors: Zhangbin Li, Dan Guo, Jinxing Zhou, Jing Zhang, Meng Wang
Institution: School of Computer Science and Information Engineering, Hefei University of Technology; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
[Interspeech-2024]
Towards Multilingual Audio-Visual Question Answering
Authors: Orchid Chetia Phukan, Priyabrata Mallick, Swarup Ranjan Behera, Aalekhya Satya Narayani, Arun Balaji Buduru, Rajesh Sharma
Institution: IIIT-Delhi, India; Reliance Jio AICoE, Hyderabad, India; University of Tartu, Estonia
[ECCV-2024]
Learning Trimodal Relation for Audio-Visual Question Answering with Missing Modality
Authors: Kyu Ri Park, Hong Joo Lee, Jung Uk Kim
Institution: Kyung Hee University, Yong-in, South Korea; Technical University of Munich, Munich, Germany; Munich Center for Machine Learning (MCML), Munich, Germany
[ACM MM-2024]
Boosting Audio Visual Question Answering via Key Semantic-Aware Cues
Authors: Guangyao Li, Henghui Du, Di Hu
Institution: Gaoling School of Artificial Intelligence, Renmin University of China
Dialog
[CVPR-2019]
Audio Visual Scene-Aware Dialog
Authors: Huda Alamri, Vincent Cartillier, Abhishek Das, Jue Wang, Anoop Cherian, Irfan Essa, Dhruv Batra, Tim K. Marks, Chiori Hori, Peter Anderson, Stefan Lee, Devi Parikh
Institution: Georgia Institute of Technology; Mitsubishi Electric Research Laboratories
[Interspeech-2019]
Joint Student-Teacher Learning for Audio-Visual Scene-Aware Dialog
Authors: Chiori Hori, Anoop Cherian, Tim K. Marks, Takaaki Hori
Institution: Mitsubishi Electric Research Laboratories
[ICASSP-2019]
End-to-end Audio Visual Scene-aware Dialog Using Multimodal Attention-based Video Features
Authors: Chiori Hori, Huda Alamri, Jue Wang, Gordon Wichern, Takaaki Hori, Anoop Cherian, Tim K. Marks, Vincent Cartillier, Raphael Gontijo Lopes, Abhishek Das, Irfan Essa, Dhruv Batra, Devi Parikh
Institution: Mitsubishi Electric Research Laboratories; Georgia Institute of Technology
[CVPR-2019]
A Simple Baseline for Audio-Visual Scene-Aware Dialog
Authors: Idan Schwartz, Alexander G. Schwing, Tamir Hazan
Institution: Technion; University of Illinois at Urbana-Champaign
[CVPR-2019]
Exploring Context, Attention and Audio Features for Audio Visual Scene-Aware Dialog
Authors: Shachi H Kumar, Eda Okur, Saurav Sahay, Jonathan Huang, Lama Nachman
Institution: Anticipatory Computing Lab, Intel Labs
[2020]
TMT: A Transformer-based Modal Translator for Improving Multimodal Sequence Representations in Audio Visual Scene-aware Dialog
Authors: Wubo Li, Dongwei Jiang, Wei Zou, Xiangang Li
Institution: Didi Chuxing
[AAAI-2021]
Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers
Authors: Shijie Geng, Peng Gao, Moitreya Chatterjee, Chiori Hori, Jonathan Le Roux, Yongfeng Zhang, Hongsheng Li, Anoop Cherian
Institution: Rutgers University; The Chinese University of Hong Kong; University of Illinois at Urbana-Champaign; Mitsubishi Electric Research Laboratories
[2021]
VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs
Authors: Xudong Lin, Gedas Bertasius, Jue Wang, Shih-Fu Chang, Devi Parikh, Lorenzo Torresani
Institution: Columbia University; Facebook AI; Georgia Tech; Dartmouth
[ICASSP-2022]
Audio-Visual Scene-Aware Dialog and Reasoning Using Audio-Visual Transformers with Joint Student-Teacher Learning
Authors: Ankit Shah, Shijie Geng, Peng Gao, Anoop Cherian, Takaaki Hori, Tim K. Marks, Jonathan Le Roux, Chiori Hori
Institution: Mitsubishi Electric Research Laboratories; Carnegie Mellon University; Rutgers University; The Chinese University of Hong Kong
[WACV-2022]
QUALIFIER: Question-Guided Self-Attentive Multimodal Fusion Network for Audio Visual Scene-Aware Dialog
Authors: Muchao Ye, Quanzeng You, Fenglong Ma
Institution: The Pennsylvania State University; Microsoft Azure Computer Vision
[TACL-2022]
Learning English with Peppa Pig
Authors: Mitja Nikolaus, Afra Alishahi, Grzegorz Chrupała
Institution: Aix-Marseille University; Tilburg University
[2022]
End-to-End Multimodal Representation Learning for Video Dialog
Authors: Huda Alamri, Anthony Bilic, Michael Hu, Apoorva Beedu, Irfan Essa
Institution: Georgia Institute of Technology
[AAAI-2022]
Audio Visual Scene-Aware Dialog Generation with Transformer-based Video Representations
Authors: Yoshihiro Yamazaki, Shota Orihashi, Ryo Masumura, Mihiro Uchida, Akihiko Takashima
Institution: Nippon Telegraph and Telephone Corporation
[IEEE/ACM TASLP-2023]
DialogMCF: Multimodal Context Flow for Audio Visual Scene-Aware Dialog
Authors: Zhe Chen, Hongcheng Liu, Yu Wang
Institution: Cooperative Medianet Innovation Center, Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory
Datasets
Dataset | Year | Videos | Length | Data form | Video source | Task |
---|---|---|---|---|---|---|
LRW, LRS2 and LRS3 | 2016, 2018, 2018 | - | 800h+ | video | in the wild | Speech-related, speaker-related, face generation-related tasks |
VoxCeleb, VoxCeleb2 | 2017, 2018 | - | 2,000h+ | video | YouTube | Speech-related, speaker-related, face generation-related tasks |
AVA-ActiveSpeaker | 2019 | - | 38.5h | video | YouTube | Speech-related task, speaker-related task |
Kinetics-400 | 2017 | 306,245 | 850h+ | video | YouTube | Action recognition |
EPIC-KITCHENS | 2018 | 39,594 | 55h | video | Recorded videos | Action recognition |
CMU-MOSI | 2016 | 2,199 | 2h+ | video | YouTube | Emotion recognition |
CMU-MOSEI | 2018 | 23,453 | 65h+ | video | YouTube | Emotion recognition |
VGGSound | 2020 | 200k+ | 550h+ | video | YouTube | Action recognition, sound localization |
AudioSet | 2017 | 2M+ | 5,800h+ | video | YouTube | Action recognition, sound separation |
Greatest Hits | 2016 | 977 | 9h+ | video | Recorded videos | Sound generation |
MUSIC | 2018 | 714 | 23h+ | video | YouTube | Sound separation, sound localization |
FAIR-Play | 2019 | 1,871 | 5.2h | video with binaural sound | Recorded videos | Spatial sound generation |
YT-ALL | 2018 | 1,146 | 113.1h | 360° video | YouTube | Spatial sound generation |
Replica | 2019 | - | - | 3D environment | 3D simulator | Depth estimation |
AIST++ | 2021 | - | 5.2h | 3D video | Recorded videos | Dance generation |
TED | 2019 | - | 52h | video | TED talks | Gesture generation |
SumMe | 2014 | 25 | 1h+ | video with eye-tracking | User videos | Saliency detection |
AVE | 2018 | 4,143 | 11h+ | video | YouTube | Event localization |
LLP | 2020 | 11,849 | 32.9h | video | YouTube | Event parsing |
SoundSpaces | 2020 | - | - | 3D environment | 3D simulator | Audio-visual navigation |
AVSD | 2019 | 11,816 | 98h+ | video with dialog | Crowd-sourced | Audio-visual dialog |
Pano-AVQA | 2021 | 5.4k | 7.7h | 360° video with QA | Video-sharing platforms | Audio-visual question answering |
MUSIC-AVQA | 2022 | 9,288 | 150h+ | video with QA | YouTube | Audio-visual question answering |
AVSBench | 2022 | 5,356 | 14.8h+ | video | YouTube | Audio-visual segmentation, sound localization |
RAF | 2024 | - | 95h+ | 3D environment | Recorded videos | Spatial sound generation |
SPD | 2024 | - | 3.0h | Multi-view video | Recorded videos | Action recognition |
VoxBlink2 | 2024 | 2,097,062 | 16,672h | video | YouTube | Speaker identification |
BEWO-1M | 2024 | 1M+ | 2,400h+ | video | YouTube | Spatial sound generation |