[Paper] [Supplementary] [arXiv] [Poster] [Video-YouTube] [Video-Bilibili] [Code]
We are surrounded by audio and visual messages in daily life, and the two modalities jointly improve our ability to perceive and understand scenes. For instance, imagine that we are at a concert: watching the performance and listening to the music at the same time contribute to better enjoyment of the show. Inspired by this, how to make machines integrate multimodal information, especially natural modalities such as audio and vision, to achieve human-comparable scene perception and understanding is an interesting and valuable topic. We focus on the audio-visual question answering (AVQA) task, which aims to answer questions about different visual objects, sounds, and their associations in videos. The problem requires comprehensive multimodal understanding and spatio-temporal reasoning over audio-visual scenes.
For instance, as shown in this figure, answering the audio-visual question "Which clarinet makes the sound first?" about this instrumental ensemble requires locating the sounding "clarinet" objects in the audio-visual scene and focusing on the "first" sounding "clarinet" in the timeline. To answer the question correctly, both effective audio-visual scene understanding and spatio-temporal reasoning are essential.
Audio-visual question answering requires both auditory and visual modalities for multimodal scene understanding and spatio-temporal reasoning. For example, in a complex musical performance scene involving multiple sounding and non-sounding instruments, such as the one above, a VQA model that considers only the visual modality can hardly resolve the "sound first" term in the question, while an AQA model restricted to mono sound can hardly recognize left or right positions. Using both auditory and visual modalities, however, the question can be answered effortlessly.
To explore scene understanding and spatio-temporal reasoning over audio and visual modalities, we build a large-scale audio-visual dataset, MUSIC-AVQA, which focuses on the question-answering task. As noted above, high-quality datasets are of considerable value for AVQA research.
Why musical performance? Musical performance is a typical multimodal scene consisting of abundant audio and visual components as well as their interactions, which makes it well suited to exploring effective audio-visual scene understanding and reasoning.
We manually collect a large number of musical performance videos from YouTube. Specifically, 22 kinds of instruments, such as guitar, cello, and xylophone, are selected, and 9 audio-visual question types are designed accordingly, covering three different scenarios, i.e., audio, visual, and audio-visual. Annotations are collected using our GSAI labeling system (described below).
Videos in MUSIC-AVQA are publicly available on YouTube and are annotated via crowdsourcing. We explained to crowdworkers how the data would be used. Our dataset does not contain personally identifiable information or offensive content.
Illustrations of our MUSIC-AVQA dataset statistics. (a-d) Statistical analysis of the videos and QA pairs. (e) Question formulas. (f) Distribution of question templates, where the dark color indicates the number of QA pairs generated from real videos, and the light-colored area at the top of each bar indicates those from synthetic videos. (g) Distribution of the first n-grams in questions. Solving our QA pairs requires fine-grained scene understanding and spatio-temporal reasoning over the audio and visual modalities. For example, existential and location questions require spatial reasoning, and temporal questions require temporal reasoning. Best viewed in color.
Comparison with other video QA datasets. Our MUSIC-AVQA dataset focuses on the interaction between visual objects and the sounds they produce, offering QA pairs that cover audio, visual, and audio-visual questions, which is more comprehensive than other datasets. The collected videos in MUSIC-AVQA can facilitate audio-visual understanding in terms of spatial and temporal associations.
We design an audio-visual question answering labeling system to collect questions, and all QA pairs are collected with this system. The flow chart of the labeling system is shown in the figure below.
The labeling system consists of a questioning part and an answering part. In the questioning part, the annotator selects the performance type of the video and the included instruments, then the scene type, question type, and question template, and finally one question is automatically generated based on these selections. In the answering part, the annotator judges whether the question is reasonable; if it is unreasonable, the question is labeled again. The annotator then answers the question according to the video content, producing one QA pair.
Demo. MUSIC-AVQA is a large-scale spatio-temporal audio-visual dataset focusing on the question-answering task. The figure below shows different audio-visual scene types and their annotated QA pairs in the dataset.
In the first row, a), b), and c) represent real musical performance videos, namely solo, ensemble of the same instrument, and ensemble of different instruments. In the second row, d), e), and f) represent synthetic videos, which are created by random audio-video matching, audio overlaying, and video stitching, respectively.
Some video examples with QA pairs from the MUSIC-AVQA dataset. These examples provide a better understanding of the dataset and a more intuitive sense of the QA tasks posed in dynamic and complex audio-visual scenes.
Question: How many instruments are sounding in the video? Answer: two. To answer the question, an AVQA model needs to first identify the objects and sound sources in the video, and then count all sounding objects. Although there are three different sound sources in the audio modality, only two of them are visible. Rather than simply counting all audio and visual instances, exploiting the audio-visual association is important for AVQA.
Question: What is the first sounding instrument? Answer: piano. To answer the question, an AVQA model needs to not only associate all instruments with their corresponding sounds in the video, but also identify the first instrument that makes a sound. Thus, the AVQA task is not a simple recognition problem; it also involves audio-visual association and temporal reasoning.
Question: What is the left instrument of the second sounding instrument? Answer: guzheng. To answer the question, an AVQA model needs to first identify the second sounding instrument (the flute) and then infer the instrument to its left. Besides recognizing objects, exploring audio-visual associations, and performing temporal reasoning, spatial reasoning is also crucial for AVQA.
Raw videos:
Raw video frames (1 fps): Available at Baidu Drive (14.84GB) (pwd: cvpr) or Google Drive (coming soon!). In practice, it may be more convenient to extract the frames yourself from the raw videos (see the sketch after this download list).
Features (VGGish, ResNet18 and R(2+1)D):
Annotations (QA pairs, etc.): Available for download at GitHub.
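If you prefer to extract the 1 fps frames yourself rather than download them, below is a minimal sketch (not the official extraction script) that samples frames with ffmpeg; the directory names are placeholders and ffmpeg is assumed to be installed.

import subprocess
from pathlib import Path

VIDEO_DIR = Path("data/videos")   # placeholder: directory of downloaded raw videos
FRAME_DIR = Path("data/frames")   # placeholder: output directory for extracted frames

for video_path in sorted(VIDEO_DIR.glob("*.mp4")):
    out_dir = FRAME_DIR / video_path.stem
    out_dir.mkdir(parents=True, exist_ok=True)
    # Sample one frame per second, matching the 1 fps frames provided above.
    subprocess.run(
        ["ffmpeg", "-i", str(video_path), "-vf", "fps=1",
         str(out_dir / "%06d.jpg")],
        check=True,
    )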
The annotation files are stored in JSON format. Each entry contains seven keys: "video_id", "question_id", "type", "question_content", "templ_values", "question_deleted" and "anser". Below, we show an example entry from the JSON file:
{
"video_id": "00000272",
"question_id": 50,
"type": "[\"Audio-Visual\", \"Temporal\"]",
"question_content": "Where is the <FL> sounding instrument?",
"templ_values": "[\"first\"]",
"question_deleted": 0,
"anser": "right"
}
The example shows a QA pair annotated for the video identified by "video_id"; the pair itself is assigned the unique identifier "50" in "question_id". From the entry, we can retrieve the video name, the question (with its template values), the answer, and the question type the QA pair belongs to. For instance, answering the temporal question above correctly requires combining audio and visual modality information.
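As a quick illustration of this format, the sketch below loads an annotation file, counts QA pairs by question type, and fills the question template; the file name is a placeholder, and the file is assumed to be a JSON list of entries like the one above.

import ast
import json
from collections import Counter

# Placeholder path: substitute one of the released annotation files.
with open("avqa-train.json", "r") as f:
    annotations = json.load(f)

type_counts = Counter()
for qa in annotations:
    # "type" is stored as a stringified list, e.g. '["Audio-Visual", "Temporal"]'.
    qtype = tuple(ast.literal_eval(qa["type"]))
    type_counts[qtype] += 1

    # Fill the question template's <FL> placeholders with its template values.
    question = qa["question_content"]
    for value in ast.literal_eval(qa["templ_values"]):
        question = question.replace("<FL>", value, 1)

    answer = qa["anser"]  # note: the answer key is spelled "anser" in the files

print(type_counts.most_common())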
If you find our work useful in your research, please cite our CVPR 2022 paper.
@InProceedings{Li2022Learning,
  title     = {Learning to Answer Questions in Dynamic Audio-Visual Scenarios},
  author    = {Li, Guangyao and Wei, Yake and Tian, Yapeng and Xu, Chenliang and Wen, Ji-Rong and Hu, Di},
  booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2022},
}
The released MUSIC-AVQA dataset is curated and may contain potential correlations between instruments and geographical areas. This issue warrants further research and consideration.
All datasets and benchmarks on this page are copyrighted by us and published under the Creative Commons Attribution-NonCommercial 4.0 International License. This means that you must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use. You may not use the material for commercial purposes.
To solve the AVQA problem, we propose a spatio-temporal grounding model to achieve scene understanding and reasoning over audio and visual modalities. To benchmark different models, we use answer prediction accuracy as the evaluation metric and evaluate how well different models answer different types of audio, visual, and audio-visual questions. More details are in the [Paper] and [Supplementary].
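As a rough illustration of this evaluation protocol, the sketch below computes answer accuracy overall and per question type, assuming model predictions keyed by question id and annotations in the format shown above; the function and variable names are placeholders, not the released evaluation code.

import ast
from collections import defaultdict

def evaluate(predictions, annotations):
    """Compute overall and per-question-type answer accuracy.

    predictions: dict mapping question_id -> predicted answer string.
    annotations: list of QA entries in the JSON format shown above.
    """
    correct, total = 0, 0
    per_type = defaultdict(lambda: [0, 0])  # question type -> [correct, total]
    for qa in annotations:
        qtype = tuple(ast.literal_eval(qa["type"]))  # e.g. ("Audio-Visual", "Temporal")
        hit = int(predictions.get(qa["question_id"]) == qa["anser"])
        correct += hit
        total += 1
        per_type[qtype][0] += hit
        per_type[qtype][1] += 1
    overall = correct / max(total, 1)
    by_type = {t: c / n for t, (c, n) in per_type.items()}
    return overall, by_type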
An overview of the proposed framework is illustrated in the figure below. Since a sound and the location of its visual source usually reflect the spatial association between the audio and visual modalities, a spatial grounding module that performs attention-based sound source localization is introduced to decompose complex scenarios into concrete audio-visual associations. To highlight the key timestamps that are closely associated with the question, we propose a temporal grounding module, which attends to critical temporal segments among the changing audio-visual scenes and captures question-aware audio and visual embeddings.
The proposed audio-visual question answering model uses pre-trained CNNs to extract audio and visual features and an LSTM to obtain a question embedding. We associate specific visual locations with the input sounds to perform spatial grounding, based on which the audio and visual features of key timestamps are further highlighted via the question query for temporal grounding. Finally, multimodal fusion integrates the audio, visual, and question information to predict the answer to the input question.
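To make the pipeline more concrete, the following is a minimal, illustrative sketch of attention-based spatial grounding (an audio feature attending over per-timestep visual map features) and question-guided temporal grounding. It is a simplification of the described framework under assumed feature shapes, not the released implementation; module names and dimensions are placeholders.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialGrounding(nn.Module):
    """Use the audio feature of each timestep as a query over visual map features."""
    def __init__(self, dim=512):
        super().__init__()
        self.audio_proj = nn.Linear(dim, dim)
        self.visual_proj = nn.Linear(dim, dim)

    def forward(self, audio, visual_map):
        # audio: (B, T, D); visual_map: (B, T, HW, D) spatial features per timestep.
        q = self.audio_proj(audio).unsqueeze(2)          # (B, T, 1, D)
        k = self.visual_proj(visual_map)                 # (B, T, HW, D)
        attn = F.softmax((q * k).sum(-1) / k.size(-1) ** 0.5, dim=-1)  # (B, T, HW)
        # Sound-aware visual feature: attention-weighted sum over spatial positions.
        return (attn.unsqueeze(-1) * visual_map).sum(2)  # (B, T, D)

class TemporalGrounding(nn.Module):
    """Highlight question-relevant timesteps of an audio or visual feature sequence."""
    def __init__(self, dim=512):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)

    def forward(self, question, sequence):
        # question: (B, D); sequence: (B, T, D) audio or grounded visual features.
        q = self.query_proj(question).unsqueeze(1)       # (B, 1, D)
        attn = F.softmax((q * sequence).sum(-1) / sequence.size(-1) ** 0.5, dim=-1)
        return (attn.unsqueeze(-1) * sequence).sum(1)    # (B, D)

The question-aware audio and visual embeddings produced this way would then be fused with the question embedding (e.g., by concatenation and an MLP classifier) to predict the answer.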
To study different input modalities and validate the effectiveness of the proposed model, we conduct extensive ablations of our model and compare it with recent QA approaches.
As shown in the table on the right, we observe that leveraging audio, visual, and question information boosts the AVQA task. The table below shows the audio-visual video question answering results of different methods on the test set of MUSIC-AVQA, with the top-2 results highlighted.
The results first demonstrate that all AVQA methods outperform A-, V-, and VideoQA methods, indicating that the AVQA task can be boosted through multisensory perception. Second, our method achieves considerable improvements on most audio and visual questions. For the audio-visual questions, which require spatial and temporal reasoning, our method is clearly superior to other methods on most question types, especially on answering the Counting and Location questions. Moreover, the results confirm the potential of our dataset as a testbed for audio-visual scene understanding.
We provide several visualized spatial grounding results. The heatmap indicates the location of the sounding source. Through the spatial grounding results, the sounding objects are visually captured, which facilitates spatial reasoning.
Visualized spatio-temporal grounding results. Based on the grounding results of our method, the sounding areas and the key timestamps are highlighted from the spatial and temporal perspectives (a-e), respectively, which indicates that our method can model the spatio-temporal associations across different modalities well, facilitating scene understanding and reasoning. In addition, subfigure (f) shows one failure case predicted by our method, where a complex scenario with multiple sounding and silent objects makes it difficult to correlate individual objects with the mixed sound, leading to a wrong answer for the given question.
We are a group of researchers working on computer vision at Renmin University of China and the University of Rochester.