Learning to Answer Questions in Dynamic Audio-Visual Scenarios

[Oral Presentation, CVPR2022]

    Guangyao Li1,†, Yake Wei1,†, Yapeng Tian2,†, Chenliang Xu2, Ji-Rong Wen1, Di Hu1,*
    1Renmin University of China, 2University of Rochester

[Paper]  [Supplementary]  [arXiv]  [Poster]  [Video-YouTube]  [Video-Bilibili]  [Code]


  • 01 Jun 2022: The dataset has been uploaded to Google Drive. You are welcome to download and use it!
  • 28 Mar 2022: Camera-ready version has been released here!
  • 22 Mar 2022: The MUSIC-AVQA dataset has been released, please see Download for details.
  • 18 Mar 2022: Code has been released here!
  • 08 Mar 2022: Watch the project's video demonstration on YouTube or Bilibili.
  • 02 Mar 2022: Our paper has been accepted for publication at CVPR2022. The camera-ready version and code will be released soon!

What is the AVQA task?

We are surrounded by audio and visual signals in daily life, and the two modalities jointly improve our ability to perceive and understand scenes. For instance, at a concert, watching the performance while listening to the music makes for better enjoyment of the show. Inspired by this, an interesting and valuable question is how to make machines integrate multimodal information, especially natural modalities such as audio and vision, to achieve human-like scene perception and understanding. We focus on the audio-visual question answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos. The problem requires comprehensive multimodal understanding and spatio-temporal reasoning over audio-visual scenes.

For instance, as shown in the figure, answering the audio-visual question “Which clarinet makes the sound first” for this instrumental ensemble requires locating the sounding “clarinet” objects in the audio-visual scene and focusing on the “first” sounding “clarinet” in the timeline. To answer the question correctly, both effective audio-visual scene understanding and spatio-temporal reasoning are essential.

Audio-visual question answering requires both auditory and visual modalities for multimodal scene understanding and spatio-temporal reasoning. For example, in the complex musical performance scene above, which involves multiple sounding and non-sounding instruments, a VQA model that considers only the visual modality can hardly analyze the “sound first” term in the question. Likewise, an AQA model that considers only mono sound can hardly recognize the left or right position. Using both auditory and visual modalities, however, the question can be answered effortlessly.

What is the MUSIC-AVQA dataset?

To explore scene understanding and spatio-temporal reasoning over audio and visual modalities, we build a large-scale audio-visual dataset, MUSIC-AVQA, which focuses on the question-answering task. High-quality datasets are of considerable value for AVQA research.

Why musical performance? Musical performance is a typical multimodal scene consisting of abundant audio and visual components as well as their interactions, so it is well suited to exploring effective audio-visual scene understanding and reasoning.

Basic information

We manually collected a large number of musical performance videos from YouTube. Specifically, 22 kinds of instruments, such as guitar, cello, and xylophone, are selected, and 9 audio-visual question types are designed accordingly, covering three different scenarios, i.e., audio, visual and audio-visual. Annotations are collected using our novel GSAI-Labeled system.


  • 3 typical multimodal scenes
  • 22 kinds of instruments
  • 4 categories: String, Wind, Percussion and Keyboard.
  • 9,290 videos for over 150 hours
  • 7,423 real videos
  • 1,867 synthetic videos
  • 9 audio-visual question types
  • 45,867 question-answer pairs
  • Diverse, complex and dynamic

Personal data/Human subjects

Videos in MUSIC-AVQA are publicly available on YouTube and were annotated via crowdsourcing. We explained to the crowdworkers how the data would be used. Our dataset does not contain personally identifiable information or offensive content.


Some graphical representations of our dataset and annotations

Illustrations of our MUSIC-AVQA dataset statistics. (a-d) Statistical analysis of the videos and QA pairs. (e) Question formulas. (f) Distribution of question templates, where the dark color indicates the number of QA pairs generated from real videos, while the light-colored area on the upper part of each bar indicates the number generated from synthetic videos. (g) Distribution of first n-grams in questions. Solving our QA pairs requires fine-grained scene understanding and spatio-temporal reasoning over audio and visual modalities. For example, existential and location questions require spatial reasoning, and temporal questions require temporal reasoning. Best viewed in color.

Comparison with other video QA datasets. Our MUSIC-AVQA dataset focuses on the interaction between visual objects and the sounds they produce, offering QA pairs that cover audio questions, visual questions and audio-visual questions, which makes it more comprehensive than other datasets. The collected videos in MUSIC-AVQA can facilitate audio-visual understanding in terms of spatial and temporal associations.

How was the MUSIC-AVQA dataset made?

We designed an audio-visual question answering labeling system to collect questions, and all QA pairs were collected with this system. The flow chart of the labeling system is shown in the figure below.

The labeling system consists of a questioning part and an answering part. In the questioning part, the annotator selects the performance type of the video and the instruments it includes, then the scene type, question type, and question template; finally, one question is automatically generated based on these selections. In the answering part, the annotator judges whether the question is reasonable; if it is not, the question is labeled again. The annotator then answers the question according to the video content, which produces one QA pair.

QA pairs samples

Demo. MUSIC-AVQA is a large-scale spatio-temporal audio-visual dataset that focuses on the question-answering task. The figure below shows different audio-visual scene types and their annotated QA pairs in the dataset.

In the first row, a), b), and c) show real musical performance videos, namely a solo, an ensemble of the same instrument, and an ensemble of different instruments. In the second row, d), e), and f) show synthetic videos, created by random audio-video matching, audio overlay, and video stitching, respectively.

More video examples

Some video examples with QA pairs from the MUSIC-AVQA dataset. These examples give a better understanding of the dataset and a more intuitive feel for the QA tasks in dynamic and complex audio-visual scenes.

Question: How many instruments are sounding in the video?
Answer: two
To answer the question, an AVQA model needs to first identify objects and sound sources in the video, and then count all sounding objects. Although there are three different sound sources in the audio modality, only two of them are visible. Rather than simply counting all audio and visual instances, exploiting audio-visual association is important for AVQA.
Question: What is the first sounding instrument?
Answer: piano
To answer the question, an AVQA model needs to not only associate all instruments and their corresponding sounds in the video, but also identify the first instrument that makes sounds. Thus, the AVQA task is not a simple recognition problem, and it also involves audio-visual association and temporal reasoning.
Question: What is the left instrument of the second sounding instrument?
Answer: guzheng
To answer the question, an AVQA model needs to first identify the second sounding instrument (the flute) and then infer the instrument to its left. Besides recognizing objects, exploring audio-visual association, and performing temporal reasoning, spatial reasoning is also crucial for AVQA.


Dataset publicly available for research purposes

Data and Download

Raw videos:

  • Google Drive
  • Baidu Drive (password: cvpr)
      - Real videos (36.67GB)
      - Synthetic videos (11.59GB)
    Note: Please move all downloaded videos into a single folder (for example, a new folder named MUSIC-AVQA-Videos), which should then contain the 9,288 real and synthetic videos.

Raw video frames (1 fps): Available at Baidu Drive (14.84GB) (pwd: cvpr) or Google Drive (coming soon!). It may be more convenient to extract the video frames yourself by running the released code.
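For readers who prefer to extract the 1 fps frames locally, the following is a minimal sketch that calls ffmpeg with its standard fps filter. It is not the authors' extraction script: the .mp4 extension, the output folder layout, and the frame file naming are assumptions made for illustration, and ffmpeg must be installed separately.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(video_path, out_root, fps=1):
    """Return the ffmpeg argument list that samples `fps` frames per second."""
    video_path = Path(video_path)
    # Assumption: one sub-folder per video, named after the file name without extension.
    frame_dir = Path(out_root) / video_path.stem
    # Assumption: frames are numbered 000001.jpg, 000002.jpg, ...
    pattern = frame_dir / "%06d.jpg"
    return ["ffmpeg", "-i", str(video_path), "-vf", f"fps={fps}", str(pattern)]


def extract_frames(video_path, out_root, fps=1):
    """Create the output folder and run ffmpeg (requires ffmpeg on PATH)."""
    cmd = build_ffmpeg_command(video_path, out_root, fps)
    Path(cmd[-1]).parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(cmd, check=True)
```

Calling extract_frames("MUSIC-AVQA-Videos/00000272.mp4", "frames") would write the sampled frames into frames/00000272/, assuming the video file exists under that name.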

Features (VGGish, ResNet18 and R(2+1)D):

Annotations (QA pairs, etc.): Available for download at GitHub.

How to read the annotation files?

The annotation files are stored in JSON format. Each annotation file uses seven keys: "video_id", "question_id", "type", "question_content", "templ_values", "question_deleted" and "anser". Below, we explain each key in detail.

  • "type": the question's modality information and type.
  • "question_id": the unique identifier of the QA pair.
  • "video_id", "question_content", "templ_values" and "anser": Together, these keys construct the QA pair for the video with id "video_id". A token of the form <FL> is a template word in the question, and its specific content is given in "templ_values". See the paper for a more detailed description of the question templates.
  • "question_deleted": A check code used during the data annotation process; it is not used in the dataloader.

Below, we show an example entry from the JSON file:

          {
          "video_id": "00000272",
          "question_id": 50,
          "type": "[\"Audio-Visual\", \"Temporal\"]",
          "question_content": "Where is the <FL> sounding instrument?",
          "templ_values": "[\"first\"]",
          "question_deleted": 0,
          "anser": "right"
          }

The example shows the information and annotations of the QA pair for the video with id "00000272". The QA pair is assigned the unique identifier "50". From the entry, we can retrieve the video name, the question, the answer and the type to which the QA pair belongs. For example, answering the temporal question above correctly requires combining audio and visual modality information.
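As a hedged illustration of reading such an entry, the sketch below decodes the string-encoded "type" and "templ_values" fields and fills the template words into the question. The function names are hypothetical, and two assumptions go beyond what is stated here: that every template word has the form <...> and that "templ_values" lists the replacements in order of appearance, and that an annotation file is a JSON list of entries.

```python
import json
import re


def parse_entry(entry):
    """Turn one raw annotation entry into a filled-in QA record."""
    # "type" and "templ_values" are JSON lists stored as strings, so decode them again.
    question_type = json.loads(entry["type"])
    values = iter(json.loads(entry["templ_values"]))
    # Assumption: placeholders look like <FL> and are filled in order of appearance.
    # A placeholder without a matching value is left unchanged.
    question = re.sub(r"<[^>]+>",
                      lambda m: next(values, m.group(0)),
                      entry["question_content"])
    return {
        "video_id": entry["video_id"],
        "question_id": entry["question_id"],
        "type": question_type,
        "question": question,
        "answer": entry["anser"],  # key spelled "anser", as in the entry shown above
    }


def load_annotations(path):
    """Assumption: the file holds a JSON list of entries like the one above."""
    with open(path, encoding="utf-8") as f:
        return [parse_entry(e) for e in json.load(f)]


example = {
    "video_id": "00000272",
    "question_id": 50,
    "type": '["Audio-Visual", "Temporal"]',
    "question_content": "Where is the <FL> sounding instrument?",
    "templ_values": '["first"]',
    "question_deleted": 0,
    "anser": "right",
}
parsed = parse_entry(example)
print(parsed["question"], "->", parsed["answer"])
# prints: Where is the first sounding instrument? -> right
```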


If you find our work useful in your research, please cite our CVPR 2022 paper.

          title={Learning to Answer Questions in Dynamic Audio-Visual Scenarios},
          author={Guangyao Li, Yake Wei, Yapeng Tian, Chenliang Xu, Ji-Rong Wen, Di Hu},
          journal   = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
          year      = {2022},


The released MUSIC-AVQA dataset is curated and may therefore contain potential correlations between instruments and geographical areas. This issue warrants further research and consideration.

Copyright Creative Commons License

All datasets and benchmarks on this page are copyrighted by us and published under the Creative Commons Attribution-NonCommercial 4.0 International License. This means that you must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use. You may not use the material for commercial purposes.

A Simple Baseline for MUSIC-AVQA

Audio-visual spatial-temporal MUSIC-AVQA method, experimental results and simple analysis

To solve the AVQA problem, we propose a spatio-temporal grounding model that achieves scene understanding and reasoning over audio and visual modalities. To benchmark different models, we use answer prediction accuracy as the evaluation metric and evaluate their performance on different types of audio, visual, and audio-visual questions. More details can be found in the [Paper] and [Supplementary].
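To make the metric concrete, here is a small sketch of answer prediction accuracy broken down by modality and question type. It is not the released evaluation script: the function name and record layout are hypothetical, and the four records are toy values invented for illustration, not results from the paper.

```python
from collections import defaultdict


def accuracy_by_type(records):
    """records: iterable of (modality, question_type, predicted, ground_truth).

    Returns accuracy in percent per (modality, question_type), per modality
    under (modality, "Avg"), and overall under ("All", "Avg").
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for modality, qtype, predicted, truth in records:
        for key in ((modality, qtype), (modality, "Avg"), ("All", "Avg")):
            total[key] += 1
            correct[key] += int(predicted == truth)
    return {key: 100.0 * correct[key] / total[key] for key in total}


# Toy records, invented for illustration only.
toy = [
    ("Audio", "Counting", "two", "two"),
    ("Visual", "Location", "left", "right"),
    ("Audio-Visual", "Temporal", "right", "right"),
    ("Audio-Visual", "Temporal", "piano", "flute"),
]
scores = accuracy_by_type(toy)
print(scores[("Audio-Visual", "Temporal")])  # prints 50.0
print(scores[("All", "Avg")])                # prints 50.0
```

Because accuracy is an exact string match here, predicted and ground-truth answers should be normalized the same way before comparison.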

Spatio-temporal Grounding Model

An overview of the proposed framework is illustrated in the figure below. Since a sound and the location of its visual source usually reflect the spatial association between the audio and visual modalities, we introduce a spatial grounding module, which performs attention-based sound source localization, to decompose complex scenarios into concrete audio-visual associations. To highlight the key timestamps that are closely associated with the question, we propose a temporal grounding module, which is designed to attend to critical temporal segments among the changing audio-visual scenes and to capture question-aware audio and visual embeddings.

The proposed audio-visual question answering model uses pre-trained CNNs to extract audio and visual features and an LSTM to obtain a question embedding. We associate specific visual locations with the input sounds to perform spatial grounding, based on which audio and visual features of key timestamps are further highlighted via question query for temporal grounding. Finally, multimodal fusion is exploited to integrate audio, visual, and question information for predicting the answer to the input question.
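The two grounding steps can be pictured with plain dot-product attention. The sketch below is a simplified stand-in, not the architecture from the paper: it assumes all features share one dimension, uses scaled dot-product scores, and omits learned projections and the final fusion. All function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def attend(query, keys):
    """Scaled dot-product attention: returns (weights, weighted sum of keys)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    pooled = [sum(w * key[d] for w, key in zip(weights, keys))
              for d in range(len(query))]
    return weights, pooled


def ground(audio_per_t, visual_maps_per_t, question):
    """audio_per_t: T audio vectors; visual_maps_per_t: T lists of location vectors."""
    # Spatial grounding: at each timestamp the audio feature attends over
    # visual locations, so locations matching the sound get higher weight.
    grounded_visual = [attend(a, vmap)[1]
                       for a, vmap in zip(audio_per_t, visual_maps_per_t)]
    # Temporal grounding: the question embedding attends over timestamps
    # of both the audio and the spatially grounded visual features.
    t_weights, visual_emb = attend(question, grounded_visual)
    _, audio_emb = attend(question, audio_per_t)
    return audio_emb, visual_emb, t_weights


# Toy check: the location aligned with the audio vector receives more weight.
weights, _ = attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

In the toy check, weights[0] is larger than weights[1], which mirrors how a heatmap would peak on the sounding region.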


To study different input modalities and validate the effectiveness of the proposed model, we conduct extensive ablations of our model and compare it with recent QA approaches.

As shown in the table on the right, leveraging audio, visual, and question information together boosts the AVQA task. The table below shows the audio-visual video question answering results of different methods on the test set of MUSIC-AVQA, with the top-2 results highlighted.

The results first demonstrate that all AVQA methods outperform the A-, V- and VideoQA methods, which indicates that the AVQA task benefits from multisensory perception. Second, our method achieves considerable improvement on most audio and visual questions. For audio-visual questions, which require spatial and temporal reasoning, our method is clearly superior to other methods on most question types, especially the Counting and Location questions. Moreover, the results confirm the potential of our dataset as a testbed for audio-visual scene understanding.

Visualized spatio-temporal grounding results

We provide several visualized spatial grounding results. The heatmap indicates the location of the sounding source. Through the spatial grounding results, the sounding objects are visually captured, which facilitates spatial reasoning.

Visualized spatio-temporal grounding results. Based on the grounding results of our method, the sounding area and key timestamps are highlighted from the spatial and temporal perspectives (a-e), respectively. This indicates that our method models the spatio-temporal association across modalities well, facilitating scene understanding and reasoning. In addition, subfigure (f) shows one failure case of our method, where the complex scenario with multiple sounding and silent objects makes it difficult to correlate individual objects with the mixed sound, leading to a wrong answer to the given question.

The Team

We are a group of researchers working in computer vision at Renmin University of China and the University of Rochester.

Guangyao Li

PhD Candidate
(Sep 2020 - )
Renmin University of China

Yake Wei

PhD Candidate
(Sep 2021 - )
Renmin University of China

Yapeng Tian

PhD Candidate
(Sep 2017 - )
University of Rochester

Chenliang Xu

Assistant Professor
University of Rochester

Ji-Rong Wen

Renmin University of China

Di Hu

Assistant Professor
Renmin University of China


  • This research was supported by the Public Computing Cloud, Renmin University of China.
  • The design of this web page was inspired by the official EPIC website.