Towards Long Form Audio-visual Video Understanding

    Wenxuan Hou1,†, Guangyao Li1,†, Yapeng Tian2, Di Hu1,*
    1Renmin University of China, 2UT Dallas

[Paper]


Why Long form Audio-visual Video Understanding?

We live in a world filled with never-ending streams of multimodal information. Videos captured from natural scenes have two typical characteristics: 1) Long form. They usually span several minutes and cover multiple related events in different categories, which jointly depict the main content of the video. 2) Audio-visual. Videos recorded in real-world scenarios usually comprise both audio and visual modalities. The two modalities are often asynchronous: each provides a unique perspective on the video content, yet together they facilitate video understanding.

We show an example of a long form audio-visual video, 121 seconds long, which records a badminton game. The audio modality contains three events: cheering, clapping, and speech. The visual modality contains five events: playing badminton, cheering, crying, laughing, and clapping. The events cheering and clapping appear in both the audio and the visual modality. These modality-aware events, together with their inherent relations, help to infer what happens in the video and thus to better understand its content. Considering the merits of the above two characteristics, we propose to study video understanding from both the long form and the audio-visual aspect, which we name long form audio-visual video understanding.


What is the multisensory temporal event localization task?

Task Definition. To achieve a better understanding of long form audio-visual videos, we propose to focus on the multisensory temporal event localization task, which essentially requires the model to predict the start and end time of each audio and visual event in the video. Concretely, we divide the video into several non-overlapping snippets, then predict the event categories of all snippets.
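The snippet-based formulation above can be illustrated with a minimal sketch. It assumes 1-second snippets and half-open event intervals [start, end); neither choice is stated on this page, and the function names are hypothetical. The first function turns event-level annotations into per-snippet multi-hot labels, and the second merges consecutive positive snippets back into start and end times.

```python
import math
from typing import List, Tuple

Event = Tuple[float, float, str]  # (start_sec, end_sec, category)


def events_to_snippet_labels(
    events: List[Event],
    video_length: float,
    categories: List[str],
    snippet_length: float = 1.0,  # assumed snippet duration, not from the page
) -> List[List[int]]:
    """Build a [num_snippets x num_categories] multi-hot label matrix."""
    num_snippets = math.ceil(video_length / snippet_length)
    index = {name: i for i, name in enumerate(categories)}
    labels = [[0] * len(categories) for _ in range(num_snippets)]
    for start, end, name in events:
        for t in range(num_snippets):
            snippet_start = t * snippet_length
            snippet_end = snippet_start + snippet_length
            # A snippet is positive if it overlaps the half-open event interval.
            if start < snippet_end and end > snippet_start:
                labels[t][index[name]] = 1
    return labels


def snippet_labels_to_events(
    labels: List[List[int]],
    categories: List[str],
    snippet_length: float = 1.0,
) -> List[Event]:
    """Merge runs of consecutive positive snippets into (start, end, category)."""
    events = []
    for c, name in enumerate(categories):
        run_start = None
        # Iterate one step past the end so that a trailing run is closed.
        for t in range(len(labels) + 1):
            active = t < len(labels) and labels[t][c] == 1
            if active and run_start is None:
                run_start = t
            elif not active and run_start is not None:
                events.append((run_start * snippet_length, t * snippet_length, name))
                run_start = None
    return events


# Toy example for one modality of a 6-second video.
toy_events = [(0, 3, "speech"), (2, 5, "clapping")]
toy_labels = events_to_snippet_labels(toy_events, 6, ["speech", "clapping"])
print(toy_labels)
print(snippet_labels_to_events(toy_labels, ["speech", "clapping"]))
```

In this sketch the same conversion would be applied separately to the audio and the visual annotations, since the task is modality-aware.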

Challenges. Firstly, the video contains multiple events with diverse categories, modalities, and varying lengths. Secondly, understanding the video content requires effectively modeling long-range dependencies and relations across different clips and modalities.

What is LFAV dataset?

To study the proposed multisensory temporal event localization task, we build a large-scale Long Form Audio-visual Video (LFAV) dataset with 5,175 videos, since existing datasets are not suited to the task. Information and highlights of the LFAV dataset are shown below.

Basic information

We collect videos from YouTube covering five kinds of daily-life scenes, chosen to reflect the diversity, complexity, and dynamics of the real world: human-related, sports, musical instruments, tools, and animals. We also construct a label set of 35 event categories covering these scenes.

Characteristics

  • 5,175 videos
  • Average length of 210 seconds and total length of 302 hours
  • Average of 3.15 event categories per video
  • Modality-aware annotations for all videos
  • Diversity, complexity, and dynamics
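As a quick sanity check, the reported total length is consistent with the number of videos and the average length. The sketch below only multiplies the two figures from the list above; because 210 seconds is a rounded average, the result is approximate.

```python
NUM_VIDEOS = 5175
AVERAGE_LENGTH_SECONDS = 210  # rounded average reported on this page

total_seconds = NUM_VIDEOS * AVERAGE_LENGTH_SECONDS
total_hours = total_seconds / 3600

# Roughly 302 hours, matching the reported total length.
print(f"{total_hours:.1f} hours")
```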

Personal data/Human subjects

Videos in LFAV are publicly available on YouTube and were annotated via crowdsourcing. We explained to the crowdworkers how the data would be used. Our dataset does not contain personally identifiable information or offensive content.

LFAV Dataset


Illustrations of our LFAV dataset statistics. (a-d) Statistical analysis of label categories: the distribution of the number of events in each video; the distribution of video length; the proportion of the top 4 event categories (speech, clapping, cheering, and laughing, which are the most common human actions); and the temporal proportion of events that occur in both modalities at the same time. (e) Second-order interactions between all labels; the thicker the line, the closer the association. (f) Distribution of dataset labels for each category.

Comparison with other datasets. Our LFAV dataset is collected for the proposed multisensory temporal event localization task and covers diversified domains. Specifically, the LFAV dataset offers modality-aware annotations for each video; that is, it indicates whether each event comes from the audio modality, the visual modality, or both. Multiple events with different semantic categories are also annotated per video, to better explore the relations among events. Videos in the dataset have an average length of 210 seconds and a total length of 302 hours. (* means LLP only provides modality-aware annotations in validation and testing sets.)


Video examples


Some video examples in the LFAV dataset. Each of them contains multiple modality-aware events.

Visual Event Labels
-6v1PqgFrDk 0 36 speech
-6v1PqgFrDk 20 415 playing_ping-pong
-6v1PqgFrDk 39 47 speech
-6v1PqgFrDk 52 59 speech
-6v1PqgFrDk 63 87 speech
-6v1PqgFrDk 94 103 speech
-6v1PqgFrDk 108 122 speech
-6v1PqgFrDk 127 187 speech
-6v1PqgFrDk 197 231 speech
-6v1PqgFrDk 240 251 speech
-6v1PqgFrDk 258 268 speech
-6v1PqgFrDk 280 362 speech
-6v1PqgFrDk 367 377 speech
-6v1PqgFrDk 380 415 speech
Visual Event Labels
DBejGH1-UCI 0 314 piano
DBejGH1-UCI 18 27 laughter
DBejGH1-UCI 34 65 laughter
DBejGH1-UCI 77 86 laughter
DBejGH1-UCI 111 154 laughter
DBejGH1-UCI 175 185 laughter
DBejGH1-UCI 208 216 laughter
DBejGH1-UCI 228 248 laughter
DBejGH1-UCI 267 303 laughter
DBejGH1-UCI 297 298 speech
DBejGH1-UCI 315 339 speech
DBejGH1-UCI 332 336 laughter
DBejGH1-UCI 342 346 laughter
DBejGH1-UCI 341 342 speech
DBejGH1-UCI 350 352 speech
DBejGH1-UCI 358 362 speech
DBejGH1-UCI 369 372 speech
DBejGH1-UCI 375 381 speech
DBejGH1-UCI 387 390 speech
DBejGH1-UCI 394 396 speech
DBejGH1-UCI 397 417 laughter
DBejGH1-UCI 402 403 speech
DBejGH1-UCI 406 408 speech
DBejGH1-UCI 410 414 speech
DBejGH1-UCI 418 423 speech
DBejGH1-UCI 428 430 speech
DBejGH1-UCI 431 433 speech
DBejGH1-UCI 431 435 laughter
Visual Event Labels
45RDZ3owtvY 0 23 helicopter
45RDZ3owtvY 23 107 chainsaw
45RDZ3owtvY 110 112 chainsaw
45RDZ3owtvY 151 166 chainsaw
45RDZ3owtvY 187 197 chainsaw
45RDZ3owtvY 212 235 chainsaw
45RDZ3owtvY 242 280 chainsaw
45RDZ3owtvY 288 294 chainsaw
45RDZ3owtvY 304 308 chainsaw
45RDZ3owtvY 316 320 chainsaw
45RDZ3owtvY 325 562 chainsaw
45RDZ3owtvY 385 390 speech
45RDZ3owtvY 590 595 helicopter
Audio Event Labels
-6v1PqgFrDk 0 36 speech
-6v1PqgFrDk 24 39 playing_ping-pong
-6v1PqgFrDk 39 47 speech
-6v1PqgFrDk 48 62 playing_ping-pong
-6v1PqgFrDk 52 59 speech
-6v1PqgFrDk 63 87 speech
-6v1PqgFrDk 74 83 playing_ping-pong
-6v1PqgFrDk 87 93 playing_ping-pong
-6v1PqgFrDk 94 103 speech
-6v1PqgFrDk 104 107 playing_ping-pong
-6v1PqgFrDk 108 122 speech
-6v1PqgFrDk 123 126 playing_ping-pong
-6v1PqgFrDk 127 187 speech
-6v1PqgFrDk 141 144 playing_ping-pong
-6v1PqgFrDk 159 162 playing_ping-pong
-6v1PqgFrDk 167 169 playing_ping-pong
-6v1PqgFrDk 174 177 playing_ping-pong
-6v1PqgFrDk 180 196 playing_ping-pong
-6v1PqgFrDk 197 231 speech
-6v1PqgFrDk 200 201 playing_ping-pong
-6v1PqgFrDk 230 238 playing_ping-pong
-6v1PqgFrDk 240 251 speech
-6v1PqgFrDk 252 257 playing_ping-pong
-6v1PqgFrDk 258 268 speech
-6v1PqgFrDk 269 279 playing_ping-pong
-6v1PqgFrDk 280 362 speech
-6v1PqgFrDk 283 287 playing_ping-pong
-6v1PqgFrDk 292 304 playing_ping-pong
-6v1PqgFrDk 357 366 playing_ping-pong
-6v1PqgFrDk 367 377 speech
-6v1PqgFrDk 378 394 playing_ping-pong
-6v1PqgFrDk 380 415 speech
Audio Event Labels
DBejGH1-UCI 0 3 cheering
DBejGH1-UCI 3 5 laughter
DBejGH1-UCI 1 3 speech
DBejGH1-UCI 5 9 speech
DBejGH1-UCI 10 13 speech
DBejGH1-UCI 8 14 cheering
DBejGH1-UCI 8 13 clapping
DBejGH1-UCI 17 20 laughter
DBejGH1-UCI 21 297 piano
DBejGH1-UCI 20 23 speech
DBejGH1-UCI 36 58 singing
DBejGH1-UCI 59 165 singing
DBejGH1-UCI 165 173 clapping
DBejGH1-UCI 177 285 singing
DBejGH1-UCI 286 302 clapping
DBejGH1-UCI 297 298 speech
DBejGH1-UCI 299 303 cheering
DBejGH1-UCI 301 329 speech
DBejGH1-UCI 330 340 speech
DBejGH1-UCI 331 334 laughter
DBejGH1-UCI 340 350 cheering
DBejGH1-UCI 341 343 laughter
DBejGH1-UCI 350 398 speech
DBejGH1-UCI 398 402 cheering
DBejGH1-UCI 402 404 speech
DBejGH1-UCI 404 406 cheering
DBejGH1-UCI 406 438 speech
DBejGH1-UCI 408 413 cheering
DBejGH1-UCI 419 438 clapping
DBejGH1-UCI 433 436 cheering
DBejGH1-UCI 436 438 speech
Audio Event Labels
45RDZ3owtvY 0 23 helicopter
45RDZ3owtvY 23 112 chainsaw
45RDZ3owtvY 112 136 helicopter
45RDZ3owtvY 136 445 chainsaw
45RDZ3owtvY 304 305 laughter
45RDZ3owtvY 387 391 speech
45RDZ3owtvY 456 457 laughter
45RDZ3owtvY 462 462 speech
45RDZ3owtvY 463 590 chainsaw
45RDZ3owtvY 590 595 helicopter

Visual Event Labels
08hzunIk81Y 7 31 bicycle
08hzunIk81Y 7 8 car
08hzunIk81Y 10 17 car
08hzunIk81Y 20 21 car
08hzunIk81Y 27 57 car
08hzunIk81Y 47 48 dance
08hzunIk81Y 52 53 dog
08hzunIk81Y 58 60 bicycle
08hzunIk81Y 59 60 car
08hzunIk81Y 61 62 speech
08hzunIk81Y 62 94 car
08hzunIk81Y 63 76 bicycle
08hzunIk81Y 80 81 bicycle
08hzunIk81Y 86 94 bicycle
08hzunIk81Y 92 93 speech
08hzunIk81Y 95 96 speech
08hzunIk81Y 96 104 bicycle
08hzunIk81Y 98 104 car
08hzunIk81Y 107 109 bicycle
08hzunIk81Y 107 109 car
08hzunIk81Y 111 121 bicycle
08hzunIk81Y 111 119 car
08hzunIk81Y 122 123 bicycle
08hzunIk81Y 125 126 bicycle
08hzunIk81Y 129 137 bicycle
08hzunIk81Y 131 136 car
08hzunIk81Y 136 137 laughter
08hzunIk81Y 138 151 bicycle
08hzunIk81Y 138 144 car
08hzunIk81Y 140 142 laughter
08hzunIk81Y 140 142 clapping
08hzunIk81Y 153 175 bicycle
08hzunIk81Y 155 156 car
08hzunIk81Y 160 163 car
08hzunIk81Y 171 173 car
08hzunIk81Y 174 175 speech
08hzunIk81Y 183 196 bicycle
08hzunIk81Y 185 186 speech
08hzunIk81Y 190 192 laughter
08hzunIk81Y 190 192 clapping
08hzunIk81Y 194 200 car
08hzunIk81Y 197 198 speech
08hzunIk81Y 203 204 speech
08hzunIk81Y 204 225 bicycle
08hzunIk81Y 209 213 car
08hzunIk81Y 217 225 car
08hzunIk81Y 226 227 laughter
08hzunIk81Y 227 228 bicycle
08hzunIk81Y 227 232 car
08hzunIk81Y 230 275 bicycle
08hzunIk81Y 246 271 car
08hzunIk81Y 276 288 bicycle
08hzunIk81Y 281 288 car
08hzunIk81Y 289 290 car
08hzunIk81Y 297 347 speech
08hzunIk81Y 297 347 car
08hzunIk81Y 300 305 dog
08hzunIk81Y 318 321 dog
08hzunIk81Y 335 336 bicycle
Visual Event Labels
DFYOtfJsXgM 5 11 playing_baseball
DFYOtfJsXgM 11 20 playing_badminton
DFYOtfJsXgM 20 50 speech
DFYOtfJsXgM 50 62 playing_baseball
DFYOtfJsXgM 62 77 playing_badminton
DFYOtfJsXgM 85 87 clapping
DFYOtfJsXgM 87 107 speech
DFYOtfJsXgM 108 114 playing_badminton
Visual Event Labels
_-sfoqUa0vs 0 2 violin
_-sfoqUa0vs 0 6 speech
_-sfoqUa0vs 3 4 laughter
_-sfoqUa0vs 7 10 clapping
_-sfoqUa0vs 11 15 violin
_-sfoqUa0vs 16 17 clapping
_-sfoqUa0vs 18 19 laughter
_-sfoqUa0vs 19 22 violin
_-sfoqUa0vs 23 26 clapping
_-sfoqUa0vs 27 29 violin
_-sfoqUa0vs 27 29 speech
_-sfoqUa0vs 32 35 violin
_-sfoqUa0vs 32 35 speech
_-sfoqUa0vs 39 60 violin
_-sfoqUa0vs 42 52 speech
_-sfoqUa0vs 60 61 laughter
_-sfoqUa0vs 62 65 violin
_-sfoqUa0vs 73 74 laughter
_-sfoqUa0vs 74 76 speech
_-sfoqUa0vs 75 76 violin
_-sfoqUa0vs 90 95 laughter
_-sfoqUa0vs 90 93 laughter
_-sfoqUa0vs 95 99 violin
_-sfoqUa0vs 103 106 laughter
_-sfoqUa0vs 107 117 violin
_-sfoqUa0vs 111 112 speech
_-sfoqUa0vs 113 117 speech
_-sfoqUa0vs 122 124 speech
_-sfoqUa0vs 122 141 violin
_-sfoqUa0vs 145 152 violin
_-sfoqUa0vs 153 154 laughter
_-sfoqUa0vs 155 161 violin
_-sfoqUa0vs 157 166 laughter
_-sfoqUa0vs 168 174 violin
_-sfoqUa0vs 175 180 laughter
_-sfoqUa0vs 178 180 violin
_-sfoqUa0vs 180 182 clapping
_-sfoqUa0vs 183 192 violin
_-sfoqUa0vs 197 203 violin
_-sfoqUa0vs 204 205 cry
_-sfoqUa0vs 206 209 violin
_-sfoqUa0vs 210 214 laughter
_-sfoqUa0vs 211 214 clapping
_-sfoqUa0vs 223 247 violin
_-sfoqUa0vs 248 250 laughter
_-sfoqUa0vs 251 255 violin
_-sfoqUa0vs 259 262 violin
_-sfoqUa0vs 263 264 laughter
_-sfoqUa0vs 265 268 violin
_-sfoqUa0vs 274 277 violin
_-sfoqUa0vs 277 278 laughter
_-sfoqUa0vs 279 316 violin
_-sfoqUa0vs 317 318 laughter
_-sfoqUa0vs 319 326 violin
_-sfoqUa0vs 327 328 laughter
_-sfoqUa0vs 329 333 violin
_-sfoqUa0vs 334 335 laughter
_-sfoqUa0vs 336 338 violin
_-sfoqUa0vs 339 340 laughter
_-sfoqUa0vs 342 343 cello
_-sfoqUa0vs 344 351 violin
_-sfoqUa0vs 352 354 cello
_-sfoqUa0vs 355 368 violin
_-sfoqUa0vs 369 370 laughter
_-sfoqUa0vs 371 373 violin
_-sfoqUa0vs 377 384 violin
_-sfoqUa0vs 385 386 laughter
_-sfoqUa0vs 387 394 violin
_-sfoqUa0vs 395 398 cello
_-sfoqUa0vs 399 403 violin
_-sfoqUa0vs 405 408 violin
_-sfoqUa0vs 409 413 clapping
_-sfoqUa0vs 420 431 clapping
_-sfoqUa0vs 420 421 cello
_-sfoqUa0vs 425 428 violin
_-sfoqUa0vs 432 436 violin
_-sfoqUa0vs 437 438 laughter
_-sfoqUa0vs 442 443 clapping
_-sfoqUa0vs 442 443 laughter
_-sfoqUa0vs 448 449 violin
_-sfoqUa0vs 452 454 laughter
_-sfoqUa0vs 452 453 clapping
_-sfoqUa0vs 455 457 violin
_-sfoqUa0vs 455 457 speech
_-sfoqUa0vs 459 460 clapping
_-sfoqUa0vs 461 483 violin
_-sfoqUa0vs 475 477 clapping
_-sfoqUa0vs 484 486 laughter
_-sfoqUa0vs 485 490 violin
_-sfoqUa0vs 490 495 laughter
_-sfoqUa0vs 492 514 violin
_-sfoqUa0vs 515 516 cello
_-sfoqUa0vs 515 522 clapping
_-sfoqUa0vs 523 524 violin
_-sfoqUa0vs 528 530 laughter
_-sfoqUa0vs 528 530 clapping
_-sfoqUa0vs 531 540 violin
_-sfoqUa0vs 541 542 speech
_-sfoqUa0vs 543 546 violin
_-sfoqUa0vs 547 548 drum
_-sfoqUa0vs 549 550 violin
_-sfoqUa0vs 551 552 drum
_-sfoqUa0vs 552 553 violin
_-sfoqUa0vs 554 556 drum
_-sfoqUa0vs 558 559 violin
_-sfoqUa0vs 567 569 violin
_-sfoqUa0vs 571 573 violin
_-sfoqUa0vs 576 578 violin
_-sfoqUa0vs 580 594 violin
_-sfoqUa0vs 592 594 laughter
_-sfoqUa0vs 599 608 violin
_-sfoqUa0vs 608 609 drum
_-sfoqUa0vs 612 613 violin
_-sfoqUa0vs 621 622 drum
_-sfoqUa0vs 622 623 violin
_-sfoqUa0vs 625 626 violin
_-sfoqUa0vs 633 644 horse
_-sfoqUa0vs 645 646 laughter
_-sfoqUa0vs 646 648 horse
_-sfoqUa0vs 649 650 violin
_-sfoqUa0vs 653 654 drum
_-sfoqUa0vs 653 654 clapping
_-sfoqUa0vs 653 656 horse
_-sfoqUa0vs 660 671 horse
_-sfoqUa0vs 666 667 drum
_-sfoqUa0vs 673 675 violin
_-sfoqUa0vs 676 677 drum
_-sfoqUa0vs 680 682 horse
_-sfoqUa0vs 683 687 violin
_-sfoqUa0vs 688 691 drum
_-sfoqUa0vs 692 693 violin
_-sfoqUa0vs 694 695 horse
_-sfoqUa0vs 698 699 violin
_-sfoqUa0vs 699 700 horse
_-sfoqUa0vs 704 707 horse
_-sfoqUa0vs 708 715 violin
_-sfoqUa0vs 716 730 horse
_-sfoqUa0vs 723 725 clapping
_-sfoqUa0vs 730 736 clapping
_-sfoqUa0vs 737 742 violin
Audio Event Labels
08hzunIk81Y 5 23 drum
08hzunIk81Y 23 24 cat
08hzunIk81Y 25 103 drum
08hzunIk81Y 103 104 car
08hzunIk81Y 104 106 speech
08hzunIk81Y 107 210 drum
08hzunIk81Y 131 135 cheering
08hzunIk81Y 136 137 laughter
08hzunIk81Y 139 143 cheering
08hzunIk81Y 140 143 clapping
08hzunIk81Y 150 151 cheering
08hzunIk81Y 177 181 speech
08hzunIk81Y 187 193 cheering
08hzunIk81Y 190 193 clapping
08hzunIk81Y 210 211 speech
08hzunIk81Y 218 292 drum
08hzunIk81Y 295 346 speech
Audio Event Labels
DFYOtfJsXgM 0 4 drum
DFYOtfJsXgM 5 12 playing_baseball
DFYOtfJsXgM 9 114 speech
DFYOtfJsXgM 12 20 playing_badminton
DFYOtfJsXgM 53 61 playing_baseball
DFYOtfJsXgM 62 87 playing_badminton
DFYOtfJsXgM 84 86 cheering
DFYOtfJsXgM 84 86 clapping
DFYOtfJsXgM 107 114 playing_badminton
DFYOtfJsXgM 111 114 cheering
Audio Event Labels
_-sfoqUa0vs 0 6 speech
_-sfoqUa0vs 6 32 cheering
_-sfoqUa0vs 6 37 clapping
_-sfoqUa0vs 27 39 speech
_-sfoqUa0vs 40 41 laughter
_-sfoqUa0vs 42 52 speech
_-sfoqUa0vs 53 55 speech
_-sfoqUa0vs 56 57 speech
_-sfoqUa0vs 63 64 speech
_-sfoqUa0vs 71 74 laughter
_-sfoqUa0vs 73 83 clapping
_-sfoqUa0vs 75 80 speech
_-sfoqUa0vs 85 86 speech
_-sfoqUa0vs 90 95 laughter
_-sfoqUa0vs 90 96 clapping
_-sfoqUa0vs 92 93 speech
_-sfoqUa0vs 96 97 speech
_-sfoqUa0vs 98 103 laughter
_-sfoqUa0vs 103 109 speech
_-sfoqUa0vs 102 105 clapping
_-sfoqUa0vs 111 112 speech
_-sfoqUa0vs 113 125 speech
_-sfoqUa0vs 126 208 violin
_-sfoqUa0vs 156 158 laughter
_-sfoqUa0vs 158 159 clapping
_-sfoqUa0vs 163 164 laughter
_-sfoqUa0vs 164 165 clapping
_-sfoqUa0vs 173 184 clapping
_-sfoqUa0vs 210 217 clapping
_-sfoqUa0vs 214 215 cheering
_-sfoqUa0vs 224 408 violin
_-sfoqUa0vs 247 249 laughter
_-sfoqUa0vs 248 251 clapping
_-sfoqUa0vs 334 335 clapping
_-sfoqUa0vs 407 464 clapping
_-sfoqUa0vs 408 450 cheering
_-sfoqUa0vs 451 453 laughter
_-sfoqUa0vs 455 458 speech
_-sfoqUa0vs 458 460 cheering
_-sfoqUa0vs 462 463 speech
_-sfoqUa0vs 468 509 violin
_-sfoqUa0vs 474 495 clapping
_-sfoqUa0vs 494 495 laughter
_-sfoqUa0vs 509 525 cheering
_-sfoqUa0vs 509 530 clapping
_-sfoqUa0vs 519 525 speech
_-sfoqUa0vs 528 529 laughter
_-sfoqUa0vs 529 537 speech
_-sfoqUa0vs 539 541 speech
_-sfoqUa0vs 542 559 drum
_-sfoqUa0vs 568 570 violin
_-sfoqUa0vs 572 574 violin
_-sfoqUa0vs 576 578 violin
_-sfoqUa0vs 580 624 violin
_-sfoqUa0vs 609 610 drum
_-sfoqUa0vs 609 629 singing
_-sfoqUa0vs 613 614 drum
_-sfoqUa0vs 617 618 drum
_-sfoqUa0vs 621 624 drum
_-sfoqUa0vs 630 640 clapping
_-sfoqUa0vs 634 730 drum
_-sfoqUa0vs 634 730 violin
_-sfoqUa0vs 634 730 cello
_-sfoqUa0vs 723 742 clapping

Visual Event Labels
9s0T5-rPOZo 22 32 alarm
9s0T5-rPOZo 37 40 bicycle
9s0T5-rPOZo 41 92 alarm
9s0T5-rPOZo 107 114 alarm
9s0T5-rPOZo 119 126 alarm
9s0T5-rPOZo 128 140 alarm
9s0T5-rPOZo 141 142 alarm
9s0T5-rPOZo 149 150 alarm
9s0T5-rPOZo 154 179 alarm
9s0T5-rPOZo 213 224 alarm
9s0T5-rPOZo 241 271 alarm
Visual Event Labels
AIFOZsn-bFE 0 9 speech
AIFOZsn-bFE 0 9 frisbee
AIFOZsn-bFE 9 14 dog
AIFOZsn-bFE 12 21 frisbee
AIFOZsn-bFE 14 40 speech
AIFOZsn-bFE 25 27 frisbee
AIFOZsn-bFE 40 109 frisbee
AIFOZsn-bFE 40 45 dog
AIFOZsn-bFE 45 46 speech
AIFOZsn-bFE 46 50 dog
AIFOZsn-bFE 50 52 speech
AIFOZsn-bFE 52 109 dog
AIFOZsn-bFE 56 78 speech
AIFOZsn-bFE 89 92 speech
AIFOZsn-bFE 94 100 speech
AIFOZsn-bFE 103 107 speech
AIFOZsn-bFE 109 109 clapping
AIFOZsn-bFE 110 112 speech
AIFOZsn-bFE 112 113 frisbee
AIFOZsn-bFE 112 113 dog
AIFOZsn-bFE 113 113 clapping
AIFOZsn-bFE 116 118 frisbee
AIFOZsn-bFE 116 119 dog
AIFOZsn-bFE 119 121 speech
AIFOZsn-bFE 122 124 speech
AIFOZsn-bFE 124 131 dog
AIFOZsn-bFE 125 131 frisbee
AIFOZsn-bFE 126 169 speech
Visual Event Labels
7YwxtA1MNuk 0 74 guitar
7YwxtA1MNuk 0 116 laughter
7YwxtA1MNuk 0 134 speech
7YwxtA1MNuk 14 72 singing
7YwxtA1MNuk 113 221 guitar
7YwxtA1MNuk 119 134 singing
7YwxtA1MNuk 147 219 singing
7YwxtA1MNuk 148 235 laughter
7YwxtA1MNuk 161 235 speech
7YwxtA1MNuk 227 235 cheering
7YwxtA1MNuk 227 235 clapping
Audio Event Labels
9s0T5-rPOZo 9 27 speech
9s0T5-rPOZo 28 147 alarm
9s0T5-rPOZo 117 118 speech
9s0T5-rPOZo 156 272 speech
9s0T5-rPOZo 252 283 piano
Audio Event Labels
AIFOZsn-bFE 0 116 speech
AIFOZsn-bFE 108 109 clapping
AIFOZsn-bFE 113 113 clapping
AIFOZsn-bFE 117 119 cheering
AIFOZsn-bFE 119 119 clapping
AIFOZsn-bFE 120 123 speech
AIFOZsn-bFE 125 169 speech
Audio Event Labels
7YwxtA1MNuk 0 13 guitar
7YwxtA1MNuk 13 16 singing
7YwxtA1MNuk 15 226 guitar
7YwxtA1MNuk 17 20 singing
7YwxtA1MNuk 22 29 singing
7YwxtA1MNuk 30 33 singing
7YwxtA1MNuk 34 37 singing
7YwxtA1MNuk 38 77 singing
7YwxtA1MNuk 78 81 singing
7YwxtA1MNuk 82 113 singing
7YwxtA1MNuk 119 122 singing
7YwxtA1MNuk 123 126 singing
7YwxtA1MNuk 127 138 singing
7YwxtA1MNuk 140 181 singing
7YwxtA1MNuk 182 184 singing
7YwxtA1MNuk 186 193 singing
7YwxtA1MNuk 194 218 singing
7YwxtA1MNuk 227 235 cheering
7YwxtA1MNuk 233 234 speech
7YwxtA1MNuk 227 235 clapping
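
The label listings above follow the pattern "video_id start end label", one event per line. The following is a minimal sketch of how such lines could be parsed and how the seconds in which an event occurs in both modalities could be counted. The helper names are hypothetical, and treating each event as covering the integer seconds from start up to, but not including, end is an assumption rather than a documented convention of the dataset.

```python
from typing import List, Set, Tuple

ParsedEvent = Tuple[str, int, int, str]  # (video_id, start, end, label)


def parse_event_lines(text: str) -> List[ParsedEvent]:
    """Parse lines of the form '<video_id> <start> <end> <label>'."""
    events = []
    for line in text.strip().splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # skip headers such as 'Visual Event Labels'
        video_id, start, end, label = parts
        events.append((video_id, int(start), int(end), label))
    return events


def seconds_covered(events: List[ParsedEvent], label: str) -> Set[int]:
    """Return the set of integer seconds in which `label` is active."""
    covered: Set[int] = set()
    for _, start, end, name in events:
        if name == label:
            covered.update(range(start, end))  # assumed half-open interval
    return covered


# A few lines copied from the listings above for video DFYOtfJsXgM.
visual_text = """
Visual Event Labels
DFYOtfJsXgM 5 11 playing_baseball
DFYOtfJsXgM 11 20 playing_badminton
DFYOtfJsXgM 62 77 playing_badminton
DFYOtfJsXgM 108 114 playing_badminton
"""
audio_text = """
Audio Event Labels
DFYOtfJsXgM 12 20 playing_badminton
DFYOtfJsXgM 62 87 playing_badminton
DFYOtfJsXgM 107 114 playing_badminton
"""

visual = seconds_covered(parse_event_lines(visual_text), "playing_badminton")
audio = seconds_covered(parse_event_lines(audio_text), "playing_badminton")
both = visual & audio
print(len(visual), len(audio), len(both))
```

Under these assumptions, the badminton event is visible for 30 seconds, audible for 40 seconds, and present in both modalities for 29 seconds, which illustrates how audio and visual events of the same category can be only partly aligned.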

Download

Dataset publicly available for research purposes

Data download


Extracted features:

Annotations (train, val and test set): Available for download at GitHub.

We provide video-level annotations for the training set, and both video-level and event-level annotations for the validation and testing sets. The annotation files are stored in CSV format.

  • Training set
    train_audio_weakly.csv: video-level audio annotations of the training set
    train_visual_weakly.csv: video-level visual annotations of the training set
    train_weakly.csv: video-level annotations (union of video-level audio annotations and visual annotations) of the training set

  • Validation set
    val_audio_weakly.csv: video-level audio annotations of the validation set
    val_visual_weakly.csv: video-level visual annotations of the validation set
    val_weakly_av.csv: video-level annotations (union of video-level audio annotations and visual annotations) of the validation set
    val_audio.csv: event-level audio annotations of the validation set
    val_visual.csv: event-level visual annotations of the validation set

  • Testing set
    test_audio_weakly.csv: video-level audio annotations of the testing set
    test_visual_weakly.csv: video-level visual annotations of the testing set
    test_weakly_av.csv: video-level annotations (union of video-level audio annotations and visual annotations) of the testing set
    test_audio.csv: event-level audio annotations of the testing set
    test_visual.csv: event-level visual annotations of the testing set
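
After downloading, it can help to check that all annotation files listed above are present. The sketch below is only an illustration: the file names are taken from the list above, while the directory name and the helper function are hypothetical. It does not read the files, because their column layout is not described on this page.

```python
from pathlib import Path
from typing import List

# File names as listed on this page, grouped by split.
ANNOTATION_FILES = {
    "train": [
        "train_audio_weakly.csv",
        "train_visual_weakly.csv",
        "train_weakly.csv",
    ],
    "val": [
        "val_audio_weakly.csv",
        "val_visual_weakly.csv",
        "val_weakly_av.csv",
        "val_audio.csv",
        "val_visual.csv",
    ],
    "test": [
        "test_audio_weakly.csv",
        "test_visual_weakly.csv",
        "test_weakly_av.csv",
        "test_audio.csv",
        "test_visual.csv",
    ],
}


def missing_annotation_files(root: str, split: str) -> List[str]:
    """Return the expected annotation files of `split` not found under `root`."""
    root_path = Path(root)
    return [name for name in ANNOTATION_FILES[split] if not (root_path / name).is_file()]


# "annotations" is a hypothetical local directory holding the downloaded CSVs.
for split in ("train", "val", "test"):
    print(split, missing_annotation_files("annotations", split))
```

Only the validation and testing splits include event-level files (for example val_audio.csv), which matches the weakly supervised setting described above: training uses video-level labels only.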


Publication(s)

If you find our work useful in your research, please cite our paper.

        
        @ARTICLE{hou2023towards,
          title={Towards Long Form Audio-visual Video Understanding},
          author={Hou, Wenxuan and Li, Guangyao and Tian, Yapeng and Hu, Di},
          journal={arXiv preprint arXiv:2306.09431},
          year={2023},
        }
        

Disclaimer

The released LFAV dataset is curated and may therefore contain correlations between instruments and geographical areas. This issue warrants further research and consideration.


Copyright

All datasets and benchmarks on this page are copyrighted by us and published under the Creative Commons Attribution-NonCommercial 4.0 International License. This means that you must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use. You may not use the material for commercial purposes.

The Team

We are a group of researchers working in computer vision at Renmin University of China and the University of Texas at Dallas.


Wenxuan Hou

PhD Student
(Sep 2022 - )
Renmin University of China

Guangyao Li

PhD Candidate
(Sep 2020 - )
Renmin University of China

Yapeng Tian

Assistant Professor
University of Texas at Dallas

Di Hu

Assistant Professor
Renmin University of China