We live in a world filled with never-ending streams of multimodal information. Videos captured in natural scenes have two typical characteristics: 1) Long form. They usually span several minutes and cover multiple related events of different categories, which jointly depict the main content of the video. 2) Audio-visual. Videos recorded in real-world scenarios usually comprise both audio and visual modalities. The two modalities are often asynchronous, each providing a unique perspective on the video content while collaboratively facilitating video understanding.
We show an example of a long form audio-visual video with a length of 121 seconds. The video shows a badminton game. The audio modality contains three events: cheering, clapping, and speech; the visual modality contains five events: playing badminton, cheering, crying, laughing, and clapping. The events cheering and clapping appear in both the audio and visual modalities. These modality-aware events, as well as their inherent relations, help to effectively infer what happens in the video and thus achieve a better understanding of the video content. Considering the merits of the above two characteristics, we propose to study video understanding in terms of the long form and audio-visual aspects, which we name long form audio-visual video understanding.
Task Definition. To achieve a better understanding of long form audio-visual videos, we propose to focus on the multisensory temporal event localization task, which essentially requires the model to predict the start and end time of each audio and visual event in a video. Concretely, we divide the video into several non-overlapping snippets and then predict the event categories of all snippets.
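As an illustration, below is a minimal decoding sketch in Python, assuming one-second snippets and per-snippet multi-label probabilities; the function and variable names are hypothetical, not from the released code.

import numpy as np

def decode_events(snippet_probs, labels, threshold=0.5):
    # Merge consecutive positive snippets into (label, start, end) events.
    # snippet_probs: (T, C) per-snippet class probabilities, one snippet per second.
    events = []
    positive = snippet_probs >= threshold  # (T, C) boolean mask
    T = positive.shape[0]
    for c, label in enumerate(labels):
        t = 0
        while t < T:
            if positive[t, c]:
                start = t
                while t < T and positive[t, c]:  # extend the current event
                    t += 1
                events.append((label, start, t))  # interval [start, end) in seconds
            else:
                t += 1
    return events

# Hypothetical usage: 10 one-second snippets, 2 event categories.
probs = np.random.rand(10, 2)
print(decode_events(probs, ["speech", "clapping"]))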
Challenges. Firstly, a video contains multiple events with diverse categories, modalities, and varying lengths. Secondly, understanding the video content requires effectively modeling long-range dependencies and relations across different clips and modalities.
To study the proposed multisensory temporal event localization task, we carefully build a large-scale Long Form Audio-visual Video (LFAV) dataset with 5,175 videos, as existing datasets are not appropriate for our proposed task. Information and highlights of the LFAV dataset are shown below.
We collect videos from YouTube covering five kinds of daily-life scenarios, human-related, sports, musical instruments, tools, and animals, to ensure the diversity, complexity, and dynamics of the real world. We also construct a label set of 35 event categories covering the above scenes.
Videos in LFAV are publicly available on YouTube and annotated via crowdsourcing. We explained to the crowdworkers how the data would be used. Our dataset does not contain personally identifiable information or offensive content.
Illustrations of our LFAV dataset statistics. (a-d) Statistical analysis of label categories, including: the distribution of the number of events per video; the distribution of video lengths; the proportion of the top-4 event categories (speech, clapping, cheering, and laughing, the most common human actions); and the temporal proportion of events that occur in both modalities at the same time. (e) Second-order interactions between all labels; the thicker the line, the closer the association. (f) Distribution of dataset labels in each category.
Comparison with other datasets.
Our LFAV dataset is collected for the proposed multisensory temporal event localization task and covers diversified domains. Specifically, the LFAV dataset offers modality-aware annotations for each video, that is, it indicates whether each event comes from the audio modality, the visual modality, or both. Meanwhile, multiple events with different semantic categories are annotated per video, enabling better exploration of the relations among events. Videos in the dataset have an average length of 210 seconds and a total length of 302 hours.
† means the EPIC-KITCHENS-100 dataset contains two classification tasks, whose label categories number 300 and 97, respectively.
‡ means the LVU dataset contains 7 classification tasks, each with no more than 9 label categories.
* means the LLP dataset only provides modality-aware annotations in the validation and testing sets.
Some video examples in the LFAV dataset. Each of them contains multiple modality-aware events.
Visual Event Labels
-6v1PqgFrDk 0 36 speech -6v1PqgFrDk 20 415 playing_ping-pong -6v1PqgFrDk 39 47 speech -6v1PqgFrDk 52 59 speech -6v1PqgFrDk 63 87 speech -6v1PqgFrDk 94 103 speech -6v1PqgFrDk 108 122 speech -6v1PqgFrDk 127 187 speech -6v1PqgFrDk 197 231 speech -6v1PqgFrDk 240 251 speech -6v1PqgFrDk 258 268 speech -6v1PqgFrDk 280 362 speech -6v1PqgFrDk 367 377 speech -6v1PqgFrDk 380 415 speech |
Visual Event Labels
DBejGH1-UCI 0 314 piano DBejGH1-UCI 18 27 laughter DBejGH1-UCI 34 65 laughter DBejGH1-UCI 77 86 laughter DBejGH1-UCI 111 154 laughter DBejGH1-UCI 175 185 laughter DBejGH1-UCI 208 216 laughter DBejGH1-UCI 228 248 laughter DBejGH1-UCI 267 303 laughter DBejGH1-UCI 297 298 speech DBejGH1-UCI 315 339 speech DBejGH1-UCI 332 336 laughter DBejGH1-UCI 342 346 laughter DBejGH1-UCI 341 342 speech DBejGH1-UCI 350 352 speech DBejGH1-UCI 358 362 speech DBejGH1-UCI 369 372 speech DBejGH1-UCI 375 381 speech DBejGH1-UCI 387 390 speech DBejGH1-UCI 394 396 speech DBejGH1-UCI 397 417 laughter DBejGH1-UCI 402 403 speech DBejGH1-UCI 406 408 speech DBejGH1-UCI 410 414 speech DBejGH1-UCI 418 423 speech DBejGH1-UCI 428 430 speech DBejGH1-UCI 431 433 speech DBejGH1-UCI 431 435 laughter |
Visual Event Labels
45RDZ3owtvY 0 23 helicopter 45RDZ3owtvY 23 107 chainsaw 45RDZ3owtvY 110 112 chainsaw 45RDZ3owtvY 151 166 chainsaw 45RDZ3owtvY 187 197 chainsaw 45RDZ3owtvY 212 235 chainsaw 45RDZ3owtvY 242 280 chainsaw 45RDZ3owtvY 288 294 chainsaw 45RDZ3owtvY 304 308 chainsaw 45RDZ3owtvY 316 320 chainsaw 45RDZ3owtvY 325 562 chainsaw 45RDZ3owtvY 385 390 speech 45RDZ3owtvY 590 595 helicopter |
Audio Event Labels
-6v1PqgFrDk 0 36 speech -6v1PqgFrDk 24 39 playing_ping-pong -6v1PqgFrDk 39 47 speech -6v1PqgFrDk 48 62 playing_ping-pong -6v1PqgFrDk 52 59 speech -6v1PqgFrDk 63 87 speech -6v1PqgFrDk 74 83 playing_ping-pong -6v1PqgFrDk 87 93 playing_ping-pong -6v1PqgFrDk 94 103 speech -6v1PqgFrDk 104 107 playing_ping-pong -6v1PqgFrDk 108 122 speech -6v1PqgFrDk 123 126 playing_ping-pong -6v1PqgFrDk 127 187 speech -6v1PqgFrDk 141 144 playing_ping-pong -6v1PqgFrDk 159 162 playing_ping-pong -6v1PqgFrDk 167 169 playing_ping-pong -6v1PqgFrDk 174 177 playing_ping-pong -6v1PqgFrDk 180 196 playing_ping-pong -6v1PqgFrDk 197 231 speech -6v1PqgFrDk 200 201 playing_ping-pong -6v1PqgFrDk 230 238 playing_ping-pong -6v1PqgFrDk 240 251 speech -6v1PqgFrDk 252 257 playing_ping-pong -6v1PqgFrDk 258 268 speech -6v1PqgFrDk 269 279 playing_ping-pong -6v1PqgFrDk 280 362 speech -6v1PqgFrDk 283 287 playing_ping-pong -6v1PqgFrDk 292 304 playing_ping-pong -6v1PqgFrDk 357 366 playing_ping-pong -6v1PqgFrDk 367 377 speech -6v1PqgFrDk 378 394 playing_ping-pong -6v1PqgFrDk 380 415 speech |
Audio Event Labels
DBejGH1-UCI 0 3 cheering DBejGH1-UCI 3 5 laughter DBejGH1-UCI 1 3 speech DBejGH1-UCI 5 9 speech DBejGH1-UCI 10 13 speech DBejGH1-UCI 8 14 cheering DBejGH1-UCI 8 13 clapping DBejGH1-UCI 17 20 laughter DBejGH1-UCI 21 297 piano DBejGH1-UCI 20 23 speech DBejGH1-UCI 36 58 singing DBejGH1-UCI 59 165 singing DBejGH1-UCI 165 173 clapping DBejGH1-UCI 177 285 singing DBejGH1-UCI 286 302 clapping DBejGH1-UCI 297 298 speech DBejGH1-UCI 299 303 cheering DBejGH1-UCI 301 329 speech DBejGH1-UCI 330 340 speech DBejGH1-UCI 331 334 laughter DBejGH1-UCI 340 350 cheering DBejGH1-UCI 341 343 laughter DBejGH1-UCI 350 398 speech DBejGH1-UCI 398 402 cheering DBejGH1-UCI 402 404 speech DBejGH1-UCI 404 406 cheering DBejGH1-UCI 406 438 speech DBejGH1-UCI 408 413 cheering DBejGH1-UCI 419 438 clapping DBejGH1-UCI 433 436 cheering DBejGH1-UCI 436 438 speech |
Audio Event Labels
45RDZ3owtvY 0 23 helicopter 45RDZ3owtvY 23 112 chainsaw 45RDZ3owtvY 112 136 helicopter 45RDZ3owtvY 136 445 chainsaw 45RDZ3owtvY 304 305 laughter 45RDZ3owtvY 387 391 speech 45RDZ3owtvY 456 457 laughter 45RDZ3owtvY 462 462 speech 45RDZ3owtvY 463 590 chainsaw 45RDZ3owtvY 590 595 helicopter |
Visual Event Labels
08hzunIk81Y 7 31 bicycle 08hzunIk81Y 7 8 car 08hzunIk81Y 10 17 car 08hzunIk81Y 20 21 car 08hzunIk81Y 27 57 car 08hzunIk81Y 47 48 dance 08hzunIk81Y 52 53 dog 08hzunIk81Y 58 60 bicycle 08hzunIk81Y 59 60 car 08hzunIk81Y 61 62 speech 08hzunIk81Y 62 94 car 08hzunIk81Y 63 76 bicycle 08hzunIk81Y 80 81 bicycle 08hzunIk81Y 86 94 bicycle 08hzunIk81Y 92 93 speech 08hzunIk81Y 95 96 speech 08hzunIk81Y 96 104 bicycle 08hzunIk81Y 98 104 car 08hzunIk81Y 107 109 bicycle 08hzunIk81Y 107 109 car 08hzunIk81Y 111 121 bicycle 08hzunIk81Y 111 119 car 08hzunIk81Y 122 123 bicycle 08hzunIk81Y 125 126 bicycle 08hzunIk81Y 129 137 bicycle 08hzunIk81Y 131 136 car 08hzunIk81Y 136 137 laughter 08hzunIk81Y 138 151 bicycle 08hzunIk81Y 138 144 car 08hzunIk81Y 140 142 laughter 08hzunIk81Y 140 142 clapping 08hzunIk81Y 153 175 bicycle 08hzunIk81Y 155 156 car 08hzunIk81Y 160 163 car 08hzunIk81Y 171 173 car 08hzunIk81Y 174 175 speech 08hzunIk81Y 183 196 bicycle 08hzunIk81Y 185 186 speech 08hzunIk81Y 190 192 laughter 08hzunIk81Y 190 192 clapping 08hzunIk81Y 194 200 car 08hzunIk81Y 197 198 speech 08hzunIk81Y 203 204 speech 08hzunIk81Y 204 225 bicycle 08hzunIk81Y 209 213 car 08hzunIk81Y 217 225 car 08hzunIk81Y 226 227 laughter 08hzunIk81Y 227 228 bicycle 08hzunIk81Y 227 232 car 08hzunIk81Y 230 275 bicycle 08hzunIk81Y 246 271 car 08hzunIk81Y 276 288 bicycle 08hzunIk81Y 281 288 car 08hzunIk81Y 289 290 car 08hzunIk81Y 297 347 speech 08hzunIk81Y 297 347 car 08hzunIk81Y 300 305 dog 08hzunIk81Y 318 321 dog 08hzunIk81Y 335 336 bicycle |
Visual Event Labels
DFYOtfJsXgM 5 11 playing_baseball DFYOtfJsXgM 11 20 playing_badminton DFYOtfJsXgM 20 50 speech DFYOtfJsXgM 50 62 playing_baseball DFYOtfJsXgM 62 77 playing_badminton DFYOtfJsXgM 85 87 clapping DFYOtfJsXgM 87 107 speech DFYOtfJsXgM 108 114 playing_badminton |
Visual Event Labels
_-sfoqUa0vs 0 2 violin _-sfoqUa0vs 0 6 speech _-sfoqUa0vs 3 4 laughter _-sfoqUa0vs 7 10 clapping _-sfoqUa0vs 11 15 violin _-sfoqUa0vs 16 17 clapping _-sfoqUa0vs 18 19 laughter _-sfoqUa0vs 19 22 violin _-sfoqUa0vs 23 26 clapping _-sfoqUa0vs 27 29 violin _-sfoqUa0vs 27 29 speech _-sfoqUa0vs 32 35 violin _-sfoqUa0vs 32 35 speech _-sfoqUa0vs 39 60 violin _-sfoqUa0vs 42 52 speech _-sfoqUa0vs 60 61 laughter _-sfoqUa0vs 62 65 violin _-sfoqUa0vs 73 74 laughter _-sfoqUa0vs 74 76 speech _-sfoqUa0vs 75 76 violin _-sfoqUa0vs 90 95 laughter _-sfoqUa0vs 90 93 laughter _-sfoqUa0vs 95 99 violin _-sfoqUa0vs 103 106 laughter _-sfoqUa0vs 107 117 violin _-sfoqUa0vs 111 112 speech _-sfoqUa0vs 113 117 speech _-sfoqUa0vs 122 124 speech _-sfoqUa0vs 122 141 violin _-sfoqUa0vs 145 152 violin _-sfoqUa0vs 153 154 laughter _-sfoqUa0vs 155 161 violin _-sfoqUa0vs 157 166 laughter _-sfoqUa0vs 168 174 violin _-sfoqUa0vs 175 180 laughter _-sfoqUa0vs 178 180 violin _-sfoqUa0vs 180 182 clapping _-sfoqUa0vs 183 192 violin _-sfoqUa0vs 197 203 violin _-sfoqUa0vs 204 205 cry _-sfoqUa0vs 206 209 violin _-sfoqUa0vs 210 214 laughter _-sfoqUa0vs 211 214 clapping _-sfoqUa0vs 223 247 violin _-sfoqUa0vs 248 250 laughter _-sfoqUa0vs 251 255 violin _-sfoqUa0vs 259 262 violin _-sfoqUa0vs 263 264 laughter _-sfoqUa0vs 265 268 violin _-sfoqUa0vs 274 277 violin _-sfoqUa0vs 277 278 laughter _-sfoqUa0vs 279 316 violin _-sfoqUa0vs 317 318 laughter _-sfoqUa0vs 319 326 violin _-sfoqUa0vs 327 328 laughter _-sfoqUa0vs 329 333 violin _-sfoqUa0vs 334 335 laughter _-sfoqUa0vs 336 338 violin _-sfoqUa0vs 339 340 laughter _-sfoqUa0vs 342 343 cello _-sfoqUa0vs 344 351 violin _-sfoqUa0vs 352 354 cello _-sfoqUa0vs 355 368 violin _-sfoqUa0vs 369 370 laughter _-sfoqUa0vs 371 373 violin _-sfoqUa0vs 377 384 violin _-sfoqUa0vs 385 386 laughter _-sfoqUa0vs 387 394 violin _-sfoqUa0vs 395 398 cello _-sfoqUa0vs 399 403 violin _-sfoqUa0vs 405 408 violin _-sfoqUa0vs 409 413 clapping _-sfoqUa0vs 420 431 clapping _-sfoqUa0vs 420 421 cello _-sfoqUa0vs 425 428 violin _-sfoqUa0vs 432 436 violin _-sfoqUa0vs 437 438 laughter _-sfoqUa0vs 442 443 clapping _-sfoqUa0vs 442 443 laughter _-sfoqUa0vs 448 449 violin _-sfoqUa0vs 452 454 laughter _-sfoqUa0vs 452 453 clapping _-sfoqUa0vs 455 457 violin _-sfoqUa0vs 455 457 speech _-sfoqUa0vs 459 460 clapping _-sfoqUa0vs 461 483 violin _-sfoqUa0vs 475 477 clapping _-sfoqUa0vs 484 486 laughter _-sfoqUa0vs 485 490 violin _-sfoqUa0vs 490 495 laughter _-sfoqUa0vs 492 514 violin _-sfoqUa0vs 515 516 cello _-sfoqUa0vs 515 522 clapping _-sfoqUa0vs 523 524 violin _-sfoqUa0vs 528 530 laughter _-sfoqUa0vs 528 530 clapping _-sfoqUa0vs 531 540 violin _-sfoqUa0vs 541 542 speech _-sfoqUa0vs 543 546 violin _-sfoqUa0vs 547 548 drum _-sfoqUa0vs 549 550 violin _-sfoqUa0vs 551 552 drum _-sfoqUa0vs 552 553 violin _-sfoqUa0vs 554 556 drum _-sfoqUa0vs 558 559 violin _-sfoqUa0vs 567 569 violin _-sfoqUa0vs 571 573 violin _-sfoqUa0vs 576 578 violin _-sfoqUa0vs 580 594 violin _-sfoqUa0vs 592 594 laughter _-sfoqUa0vs 599 608 violin _-sfoqUa0vs 608 609 drum _-sfoqUa0vs 612 613 violin _-sfoqUa0vs 621 622 drum _-sfoqUa0vs 622 623 violin _-sfoqUa0vs 625 626 violin _-sfoqUa0vs 633 644 horse _-sfoqUa0vs 645 646 laughter _-sfoqUa0vs 646 648 horse _-sfoqUa0vs 649 650 violin _-sfoqUa0vs 653 654 drum _-sfoqUa0vs 653 654 clapping _-sfoqUa0vs 653 656 horse _-sfoqUa0vs 660 671 horse _-sfoqUa0vs 666 667 drum _-sfoqUa0vs 673 675 violin _-sfoqUa0vs 676 677 drum _-sfoqUa0vs 680 682 horse _-sfoqUa0vs 683 687 violin _-sfoqUa0vs 688 691 drum _-sfoqUa0vs 692 693 violin _-sfoqUa0vs 694 
695 horse _-sfoqUa0vs 698 699 violin _-sfoqUa0vs 699 700 horse _-sfoqUa0vs 704 707 horse _-sfoqUa0vs 708 715 violin _-sfoqUa0vs 716 730 horse _-sfoqUa0vs 723 725 clapping _-sfoqUa0vs 730 736 clapping _-sfoqUa0vs 737 742 violin |
Audio Event Labels
08hzunIk81Y 5 23 drum 08hzunIk81Y 23 24 cat 08hzunIk81Y 25 103 drum 08hzunIk81Y 103 104 car 08hzunIk81Y 104 106 speech 08hzunIk81Y 107 210 drum 08hzunIk81Y 131 135 cheering 08hzunIk81Y 136 137 laughter 08hzunIk81Y 139 143 cheering 08hzunIk81Y 140 143 clapping 08hzunIk81Y 150 151 cheering 08hzunIk81Y 177 181 speech 08hzunIk81Y 187 193 cheering 08hzunIk81Y 190 193 clapping 08hzunIk81Y 210 211 speech 08hzunIk81Y 218 292 drum 08hzunIk81Y 295 346 speech |
Audio Event Labels
DFYOtfJsXgM 0 4 drum DFYOtfJsXgM 5 12 playing_baseball DFYOtfJsXgM 9 114 speech DFYOtfJsXgM 12 20 playing_badminton DFYOtfJsXgM 53 61 playing_baseball DFYOtfJsXgM 62 87 playing_badminton DFYOtfJsXgM 84 86 cheering DFYOtfJsXgM 84 86 clapping DFYOtfJsXgM 107 114 playing_badminton DFYOtfJsXgM 111 114 cheering |
Audio Event Labels
_-sfoqUa0vs 0 6 speech _-sfoqUa0vs 6 32 cheering _-sfoqUa0vs 6 37 clapping _-sfoqUa0vs 27 39 speech _-sfoqUa0vs 40 41 laughter _-sfoqUa0vs 42 52 speech _-sfoqUa0vs 53 55 speech _-sfoqUa0vs 56 57 speech _-sfoqUa0vs 63 64 speech _-sfoqUa0vs 71 74 laughter _-sfoqUa0vs 73 83 clapping _-sfoqUa0vs 75 80 speech _-sfoqUa0vs 85 86 speech _-sfoqUa0vs 90 95 laughter _-sfoqUa0vs 90 96 clapping _-sfoqUa0vs 92 93 speech _-sfoqUa0vs 96 97 speech _-sfoqUa0vs 98 103 laughter _-sfoqUa0vs 103 109 speech _-sfoqUa0vs 102 105 clapping _-sfoqUa0vs 111 112 speech _-sfoqUa0vs 113 125 speech _-sfoqUa0vs 126 208 violin _-sfoqUa0vs 156 158 laughter _-sfoqUa0vs 158 159 clapping _-sfoqUa0vs 163 164 laughter _-sfoqUa0vs 164 165 clapping _-sfoqUa0vs 173 184 clapping _-sfoqUa0vs 210 217 clapping _-sfoqUa0vs 214 215 cheering _-sfoqUa0vs 224 408 violin _-sfoqUa0vs 247 249 laughter _-sfoqUa0vs 248 251 clapping _-sfoqUa0vs 334 335 clapping _-sfoqUa0vs 407 464 clapping _-sfoqUa0vs 408 450 cheering _-sfoqUa0vs 451 453 laughter _-sfoqUa0vs 455 458 speech _-sfoqUa0vs 458 460 cheering _-sfoqUa0vs 462 463 speech _-sfoqUa0vs 468 509 violin _-sfoqUa0vs 474 495 clapping _-sfoqUa0vs 494 495 laughter _-sfoqUa0vs 509 525 cheering _-sfoqUa0vs 509 530 clapping _-sfoqUa0vs 519 525 speech _-sfoqUa0vs 528 529 laughter _-sfoqUa0vs 529 537 speech _-sfoqUa0vs 539 541 speech _-sfoqUa0vs 542 559 drum _-sfoqUa0vs 568 570 violin _-sfoqUa0vs 572 574 violin _-sfoqUa0vs 576 578 violin _-sfoqUa0vs 580 624 violin _-sfoqUa0vs 609 610 drum _-sfoqUa0vs 609 629 singing _-sfoqUa0vs 613 614 drum _-sfoqUa0vs 617 618 drum _-sfoqUa0vs 621 624 drum _-sfoqUa0vs 630 640 clapping _-sfoqUa0vs 634 730 drum _-sfoqUa0vs 634 730 violin _-sfoqUa0vs 634 730 cello _-sfoqUa0vs 723 742 clapping |
Visual Event Labels
9s0T5-rPOZo 22 32 alarm 9s0T5-rPOZo 37 40 bicycle 9s0T5-rPOZo 41 92 alarm 9s0T5-rPOZo 107 114 alarm 9s0T5-rPOZo 119 126 alarm 9s0T5-rPOZo 128 140 alarm 9s0T5-rPOZo 141 142 alarm 9s0T5-rPOZo 149 150 alarm 9s0T5-rPOZo 154 179 alarm 9s0T5-rPOZo 213 224 alarm 9s0T5-rPOZo 241 271 alarm |
Visual Event Labels
AIFOZsn-bFE 0 9 speech AIFOZsn-bFE 0 9 frisbee AIFOZsn-bFE 9 14 dog AIFOZsn-bFE 12 21 frisbee AIFOZsn-bFE 14 40 speech AIFOZsn-bFE 25 27 frisbee AIFOZsn-bFE 40 109 frisbee AIFOZsn-bFE 40 45 dog AIFOZsn-bFE 45 46 speech AIFOZsn-bFE 46 50 dog AIFOZsn-bFE 50 52 speech AIFOZsn-bFE 52 109 dog AIFOZsn-bFE 56 78 speech AIFOZsn-bFE 89 92 speech AIFOZsn-bFE 94 100 speech AIFOZsn-bFE 103 107 speech AIFOZsn-bFE 109 109 clapping AIFOZsn-bFE 110 112 speech AIFOZsn-bFE 112 113 frisbee AIFOZsn-bFE 112 113 dog AIFOZsn-bFE 113 113 clapping AIFOZsn-bFE 116 118 frisbee AIFOZsn-bFE 116 119 dog AIFOZsn-bFE 119 121 speech AIFOZsn-bFE 122 124 speech AIFOZsn-bFE 124 131 dog AIFOZsn-bFE 125 131 frisbee AIFOZsn-bFE 126 169 speech |
Visual Event Labels
7YwxtA1MNuk 0 74 guitar 7YwxtA1MNuk 0 116 laughter 7YwxtA1MNuk 0 134 speech 7YwxtA1MNuk 14 72 singing 7YwxtA1MNuk 113 221 guitar 7YwxtA1MNuk 119 134 singing 7YwxtA1MNuk 147 219 singing 7YwxtA1MNuk 148 235 laughter 7YwxtA1MNuk 161 235 speech 7YwxtA1MNuk 227 235 cheering 7YwxtA1MNuk 227 235 clapping |
Audio Event Labels
9s0T5-rPOZo 9 27 speech 9s0T5-rPOZo 28 147 alarm 9s0T5-rPOZo 117 118 speech 9s0T5-rPOZo 156 272 speech 9s0T5-rPOZo 252 283 piano |
Audio Event Labels
AIFOZsn-bFE 0 116 speech AIFOZsn-bFE 108 109 clapping AIFOZsn-bFE 113 113 clapping AIFOZsn-bFE 117 119 cheering AIFOZsn-bFE 119 119 clapping AIFOZsn-bFE 120 123 speech AIFOZsn-bFE 125 169 speech |
Audio Event Labels
7YwxtA1MNuk 0 13 guitar 7YwxtA1MNuk 13 16 singing 7YwxtA1MNuk 15 226 guitar 7YwxtA1MNuk 17 20 singing 7YwxtA1MNuk 22 29 singing 7YwxtA1MNuk 30 33 singing 7YwxtA1MNuk 34 37 singing 7YwxtA1MNuk 38 77 singing 7YwxtA1MNuk 78 81 singing 7YwxtA1MNuk 82 113 singing 7YwxtA1MNuk 119 122 singing 7YwxtA1MNuk 123 126 singing 7YwxtA1MNuk 127 138 singing 7YwxtA1MNuk 140 181 singing 7YwxtA1MNuk 182 184 singing 7YwxtA1MNuk 186 193 singing 7YwxtA1MNuk 194 218 singing 7YwxtA1MNuk 227 235 cheering 7YwxtA1MNuk 233 234 speech 7YwxtA1MNuk 227 235 clapping |
Extracted features:
Annotations (train, val, and test sets): available for download at GitHub.
We provide video-level annotations for the training set, and both video-level and event-level annotations for the validation and testing sets. The annotation files are stored in CSV format.
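For instance, here is a minimal sketch of reading event-level annotations in the format shown in the examples above (video ID, start second, end second, event label); the exact file name and column layout are assumptions, not the official schema.

import csv
from collections import defaultdict

def load_event_annotations(path):
    # Group event-level annotations by video ID.
    # Assumed row format: video_id, start, end, label.
    events = defaultdict(list)
    with open(path, newline="") as f:
        for video_id, start, end, label in csv.reader(f):
            events[video_id].append((int(start), int(end), label))
    return events

# Hypothetical usage:
# events = load_event_annotations("lfav_visual_events.csv")
# print(events["-6v1PqgFrDk"])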
If you find our work useful in your research, please cite our paper.
@article{hou2023towards,
title={Towards Long Form Audio-visual Video Understanding},
author={Hou, Wenxuan and Li, Guangyao and Tian, Yapeng and Hu, Di},
journal={ACM Transactions on Multimedia Computing, Communications and Applications},
year={2023},
publisher={ACM New York, NY}
}
The released LFAV dataset is curated and may therefore carry potential correlations between musical instruments and geographical areas. This issue warrants further research and consideration.
All datasets and benchmarks on this page are copyrighted by us and published under the Creative Commons Attribution-NonCommercial 4.0 International License. This means that you must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use. You may not use the material for commercial purposes.
We propose an event-centric framework that contains three phases: snippet prediction, event extraction, and event interaction.
Firstly, we propose a pyramid multimodal transformer model to capture events of different temporal lengths, where the audio and visual snippet features interact with each other within multi-scale temporal windows.
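As a rough illustration (not the exact PMT layer; the window size, feature dimension, and module names are assumptions), the PyTorch sketch below restricts cross-modal attention to local temporal windows; stacking such layers with windows of increasing size would yield the multi-scale pyramid.

import torch
import torch.nn as nn

class WindowedCrossModalAttention(nn.Module):
    # Cross-modal attention within non-overlapping temporal windows:
    # snippets of one modality attend to the other modality only
    # inside a local window of `window` snippets.
    def __init__(self, dim=512, heads=8, window=16):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query, key_value):
        # query, key_value: (B, T, dim); T is assumed divisible by window.
        B, T, D = query.shape
        w = self.window
        q = query.reshape(B * T // w, w, D)    # split into sub-sequences
        kv = key_value.reshape(B * T // w, w, D)
        out, _ = self.attn(q, kv, kv)          # attend within each window
        return out.reshape(B, T, D)

# Hypothetical usage: 64 one-second snippets per modality.
audio, visual = torch.randn(2, 64, 512), torch.randn(2, 64, 512)
audio_refined = WindowedCrossModalAttention()(audio, visual)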
Secondly, we propose to model the video as structured event graphs according to the snippet predictions, based on which we refine the event-aware snippet-level features and aggregate them into event features.
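For intuition, here is a minimal sketch of the aggregation step (a simplifying assumption, not the exact formulation in the paper): snippet features are combined into one event feature using attention weights derived from the per-snippet scores of that event.

import torch

def aggregate_event_feature(snippet_feats, event_scores):
    # snippet_feats: (T, D) snippet-level features of one video.
    # event_scores: (T,) per-snippet scores for one event category,
    # reused here as attention weights.
    weights = torch.softmax(event_scores, dim=0)               # (T,)
    return (weights.unsqueeze(-1) * snippet_feats).sum(dim=0)  # (D,)

# Hypothetical usage:
event_feat = aggregate_event_feature(torch.randn(64, 512), torch.randn(64))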
Finally, we study event relations by modeling the influence among the aggregated audio and visual events and then refining the event features accordingly.
The three phases progressively achieve a comprehensive understanding of the video content as well as the event relations, and are jointly optimized with video-level event labels in an end-to-end fashion.
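Because only video-level labels are available during training, a plausible sketch of this objective (assumed; the pooling design follows the temporal attention pooling described in the figure caption below, but the details are simplified) is to pool snippet-level logits into video-level predictions and apply a multi-label binary cross-entropy loss.

import torch
import torch.nn as nn

class TemporalAttentionPooling(nn.Module):
    # Pools snippet-level logits into a video-level prediction with
    # learned per-class temporal attention weights; the weights can be
    # reused to aggregate event features in the event extraction phase.
    def __init__(self, dim=512, num_classes=35):
        super().__init__()
        self.classifier = nn.Linear(dim, num_classes)  # snippet-level logits
        self.attention = nn.Linear(dim, num_classes)   # per-class attention

    def forward(self, feats):                               # (B, T, dim)
        snippet_logits = self.classifier(feats)             # (B, T, C)
        attn = torch.softmax(self.attention(feats), dim=1)  # softmax over time
        video_logits = (attn * snippet_logits).sum(dim=1)   # (B, C)
        return snippet_logits, video_logits

# Weak supervision with video-level multi-label BCE only.
model = TemporalAttentionPooling()
snippet_logits, video_logits = model(torch.randn(2, 64, 512))
labels = torch.randint(0, 2, (2, 35)).float()
loss = nn.BCEWithLogitsLoss()(video_logits, labels)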
We want to highlight that the inherent relations among multiple events are essential for understanding the temporal structures and dynamic semantics of long form audio-visual videos, which has not been sufficiently considered in previous event localization works.
More details can be found in the paper.
An illustration of our event-centric framework. Top: in the first phase, snippet prediction, we propose a pyramid multimodal transformer to generate the snippet features as well as their category predictions. Middle left: in the second phase, event extraction, we build an event-aware graph to refine the snippet features and then aggregate the event-aware snippet features into event features. Middle right: in the third phase, event interaction, we model the event relations in both intra-modal and cross-modal scenarios and then refine each event feature by referring to its relations to other events. Bottom left: the architecture of temporal attention pooling, which outputs both snippet-level and video-level predictions; the internal attention weights are used to obtain event features in the event extraction phase. Bottom right: an equivalent form of window attention in the PMT layer, showing how the window operates in the first phase. The window splits the feature sequence into several sub-sequences, and attention is then performed within each sub-sequence independently. Here we show an example of self-attention; the operation for cross-modal attention is similar.
To validate the superiority of our proposed framework, we choose 16 related methods for comparison, including weakly supervised temporal action localization methods: STPN, RSKP; long sequence modeling methods: Longformer, Transformer-LS, ActionFormer, FLatten Transformer; audio-visual learning methods: AVE, AVSlowFast, HAN, PSP, DHHN, CMPAE, CM-PIE; video classification methods: SlowFast, MViT, and MeMViT.
Comparison to Other Methods. Firstly, temporal action localization and long sequence modeling methods aim to localize action events in untrimmed videos or to model long sequences effectively, but they ignore the valuable cooperation between the audio and visual modalities, which is important for more comprehensive video event understanding. Secondly, although some methods take the audio signal into account, they are consistently worse than our method, likely because they mainly target trimmed short videos and thus model long-range dependencies and event interactions only to a limited extent. Thirdly, our proposed method clearly outperforms all comparison methods; although some recent video classification methods achieve slightly better results on visual mAP, their overall performance still lags well behind ours, showing that our event-centric framework better localizes both audio and visual events in long form audio-visual videos.
Effectiveness of Three Phases. Our full method consists of three progressive phases. The snippet prediction phase alone already surpasses most comparison methods, and the subsequent phases further improve localization performance. The results, shown in the last three rows of the above table, indicate the importance of decoupling a long form audio-visual video into multiple uni-modal events of different lengths and modeling their inherent relations in both uni-modal and cross-modal scenarios.
We visualize the event-level localization results in the videos; two examples are shown in the above figure. Compared with the audio-visual video parsing method HAN, our proposed method achieves better localization results. In some situations (e.g., the event guitar in both the audio and visual modalities of video 01, and the event speech in the audio modality of video 02), HAN tends to localize sparse and short video clips instead of a long and complete event, which shows that HAN has limitations in understanding long-form videos, possibly because it cannot learn long-range dependencies well.
We also notice that, although our proposed event-centric method achieves the best performance among all methods, there still exist some failure cases in the shown examples (red and black boxes in the figure). Multisensory events have widely varying lengths and occur in dynamic long-range scenes, which makes multisensory temporal event localization a very challenging task, especially when only video-level labels are available for training. More experimental results and analysis can be found in the paper.
We are a group of researchers working on multimodal learning and computer vision from Renmin University of China and the University of Texas at Dallas.