🎓M3AV

A Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset

1 Cooperative Medianet Innovation Center, Shanghai Jiao Tong University
2 Department of Electronic Engineering, Tsinghua University
3 Department of Engineering, University of Cambridge
4 Shanghai AI Laboratory
ACL 2024 Main Conference

Abstract

Publishing open-source academic video recordings is an emerging and prevalent approach to sharing knowledge online. Such videos carry rich multimodal information, including the speech, the facial and body movements of the speakers, as well as the text and pictures in the slides and, possibly, the accompanying papers. Although multiple academic video datasets have been constructed and released, few of them support both multimodal content recognition and understanding tasks, which is partially due to the lack of high-quality human annotations.

In this paper, we propose a novel multimodal, multigenre, and multipurpose audio-visual academic lecture dataset (🎓M3AV), which has almost 367 hours of videos from five sources covering computer science, mathematics, and medical and biology topics. With high-quality human annotations of the slide text and spoken words, in particular high-value named entities, the dataset can be used for multiple audio-visual recognition and understanding tasks.

Evaluations performed on contextual speech recognition, speech synthesis, and slide and script generation tasks demonstrate that the diversity of 🎓M3AV makes it a challenging dataset.

🎓M3AV Dataset

Overview

The overview of our 🎓M3AV dataset is shown below. The first component is the slides, annotated with simple and complex blocks that are then merged according to a set of rules. The second component is the speech, annotated with special vocabulary, spoken and written forms, and word-level timestamps. The third component is the paper corresponding to the video; the asterisk (*) denotes that only the computer science videos have corresponding papers.

Figure 1: The overview of the 🎓M3AV dataset.
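To make the three components above concrete, the following is a minimal, hypothetical sketch (in Python form) of what a single annotated record could look like; all field names and values are illustrative assumptions rather than the dataset's actual schema.

record = {
    "video_id": "CS-001",                 # hypothetical identifier
    "field": "computer science",
    "slides": [{
        "page": 3,
        "blocks": [                       # simple/complex blocks, later merged by rule
            {"type": "simple",  "bbox": [120, 80, 860, 140], "text": "Tree-constrained pointer generator"},
            {"type": "complex", "bbox": [100, 200, 900, 620], "text": "TCPGen architecture diagram"},
        ],
    }],
    "speech": [{
        "spoken":  "so t c p gen copies rare words",            # spoken form
        "written": "So TCPGen copies rare words.",              # written form
        "words": [{"w": "so", "start": 12.31, "end": 12.48}],   # word-level timestamps
        "special_vocab": ["TCPGen"],                            # high-value named entities
    }],
    "paper": "https://arxiv.org/abs/xxxx.xxxxx",  # (*) only CS videos have papers
}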

Statistics

Figure 2: Statistics of the 🎓M3AV dataset.

Comparison with Related Work

The 🎓M3AV dataset contains the most complete set of human-annotated slide, speech, and paper resources, thus supporting not only the recognition of multimodal content but also the comprehension of high-level academic knowledge. At the same time, the dataset is comparatively large while remaining openly accessible.

Table 1: Comparison with other academic lecture-based datasets in terms of data types and designed tasks. "A" denotes fully automated processing and "M" denotes fully or partially manual labelling.


Table 2: Comparison with other academic lecture-based datasets in terms of data size and availability.

Benchmark Systems

ASR & Contextual ASR

End-to-end models struggle with rare word recognition, as reflected by the biased word error rate (BWER): compared with the overall WER, the error rate on rare words is more than twice as high.
Using TCPGen with biasing lists built from the OCR information (contextual ASR, CASR), we achieve relative BWER reductions of 37.8% and 34.2% on the dev and test sets, respectively.
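For reference, the following is a minimal sketch of how WER and a biased word error rate (BWER) restricted to a list of rare words can be computed; the alignment and the error-counting convention are simplified assumptions and do not necessarily match the exact scoring used for Table 3.

from typing import List, Set


def align(ref: List[str], hyp: List[str]):
    """Levenshtein alignment; returns a list of (op, ref_word, hyp_word)."""
    n, m = len(ref), len(hyp)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            ops.append(("ok" if ref[i - 1] == hyp[j - 1] else "sub", ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], None))
            i -= 1
        else:
            ops.append(("ins", None, hyp[j - 1]))
            j -= 1
    return ops[::-1]


def wer(ref: List[str], hyp: List[str]) -> float:
    """Overall word error rate: (sub + del + ins) / reference length."""
    return sum(op != "ok" for op, _, _ in align(ref, hyp)) / len(ref)


def bwer(ref: List[str], hyp: List[str], biasing: Set[str]) -> float:
    """Error rate counted only on the rare/biasing words (simplified BWER)."""
    errors = sum(
        1 for op, r, h in align(ref, hyp)
        if (op in ("sub", "del") and r in biasing) or (op == "ins" and h in biasing)
    )
    return errors / max(sum(w in biasing for w in ref), 1)


ref = "the tcpgen component copies rare words from the slides".split()
hyp = "the tcp gen component copies rare words from the slides".split()
print(f"WER  = {wer(ref, hyp):.2%}")               # 1 substitution + 1 insertion over 9 words
print(f"BWER = {bwer(ref, hyp, {'tcpgen'}):.2%}")  # the single biasing word is misrecognized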

Table 3: Evaluation results on ASR and CASR tasks.

Spontaneous TTS

The MQTTS model shows the best performance across all evaluation metrics, indicating that the real, spontaneous speech in our dataset can drive AI systems to synthesize more natural speech.

Table 4: Evaluation results on the Spontaneous TTS task. “GT” denotes the ground truth.

Slide and Script Generation

(1) The open-source models (LLaMA-2, InstructBLIP) show only limited improvement when scaled from 7B to 13B parameters, and their performance remains far behind that of the closed-source models (GPT-4 and GPT-4V). We believe that high-quality pre-training data, e.g., informative corpora and visual QA data that encapsulate multimodal information, is required to enhance their SSG performance beyond simply increasing model size.
(2) The latest LMM (GPT-4V) already outperforms the cascaded pipeline composed of unimodal expert models. This suggests that the LMM not only retains the ability to process textual information but also possesses multimodal perception capabilities, such as recognizing the content of the slides.
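As an illustration of the integrated setting, the following is a hypothetical sketch of prompting an LMM with a single slide image for script generation via the OpenAI chat completions API; the prompt wording, model name, and file name are assumptions, not the paper's actual configuration.

import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

with open("slide_page_03.png", "rb") as f:  # hypothetical slide image
    slide_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative stand-in for the GPT-4V model used in the paper
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "You are a lecturer. Generate the spoken script for this slide, "
                     "as it might be delivered in class."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{slide_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)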

Table 5: Evaluation results on SSG tasks. The upper part of “Slide→Script” shows cascaded pipelines, while the lower part shows integrated systems.

(3) Retrieval-augmented generation (RAG) substantially enhances the generation quality, as shown by the improvement obtained after introducing paper information.

Table 6: Performance improvements of LLaMA-2 7B brought by retrieving paper information. “Subset” denotes that only the Computer Science videos are included in all sets, as they are the only ones with downloadable papers.
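For illustration, the following is a minimal sketch of the retrieval step: rank paper paragraphs against the slide text and prepend the best matches to the generation prompt. The TF-IDF retriever, the example texts, and the prompt template are assumptions and not necessarily what the paper uses.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical paper paragraphs and OCR'd slide text.
paper_paragraphs = [
    "We propose a tree-constrained pointer generator for contextual ASR ...",
    "The dataset contains lectures from five sources across four domains ...",
    "Our slide and script generation benchmark evaluates LLMs and LMMs ...",
]
slide_text = "Slide and script generation benchmark: LLMs vs. LMMs"

# Rank paragraphs by cosine similarity to the slide text and keep the top 2.
vectorizer = TfidfVectorizer().fit(paper_paragraphs + [slide_text])
paper_vecs = vectorizer.transform(paper_paragraphs)
query_vec = vectorizer.transform([slide_text])
scores = cosine_similarity(query_vec, paper_vecs)[0]
top_k = scores.argsort()[::-1][:2]

context = "\n".join(paper_paragraphs[i] for i in top_k)
prompt = (
    f"Relevant excerpts from the paper:\n{context}\n\n"
    f"Slide content:\n{slide_text}\n\n"
    "Write the speech the lecturer would give for this slide."
)
print(prompt)  # fed to LLaMA-2 (or any other generator) downstream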

Conclusion

We release the Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset (🎓M3AV) covering a range of academic fields. The dataset contains manually annotated speech transcriptions, slide text, and additional extracted papers, providing a basis for evaluating AI models on recognizing multimodal content and understanding academic knowledge. We detail the creation pipeline and conduct various analyses of the dataset. Furthermore, we build benchmarks and conduct experiments around the dataset. We find that existing models still have large room for improvement in perceiving and understanding academic lecture videos.

BibTeX

@article{chen2024m3av,
      title={{M\textsuperscript{3}AV}: A Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset},
      author={Chen, Zhe and Liu, Heyang and Yu, Wenyi and Sun, Guangzhi and Liu, Hongcheng and Wu, Ji and Zhang, Chao and Wang, Yu and Wang, Yanfeng},
      journal={arXiv preprint arXiv:2403.14168},
      year={2024}
}