Summarization of Multimodal Presentations with Vision-Language Models: Study of the Effect of Modalities and Structure
By: Théo Gigant, Camille Guinaudeau, Frédéric Dufaux
Potential Business Impact:
Helps computers summarize videos and text together.
Vision-Language Models (VLMs) can process visual and textual information in multiple formats: texts, images, interleaved texts and images, or even hour-long videos. In this work, we conduct fine-grained quantitative and qualitative analyses of automatic summarization of multimodal presentations using VLMs with various representations as input. From these experiments, we suggest cost-effective strategies for generating summaries from text-heavy multimodal documents under different input-length budgets using VLMs. We show that slides extracted from the video stream can be beneficially used as input in place of the raw video, and that a structured representation interleaving slides and transcript provides the best performance. Finally, we reflect and comment on the nature of cross-modal interactions in multimodal presentations and share suggestions to improve the capabilities of VLMs to understand documents of this nature.
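To make the "structured representation interleaving slides and transcript" idea concrete, here is a minimal sketch (not the authors' code) of one way such an input could be assembled before being sent to a VLM. The segment data structure, the slide/transcript alignment, and the generic content-block format are assumptions; a real pipeline would detect slide changes in the video stream and adapt the blocks to the specific VLM API's schema.

```python
# Minimal sketch: interleave slide images with their time-aligned transcript
# segments into a single ordered list of content blocks for a VLM.
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    start: float      # segment start time in seconds
    end: float        # segment end time in seconds
    slide_path: str   # path to the slide image shown during this segment
    transcript: str   # spoken transcript aligned with this slide


def build_interleaved_prompt(segments: List[Segment]) -> List[dict]:
    """Build a structured, interleaved slides + transcript representation
    as a list of generic content blocks (image references and text)."""
    blocks: List[dict] = [
        {"type": "text",
         "text": "Summarize the following presentation. Slides and their "
                 "spoken transcript are given in presentation order."}
    ]
    for seg in segments:
        blocks.append({"type": "image", "path": seg.slide_path})
        blocks.append({"type": "text",
                       "text": f"[{seg.start:.0f}s-{seg.end:.0f}s] {seg.transcript}"})
    return blocks


if __name__ == "__main__":
    # Hypothetical example data; real slide boundaries would come from
    # detecting slide transitions in the presentation video.
    segments = [
        Segment(0, 45, "slides/slide_01.png", "Welcome, today we present ..."),
        Segment(45, 120, "slides/slide_02.png", "Our method extracts slides ..."),
    ]
    prompt = build_interleaved_prompt(segments)
    print(f"Built {len(prompt)} content blocks for the VLM.")
```

Keeping slides as still images rather than passing the raw video keeps the input within a given token budget while preserving the visual content the transcript refers to, which is the trade-off the paper's input-length experiments examine.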
Similar Papers
A Survey on Efficient Vision-Language Models
CV and Pattern Recognition
Makes smart AI work on small, slow devices.
Integrating Video and Text: A Balanced Approach to Multimodal Summary Generation and Evaluation
Computation and Language
Creates better TV show summaries from video.
Enhancing Subsequent Video Retrieval via Vision-Language Models (VLMs)
CV and Pattern Recognition
Finds videos faster by understanding their stories.