FinCap: Topic-Aligned Captions for Short-Form Financial YouTube Videos
By: Siddhant Sukhani, Yash Bhardwaj, Riya Bhadani, and more
Potential Business Impact:
Helps computers understand money videos by watching and listening.
We evaluate multimodal large language models (MLLMs) for topic-aligned captioning in financial short-form videos (SVs) by testing joint reasoning over transcripts (T), audio (A), and video (V). Using 624 annotated YouTube SVs, we assess all seven modality combinations (T, A, V, TA, TV, AV, TAV) across five topics: main recommendation, sentiment analysis, video purpose, visual analysis, and financial entity recognition. Video alone performs strongly on four of the five topics, underscoring its value for capturing visual context and affective cues such as emotions, gestures, and body language. Selective pairs such as TV or AV often surpass TAV, suggesting that adding more modalities can introduce noise. These results establish the first baselines for financial short-form video captioning and illustrate both the potential and the challenges of grounding complex visual cues in this domain. All code and data are available on our GitHub under the CC BY-NC-SA 4.0 license.
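The evaluation grid the abstract describes is easy to make concrete: enumerate the seven non-empty subsets of {T, A, V}, caption each video under every combination and topic, and average scores against the annotations. The sketch below shows one way to structure that loop; the `caption_fn` and `score_fn` interfaces, the video/annotation data layout, and all names are hypothetical placeholders, not the paper's actual code.

```python
# A minimal sketch of the modality-combination evaluation grid, assuming
# hypothetical caption_fn / score_fn callables and a simple dict layout
# for annotated videos. The real FinCap pipeline is on the authors' GitHub.
from itertools import combinations

MODALITIES = ("T", "A", "V")  # transcript, audio, video
TOPICS = (
    "main recommendation",
    "sentiment analysis",
    "video purpose",
    "visual analysis",
    "financial entity recognition",
)

def modality_combos():
    """Yield all seven non-empty combinations: T, A, V, TA, TV, AV, TAV."""
    for r in range(1, len(MODALITIES) + 1):
        for combo in combinations(MODALITIES, r):
            yield "".join(combo)

def evaluate(videos, caption_fn, score_fn):
    """Score every (modality combo, topic) cell over the annotated videos.

    caption_fn(video, combo, topic) -> generated caption (hypothetical MLLM call)
    score_fn(caption, reference)    -> float similarity to the annotation
    """
    results = {}
    for combo in modality_combos():
        for topic in TOPICS:
            scores = [
                score_fn(caption_fn(v, combo, topic), v["annotations"][topic])
                for v in videos
            ]
            results[(combo, topic)] = sum(scores) / len(scores)
    return results
```

Reading the resulting 7 x 5 table row by row is what surfaces findings like video-only (V) leading on four topics while pairs such as TV or AV beat the full TAV combination.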
Similar Papers
An Empirical Study for Representations of Videos in Video Question Answering via MLLMs
Information Retrieval
Helps computers understand videos better and faster.
A Video Is Not Worth a Thousand Words
CV and Pattern Recognition
Shows how AI understands videos and text.
Aligned Better, Listen Better for Audio-Visual Large Language Models
CV and Pattern Recognition
Helps computers understand videos by listening.