From Videos to Indexed Knowledge Graphs -- Framework to Marry Methods for Multimodal Content Analysis and Understanding
By: Basem Rizk, Joel Walsh, Mark Core, and more
Potential Business Impact:
Helps computers understand videos better and keep learning.
Analysis of multi-modal content can be tricky and computationally expensive, and it requires significant engineering effort. A large body of work applies pre-trained models to static data, yet fusing these open-source models and methods with complex data such as video remains challenging. In this paper, we present a framework that enables efficient prototyping of pipelines for multi-modal content analysis. We craft a candidate recipe for a pipeline, marrying a set of pre-trained models, to convert videos into a temporal, semi-structured data format. We then translate this structure into a frame-level indexed knowledge graph representation that is queryable and supports continual learning, enabling the dynamic incorporation of new domain-specific knowledge through an interactive medium.
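The abstract does not spell out the graph schema, but a minimal sketch can make the final representation concrete. Assuming each pre-trained model (detector, captioner, ASR, etc.) emits per-frame observations, the toy Observation and FrameIndexedKG classes below (hypothetical names, not from the paper) fuse those observations into triples indexed by frame number, so the structure can answer both frame-level and entity-level queries:

```python
from dataclasses import dataclass

# Hypothetical per-frame annotation: each pre-trained model is assumed
# to emit labeled observations tied to a frame index.
@dataclass
class Observation:
    frame: int      # frame index within the video
    modality: str   # e.g., "vision", "audio", "text"
    subject: str
    relation: str
    obj: str

class FrameIndexedKG:
    """Toy frame-level indexed knowledge graph: triples plus a frame index."""
    def __init__(self):
        self.triples = []      # (subject, relation, object, frame, modality)
        self.frame_index = {}  # frame -> list of triple ids

    def add(self, ob: Observation) -> None:
        tid = len(self.triples)
        self.triples.append((ob.subject, ob.relation, ob.obj, ob.frame, ob.modality))
        self.frame_index.setdefault(ob.frame, []).append(tid)

    def query_frame(self, frame: int):
        """Return all triples observed at a given frame."""
        return [self.triples[t] for t in self.frame_index.get(frame, [])]

    def query_entity(self, entity: str):
        """Return the frames where an entity appears, enabling temporal queries."""
        return sorted({f for (s, r, o, f, m) in self.triples if entity in (s, o)})

# Usage: fuse outputs of two hypothetical models into one graph.
kg = FrameIndexedKG()
kg.add(Observation(frame=12, modality="vision", subject="person", relation="holds", obj="cup"))
kg.add(Observation(frame=12, modality="audio", subject="person", relation="says", obj="hello"))
print(kg.query_frame(12))      # everything observed at frame 12
print(kg.query_entity("person"))  # frames in which "person" appears
```

Supporting continual learning here would amount to calling add() with new observations as further domain-specific models or human annotations arrive; the frame index keeps such additions cheap.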
Similar Papers
Multi-modal video data-pipelines for machine learning with minimal human supervision
CV and Pattern Recognition
Lets computers understand videos and sounds together.
Effectively obtaining acoustic, visual and textual data from videos
Multimedia
Creates new data from videos for AI to learn from.
Unraveling Hidden Representations: A Multi-Modal Layer Analysis for Better Synthetic Content Forensics
Artificial Intelligence
Quickly spots fake pictures and sounds.