🤖 AI Summary
Efficiently constructing queryable and evolvable knowledge graph (KG) indexes from multimodal video content remains challenging due to semantic heterogeneity and computational complexity. Method: This paper proposes a lightweight, modular multimodal analysis framework that integrates vision, speech, and language foundation models to parse videos into temporally ordered, semi-structured frame-level semantic units, and automatically generates SPARQL-queryable index KGs. The framework decouples perception, cross-modal alignment, and graph construction into independent modules, enabling plug-and-play integration of open-source models and minimizing engineering overhead. Contribution/Results: It introduces an interactive knowledge expansion interface and an incremental KG update mechanism to support dynamic domain-knowledge injection and continual learning. Experiments demonstrate that the approach maintains high semantic fidelity while significantly reducing computational cost and deployment complexity, enhancing the scalability and practicality of multimodal KG systems.
📝 Abstract
Analysis of multi-modal content can be tricky, computationally expensive, and engineering-intensive. A large body of work applies pre-trained models to static data, yet fusing these open-source models and methods with complex data such as videos remains challenging. In this paper, we present a framework that enables efficient prototyping of pipelines for multi-modal content analysis. We craft a candidate recipe for a pipeline, marrying a set of pre-trained models, to convert videos into a temporal semi-structured data format. We further translate this structure into a frame-level indexed knowledge graph representation that is queryable and supports continual learning, enabling the dynamic incorporation of new domain-specific knowledge through an interactive medium.
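To make the pipeline concrete, here is a minimal, purely illustrative sketch (not the paper's code) of the two stages the abstract describes: hypothetical per-frame outputs from vision and speech models form semi-structured semantic units, which are flattened into subject-predicate-object triples; a naive pattern matcher stands in for the SPARQL queries a real index KG would answer. All field names and predicates are assumptions for illustration.

```python
# Hypothetical frame-level semantic units, as might be emitted by
# pre-trained captioning, object-detection, and speech models.
frames = [
    {"ts": 0.0, "caption": "a person opens a laptop",
     "objects": ["person", "laptop"], "speech": ""},
    {"ts": 2.5, "caption": "the person starts typing",
     "objects": ["person", "keyboard"], "speech": "let's begin"},
]

def to_triples(frames):
    """Flatten semantic units into (subject, predicate, object) triples,
    indexed by frame, as a toy stand-in for the index KG."""
    triples = []
    for i, f in enumerate(frames):
        s = f"frame:{i}"
        triples.append((s, "hasTimestamp", f["ts"]))
        triples.append((s, "hasCaption", f["caption"]))
        triples.append((s, "hasSpeech", f["speech"]))
        for obj in f["objects"]:
            triples.append((s, "depicts", obj))
    return triples

def match(triples, s=None, p=None, o=None):
    """Naive triple-pattern match; a real KG would answer this via SPARQL."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

kg = to_triples(frames)
# Query: which frames depict a laptop?
print([s for s, _, _ in match(kg, p="depicts", o="laptop")])
```

In the framework itself the triples would live in a proper RDF store so that standard SPARQL tooling applies, and new domain knowledge would be injected by appending triples incrementally rather than rebuilding the graph.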