🤖 AI Summary
To address fine-grained temporal understanding and reasoning in long videos, this paper introduces TemporalVLM—the first video large language model (video LLM) to integrate a bidirectional LSTM (BiLSTM). Methodologically, it proposes segmented temporal encoding and bidirectional temporal aggregation to jointly model visual features and timestamp embeddings, enabling both local action modeling and global temporal dependency capture; it further incorporates multi-level time-aware feature fusion and video–language cross-modal alignment. Key contributions include: (1) the first integration of a BiLSTM architecture into a video LLM; and (2) the construction of IndustryASM—the first long-video benchmark tailored to industrial assembly scenarios. Experiments demonstrate that TemporalVLM achieves state-of-the-art performance across four tasks—dense video captioning, temporal video grounding, video highlight detection, and temporal action segmentation—outperforming prior methods on both the TimeIT and IndustryASM benchmarks.
📝 Abstract
This paper introduces TemporalVLM, a video large language model (video LLM) capable of effective temporal reasoning and fine-grained understanding in long videos. At its core, our approach includes a visual encoder that maps a long-term input video into time-aware features containing both local and global cues. In particular, it first divides the input video into short-term clips, which are jointly encoded with their timestamps into time-sensitive local features. Next, the local features are passed through a bidirectional long short-term memory (BiLSTM) module for global feature aggregation. The extracted time-aware, multi-level features are important for accurate temporal reasoning and fine-grained understanding in long videos. Moreover, to facilitate the evaluation of TemporalVLM, we present a large-scale long-video dataset of industrial assembly processes, namely IndustryASM, which consists of videos recorded on factory floors with actions and timestamps annotated by industrial engineers for time-and-motion studies and temporal action segmentation evaluation. Finally, extensive experiments on long-video datasets, including TimeIT and IndustryASM, show that TemporalVLM outperforms previous methods across temporal reasoning and fine-grained understanding tasks, namely dense video captioning, temporal video grounding, video highlight detection, and temporal action segmentation. To the best of our knowledge, our work is the first to incorporate LSTMs into video LLMs.
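The encoder pipeline described above (short-term clips jointly encoded with timestamps, then a BiLSTM for global aggregation) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the module names, dimensions, and the MLP used for timestamp embedding are assumptions for the sake of the sketch.

```python
import torch
import torch.nn as nn

class TimeAwareEncoder(nn.Module):
    """Hypothetical sketch of a time-aware visual encoder in the spirit of
    TemporalVLM: per-clip features are fused with timestamp embeddings into
    local features, which a BiLSTM then aggregates into global features."""

    def __init__(self, feat_dim=512, time_dim=64, hidden_dim=256):
        super().__init__()
        # Embed each clip's scalar timestamp (assumption: the paper may use
        # a different timestamp encoding scheme).
        self.time_mlp = nn.Sequential(
            nn.Linear(1, time_dim), nn.ReLU(), nn.Linear(time_dim, time_dim)
        )
        # Fuse clip features with timestamp embeddings into local features.
        self.local_proj = nn.Linear(feat_dim + time_dim, hidden_dim)
        # BiLSTM aggregates local features into global, context-aware ones.
        self.bilstm = nn.LSTM(hidden_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, clip_feats, timestamps):
        # clip_feats: (B, T, feat_dim); timestamps: (B, T) in seconds.
        t_emb = self.time_mlp(timestamps.unsqueeze(-1))      # (B, T, time_dim)
        local = self.local_proj(
            torch.cat([clip_feats, t_emb], dim=-1))          # (B, T, hidden)
        global_feats, _ = self.bilstm(local)                 # (B, T, 2*hidden)
        # Both local (short-term) and global (long-term) features are kept,
        # mirroring the multi-level features the paper describes.
        return local, global_feats

# Toy usage: 2 videos, each split into 16 short-term clips.
enc = TimeAwareEncoder()
feats = torch.randn(2, 16, 512)
ts = torch.arange(16, dtype=torch.float32).repeat(2, 1)
local, glob = enc(feats, ts)
print(local.shape, glob.shape)
```

The bidirectional LSTM output has twice the hidden size because forward and backward passes are concatenated, which is what lets each clip's global feature carry context from both earlier and later parts of the video.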