TraRA: Trajectory-level Recognition Aggregation for Video Text Spotting in Urban Surveillance

📅 2026-06-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses text recognition in urban surveillance videos, where motion blur, occlusion, and scale variation degrade performance and independent frame-level processing yields inconsistent, inaccurate results. The authors propose TraRA, a plug-and-play method that performs text recognition at the trajectory level. TraRA refines noisy trajectories through temporal clustering, then uses a vision-language model fine-tuned with Low-Rank Adaptation to fuse visual cues with linguistic context across frames. On four public benchmarks (RoadText, BOVText, ArTVideo, and ICDAR15), the method consistently improves tracking and recognition performance over state-of-the-art video text spotting approaches, remaining robust under challenging surveillance conditions.

📝 Abstract

Video Text Spotting (VTS) is essential for urban surveillance and intelligent transportation systems, enabling automated reading of street signs, vehicle markings, and scene text in video streams. However, reliable recognition remains challenging because dynamic factors common in surveillance video, including motion blur, occlusion, and scale variation, degrade frame-level recognition. Existing VTS methods typically perform recognition independently on each frame, leading to inconsistent and inaccurate results across sequences. To address these limitations, we propose TraRA (Trajectory-level Recognition Aggregation for VTS), a plug-and-play method that performs trajectory-level text recognition by leveraging temporal and multimodal consistency. TraRA integrates two key modules: (1) Temporal Clustering and (2) Vision-Language Aggregation. The former refines noisy trajectories by grouping temporally and visually coherent text instances, while the latter employs a vision-language model enhanced with Low-Rank Adaptation (LoRA) to fuse visual cues with linguistic context across frames. By aggregating information over entire text trajectories, TraRA achieves robust text recognition even under challenging surveillance conditions. Extensive experiments on four public benchmarks, including road and urban scene datasets (RoadText, BOVText, ArTVideo, and ICDAR15), demonstrate that TraRA consistently improves tracking and recognition performance over state-of-the-art VTS methods. The source code is available at https://github.com/trid2912/TraRA.
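
To make the trajectory-level idea concrete, the following is a minimal illustrative sketch, not the authors' implementation. It splits one noisy trajectory into temporally and visually coherent groups, then picks a single transcription per group. All names (`Observation`, `temporal_clusters`, `aggregate_text`) and the thresholds `max_gap` and `min_sim` are hypothetical. A simple confidence-weighted vote stands in for the paper's LoRA-enhanced vision-language aggregation, which is a different and more capable technique.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Observation:
    frame: int            # frame index of the detection
    feature: list[float]  # visual embedding of the cropped text region
    text: str             # frame-level recognition result
    confidence: float     # recognizer confidence in [0, 1]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def temporal_clusters(track, max_gap=5, min_sim=0.8):
    """Split a trajectory into groups that are close in time and appearance.

    Assumption: an observation joins the current group only if it is within
    max_gap frames of the group's last member and visually similar to it.
    """
    clusters = []
    for obs in sorted(track, key=lambda o: o.frame):
        if clusters:
            last = clusters[-1][-1]
            close_in_time = obs.frame - last.frame <= max_gap
            if close_in_time and cosine(obs.feature, last.feature) >= min_sim:
                clusters[-1].append(obs)
                continue
        clusters.append([obs])
    return clusters


def aggregate_text(cluster):
    """Confidence-weighted vote over the frame-level transcriptions."""
    scores = defaultdict(float)
    for obs in cluster:
        scores[obs.text] += obs.confidence
    return max(scores, key=scores.get)


# Toy trajectory: one blurred frame misreads "STOP", and an identity
# switch attaches a different sign ("EXIT") to the same track.
track = [
    Observation(0, [1.0, 0.0], "STOP", 0.9),
    Observation(1, [0.9, 0.1], "ST0P", 0.4),
    Observation(2, [1.0, 0.1], "STOP", 0.8),
    Observation(3, [0.0, 1.0], "EXIT", 0.7),
]
clusters = temporal_clusters(track)
labels = [aggregate_text(c) for c in clusters]
```

In this toy case the visually dissimilar fourth observation is split into its own group, and the low-confidence misreading in the first group is outvoted by its neighbours. Frame-level independent recognition would have kept the error.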

Problem

Research questions and friction points this paper is trying to address.

Video Text Spotting

motion blur

occlusion

scale variation

frame-level recognition

Innovation

Methods, ideas, or system contributions that make the work stand out.

Trajectory-level Recognition

Temporal Clustering

Vision-Language Aggregation