🤖 AI Summary
This work addresses the imbalance between short-term and long-term dynamic modeling in video salient object detection. We propose a dual-stream Transformer architecture that jointly encodes raw video frames and historical saliency maps. Key innovations include spatiotemporal patch tokenization, saliency-guided adaptive input masking, and explicit integration of saliency priors, enabling synergistic modeling of short-term local changes and long-term dynamic evolution. We quantitatively reveal, for the first time, the critical impact of the ratio of short-term to long-term features on detection performance; our masking mechanism further sharpens the model's sensitivity to deviations from its previous predictions. Evaluated on multiple mainstream video saliency benchmarks, the method achieves state-of-the-art performance. Ablation studies demonstrate that extending the long-term context significantly improves first-frame detection accuracy, whereas the short-term context exhibits a clear performance saturation threshold.
📝 Abstract
The role of long- and short-term dynamics in salient object detection in videos is under-researched. We present a Transformer-based approach that learns a joint representation of video frames and past saliency information. Our model embeds long- and short-term information to detect dynamically shifting saliency in video. We provide the model with a stream of video frames and past saliency maps, the latter serving as a prior for the next prediction, and extract spatiotemporal tokens from both modalities. Decomposing the frame sequence into tokens lets the model capture short-term information within each token while making long-term connections between tokens across the sequence. The core of the system is a dual-stream Transformer architecture that processes the two extracted sequences independently before fusing the modalities. Additionally, we apply a saliency-based masking scheme to the input frames to learn an embedding that facilitates the recognition of deviations from previous outputs. We observe that the additional prior information aids the first detection of the salient location. Our findings indicate that the ratio of long- to short-term spatiotemporal features directly impacts the model's performance: increasing the short-term context helps only up to a saturation threshold, whereas expanding the long-term context yields substantial gains.
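The dual-stream design described above can be illustrated with a minimal sketch: two independent Transformer encoders, one over spatiotemporal frame tokens and one over past-saliency tokens, whose outputs are concatenated and jointly processed before a per-token saliency prediction. All class and parameter names here are illustrative assumptions, not the authors' implementation, and the token embedding step is elided.

```python
import torch
import torch.nn as nn

class DualStreamSaliencySketch(nn.Module):
    """Illustrative sketch (hypothetical names/sizes) of a dual-stream
    Transformer: frame tokens and past-saliency tokens are encoded
    independently, then fused for joint spatiotemporal modeling."""

    def __init__(self, dim=64, heads=4, depth=2):
        super().__init__()
        def encoder(layers):
            return nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True),
                num_layers=layers)
        self.frame_encoder = encoder(depth)     # stream 1: frame tokens
        self.saliency_encoder = encoder(depth)  # stream 2: prior saliency tokens
        self.fusion = encoder(1)                # joint modeling after fusion
        self.head = nn.Linear(dim, 1)           # per-token saliency logit

    def forward(self, frame_tokens, saliency_tokens):
        f = self.frame_encoder(frame_tokens)
        s = self.saliency_encoder(saliency_tokens)
        # Fuse by concatenating along the token axis and attending jointly.
        fused = self.fusion(torch.cat([f, s], dim=1))
        # Predict saliency only at the frame-token positions.
        return self.head(fused[:, :frame_tokens.size(1)])

model = DualStreamSaliencySketch()
frames = torch.randn(2, 16, 64)   # (batch, frame tokens, embed dim)
priors = torch.randn(2, 8, 64)    # (batch, past-saliency tokens, embed dim)
out = model(frames, priors)
print(out.shape)  # torch.Size([2, 16, 1])
```

In this sketch, long-term context corresponds to the number of tokens attended to across the sequence, while short-term context is whatever each spatiotemporal token spans; the paper's ablation varies exactly that ratio.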