🤖 AI Summary
This work addresses the limited accuracy of saliency estimation in 360-degree videos by proposing a novel Transformer-based architecture. It is the first to adapt the SegFormer encoder to this task, integrating a custom-designed decoder and a viewing-center bias mechanism to effectively model human gaze behavior. The proposed method significantly outperforms state-of-the-art approaches on three benchmark datasets—Sport360, PVS-HM, and VR-EyeTracking—achieving relative improvements of 8.4%, 2.5%, and 18.6% in Pearson Correlation Coefficient, respectively. These results demonstrate its superior capability in capturing attention patterns, thereby providing a more accurate attention prior for viewport prediction and immersive content optimization.
📝 Abstract
Saliency estimation has received growing attention in recent years due to its importance in a wide range of applications. In the context of 360-degree video, it is particularly valuable for tasks such as viewport prediction and immersive content optimization. In this paper, we propose SalFormer360, a novel transformer-based saliency estimation model for 360-degree videos. Our approach combines an existing encoder, SegFormer, with a custom decoder. SegFormer was originally developed for 2D segmentation tasks and is fine-tuned here to adapt it to 360-degree content. To further improve prediction accuracy, we incorporate a viewing-center bias that reflects where users tend to look in 360-degree environments. Extensive experiments on the three largest benchmark datasets for saliency estimation demonstrate that SalFormer360 outperforms existing state-of-the-art methods: in terms of Pearson Correlation Coefficient, it achieves 8.4% higher performance on Sport360, 2.5% on PVS-HM, and 18.6% on VR-EyeTracking compared to the previous state of the art.
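The abstract does not spell out how the viewing-center bias is realized. One common way such a prior is implemented — shown here purely as an illustrative sketch, not the paper's actual method — is a Gaussian bump centered on the equirectangular frame, blended with the network's predicted saliency map; all function names and parameters below are assumptions:

```python
import numpy as np

def equirectangular_center_bias(height, width, sigma=0.3):
    """Illustrative center-bias prior for an equirectangular saliency map.

    Viewers of 360-degree video tend to fixate near the front-facing
    equator, so a Gaussian centered in the frame is a common prior.
    `sigma` (as a fraction of frame extent) is a hypothetical parameter,
    not taken from the paper.
    """
    ys = np.linspace(-0.5, 0.5, height)[:, None]   # vertical offsets
    xs = np.linspace(-0.5, 0.5, width)[None, :]    # horizontal offsets
    bias = np.exp(-(xs**2 + ys**2) / (2 * sigma**2))
    return bias / bias.max()                       # peak normalized to 1

def apply_center_bias(saliency, weight=0.5):
    """Blend a predicted saliency map with the center-bias prior."""
    bias = equirectangular_center_bias(*saliency.shape)
    combined = (1 - weight) * saliency + weight * bias
    return combined / combined.sum()               # renormalize to a distribution
```

In practice such a prior can be applied as a post-hoc blend (as above) or learned end-to-end inside the decoder; the blending weight here is a free parameter, not a value reported by the authors.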