🤖 AI Summary
To address low prediction accuracy and high bandwidth overhead in 360-degree video streaming, this paper proposes a Deep Hybrid Saliency Model (DHSM) that fuses visual features extracted by a spherical CNN with multimodal cues (motion and audio), accommodating the geometric properties of omnidirectional video. The model comprises three key components: spherical frame preprocessing, multiscale saliency feature fusion, and viewport-constrained post-processing. Evaluated on the 360RAT dataset, DHSM reduces the KL divergence between predicted saliency maps and subjective gaze annotations by 23.6%, outperforming state-of-the-art methods. Integrated into viewport-adaptive streaming and intelligent cropping pipelines, it cuts average bandwidth consumption by 31% while improving user Quality of Experience (QoE). The core contribution is an end-to-end, multimodal saliency modeling framework designed specifically for panoramic video, enabling accurate, geometry-aware, and computationally efficient saliency prediction.
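The KL-divergence evaluation mentioned above compares a predicted saliency map against a gaze-derived ground-truth map after normalizing both to probability distributions. A minimal NumPy sketch follows; the function name and the epsilon smoothing are our own illustrative choices, not details from the paper:

```python
import numpy as np

def kl_divergence(pred, gt, eps=1e-8):
    """KL(gt || pred) between two saliency maps of equal shape.

    Both maps are normalized to sum to 1; eps avoids log(0) and
    division by zero. Lower values mean better agreement.
    """
    pred = pred / (pred.sum() + eps)
    gt = gt / (gt.sum() + eps)
    # Penalizes predictions that assign low mass to fixated regions
    return float(np.sum(gt * np.log(gt / (pred + eps) + eps)))
```

Identical maps yield a divergence near zero, while mismatched maps yield a positive score, which is the sense in which a 23.6% reduction indicates closer agreement with subjective annotations.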
📝 Abstract
The main goal of this project is to design a new model that predicts regions of interest (ROIs) in 360$^\circ$ videos. ROIs play an important role in 360$^\circ$ video streaming: for example, they are used to predict viewports and to intelligently crop videos for live streaming, so that less bandwidth is consumed. Predicting viewports in advance reduces head movement while streaming and watching a video through a head-mounted display, while intelligent cropping improves streaming efficiency and enhances the quality of the viewing experience. This report describes the task of identifying ROIs, for which we design, train, and test a hybrid saliency model; throughout, we use salient regions to represent regions of interest. The method proceeds as follows: preprocessing the video to obtain frames, applying the hybrid saliency model to predict salient regions, and post-processing the model's output predictions to obtain the ROI for each frame. Finally, we compare the performance of the proposed method against the subjective annotations of the 360RAT dataset.
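The preprocess, predict, post-process pipeline described above can be sketched as follows. Everything here is an illustrative assumption rather than the report's implementation: the function names are hypothetical, a centered Gaussian blob stands in for the learned hybrid saliency model, and post-processing is reduced to a simple thresholded bounding box:

```python
import numpy as np

def preprocess(frames):
    """Stage 1 (hypothetical): scale equirectangular frames to [0, 1]."""
    return [f.astype(np.float32) / 255.0 for f in frames]

def predict_saliency(frame):
    """Stage 2 placeholder: a centered Gaussian blob stands in for the
    hybrid saliency model's learned prediction."""
    h, w = frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    sigma = min(h, w) / 4.0
    return np.exp(-((ys - h / 2.0) ** 2 + (xs - w / 2.0) ** 2)
                  / (2.0 * sigma ** 2))

def postprocess(saliency, thresh=0.5):
    """Stage 3 (hypothetical): threshold the map and return an ROI
    bounding box (y0, x0, y1, x1) around the salient region."""
    mask = saliency >= thresh * saliency.max()
    ys, xs = np.nonzero(mask)
    return int(ys.min()), int(xs.min()), int(ys.max()), int(xs.max())

# One synthetic 64x128 "frame" kept deterministic for illustration
frames = [np.full((64, 128, 3), 128, dtype=np.uint8)]
rois = [postprocess(predict_saliency(f)) for f in preprocess(frames)]
```

In the actual system the middle stage would be the trained hybrid model, and the ROI boxes would feed the viewport-prediction and intelligent-cropping uses described above.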