Real-time Cross-modal Cybersickness Prediction in Virtual Reality

📅 2025-01-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Cybersickness remains a prevalent challenge in virtual reality (VR), impairing user comfort and acceptance. Method: This paper proposes a lightweight, real-time cross-modal prediction model that dynamically assesses VR-induced discomfort by jointly modeling VR video streams and multi-source biosignals, including eye movement, head motion, and physiological signals (skin temperature, galvanic skin response, and blood pressure). We introduce a video-aware biosignal representation learning scheme and a hybrid architecture that integrates sparse self-attention Transformers with Parallelized Pooling Temporal Segment Networks (PP-TSN) to overcome bottlenecks in multimodal temporal modeling and real-time inference on edge devices. The model comprises a sparse encoder, PP-TSN-based video feature extraction, cross-modal fusion, and a lightweight joint training module. Contribution/Results: On a public benchmark dataset, the model achieves 93.13% accuracy using VR video alone, substantially outperforming CNN- and LSTM-based baselines, while enabling sub-millisecond end-to-end inference latency, paving the way for more comfortable and usable VR.
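No code accompanies this page, so the following PyTorch sketch is only one plausible reading of the bio-signal branch described above: a Transformer encoder whose self-attention is restricted to a local temporal window, one common sparsification pattern (the paper's exact sparsity scheme is not specified here). All names (`LocalSparseSelfAttention`, `BiosignalEncoder`, `window`, `dim`) are illustrative assumptions, not the authors' API.

```python
import torch
import torch.nn as nn

class LocalSparseSelfAttention(nn.Module):
    """Windowed self-attention: each timestep attends only to neighbors
    within +/- `window` steps, cutting cost from O(T^2) toward O(T*w)."""
    def __init__(self, dim, num_heads=4, window=16):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.window = window

    def forward(self, x):                      # x: (B, T, dim)
        T = x.size(1)
        idx = torch.arange(T, device=x.device)
        # Boolean attn_mask: True marks positions that may NOT be attended.
        mask = (idx[None, :] - idx[:, None]).abs() > self.window
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return out

class BiosignalEncoder(nn.Module):
    """Sparse-attention Transformer over a multi-source bio-signal stream
    (eye movement, head motion, physiological channels) shaped (B, T, C)."""
    def __init__(self, in_channels, dim=128, depth=2):
        super().__init__()
        self.proj = nn.Linear(in_channels, dim)
        self.blocks = nn.ModuleList(
            nn.ModuleDict({
                "norm1": nn.LayerNorm(dim),
                "attn": LocalSparseSelfAttention(dim),
                "norm2": nn.LayerNorm(dim),
                "ffn": nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                     nn.Linear(4 * dim, dim)),
            })
            for _ in range(depth)
        )

    def forward(self, x):                      # x: (B, T, in_channels)
        h = self.proj(x)
        for blk in self.blocks:                # pre-norm residual blocks
            h = h + blk["attn"](blk["norm1"](h))
            h = h + blk["ffn"](blk["norm2"](h))
        return h                               # (B, T, dim) bio-signal tokens
```

The windowed mask keeps attention cost roughly linear in sequence length, which is consistent with the summary's emphasis on real-time inference on edge devices.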

📝 Abstract
Cybersickness remains a significant barrier to the widespread adoption of immersive virtual reality (VR) experiences, as it can greatly disrupt user engagement and comfort. Research has shown that cybersickness is significantly reflected in head and eye tracking data, along with other physiological data (e.g., TMP, EDA, and BMP). Despite the application of deep learning techniques such as CNNs and LSTMs, these models often struggle to capture the complex interactions between multiple data modalities and lack the capacity for real-time inference, limiting their practical application. Addressing this gap, we propose a lightweight model that leverages a Transformer-based encoder with sparse self-attention to process bio-signal features and a PP-TSN network for video feature extraction. These features are then integrated via a cross-modal fusion module, creating a video-aware bio-signal representation that supports cybersickness prediction from both visual and bio-signal inputs. Our model, trained within a lightweight framework, was validated on a public dataset containing eye and head tracking data, physiological data, and VR video, and demonstrated state-of-the-art performance in cybersickness prediction, achieving a high accuracy of 93.13% using only VR video inputs. These findings suggest that our approach not only enables effective, real-time cybersickness prediction but also addresses the longstanding issue of modality interaction in VR environments. This advancement provides a foundation for future research on multimodal data integration in VR, potentially leading to more personalized, comfortable, and widely accessible VR experiences.
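As with the sketch above, the abstract does not give implementation details, so the following is a hedged illustration of how a TSN-style segment branch and a cross-modal fusion module could produce the "video-aware bio-signal representation" described: one frame is sampled per segment, encoded by a shared 2D CNN stand-in, and the bio-signal tokens attend to the resulting segment tokens via cross-attention. The tiny convolutional backbone, segment layout, and all names are assumptions rather than the authors' PP-TSN implementation.

```python
import torch
import torch.nn as nn

class SegmentVideoBranch(nn.Module):
    """TSN-style branch: split the clip into S segments, encode one frame
    per segment with a shared 2D CNN, and pool each segment into a token."""
    def __init__(self, dim=128):
        super().__init__()
        self.backbone = nn.Sequential(          # stand-in for a real CNN
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim),
        )

    def forward(self, frames):                  # frames: (B, S, 3, H, W)
        B, S = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1))  # (B*S, dim), shared CNN
        return feats.view(B, S, -1)                  # (B, S, dim) tokens

class CrossModalFusion(nn.Module):
    """Bio-signal tokens query video segment tokens via cross-attention,
    yielding a video-aware bio-signal representation used for prediction."""
    def __init__(self, dim=128, num_heads=4, num_classes=2):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, bio_tokens, video_tokens):
        # bio_tokens: (B, Tb, dim); video_tokens: (B, S, dim)
        attended, _ = self.cross_attn(bio_tokens, video_tokens, video_tokens)
        fused = self.norm(bio_tokens + attended)     # residual fusion
        return self.head(fused.mean(dim=1))          # (B, num_classes) logits
```

A forward pass would then look like `CrossModalFusion()(BiosignalEncoder(in_channels=8)(bio), SegmentVideoBranch()(frames))`, with `bio` shaped `(B, T, 8)` and `frames` shaped `(B, S, 3, H, W)`; the channel count and shapes are likewise assumptions.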
Problem

Research questions and friction points this paper is trying to address.

Virtual Reality
Cybersickness Prediction
Sensory Stimulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Real-time Prediction
Virtual Reality Sickness
Video Information Processing
🔎 Similar Papers
No similar papers found.
Yitong Zhu
The Hong Kong University of Science and Technology (Guangzhou)
human factors engineering · multi-modal learning · affective computing
Tangyao Li
The Hong Kong University of Science and Technology (Guangzhou)
Yuyang Wang
The Hong Kong University of Science and Technology (Guangzhou)