🤖 AI Summary
Existing foundation models for physiological signals are typically trained with masked reconstruction or contrastive learning. Masked reconstruction copes poorly with the inherent stochasticity of these signals, and contrastive learning depends on positive-pair definitions even though the semantic invariances of physiological signals are poorly understood. This work proposes Hypnos, a multimodal sleep foundation model trained with next-token prediction on over 20,000 overnight polysomnography recordings that jointly models eight modalities, including EEG, ECG, and respiratory signals. Each modality is discretized into token streams via residual vector quantization (RVQ), and an autoregressive RQ-Transformer is trained to predict the next token across all modalities in parallel, so training needs neither positive pairs nor masked reconstruction, and the model accepts input from any subset of the supported modalities. In experiments, Hypnos matches strong supervised baselines in sleep staging while using only 1% of the labeled data, and it surpasses a dedicated ECG foundation model in daytime atrial fibrillation detection, indicating that its learned representations generalize beyond sleep.
📝 Abstract
Foundation models offer a promising route to compress multi-modal physiological signals into compact representations of human health, with broad applications across sleep medicine, cardiology, neurology and other healthcare domains. Existing models have typically been trained with masked-reconstruction or contrastive objectives. However, masked reconstruction may be poorly suited to the stochastic nature of these signals, while contrastive approaches rely on positive-pair definitions despite the semantic invariances of physiological signals being poorly understood. In this work, we show that next-token prediction is a simple and scalable alternative. We develop Hypnos, a multi-modal sleep foundation model trained using eight different sensing modalities (e.g. EEG, ECG, respiratory signals) drawn from over 20,000 overnight polysomnography recordings. We tokenize each modality into streams of discrete tokens using residual vector quantization, then train a large auto-regressive RQ-Transformer to jointly predict the next token across all modalities in parallel. After training, Hypnos can be applied to continuous streams of sensor data from any subset of supported modalities, generating embeddings for downstream tasks. Across a range of benchmarks, Hypnos significantly outperforms existing foundation models. In sleep stage classification, we match the performance of strong supervised baselines on held-out test sets whilst using \(100\times\) less labelled data. Hypnos even generalises to daytime physiology, surpassing a dedicated ECG foundation model at detecting atrial fibrillation. Our results demonstrate that next-token prediction is a strong self-supervised objective for representation learning from multi-modal physiological signals.
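To make the tokenization step concrete, the following is a minimal NumPy sketch of residual vector quantization. It is an illustration under stated assumptions, not the Hypnos tokenizer: the codebooks here are random rather than learned, and the frame dimension, codebook size, number of levels, and the function names `rvq_encode` and `rvq_decode` are all hypothetical choices made for the example.

```python
import numpy as np


def rvq_encode(x, codebooks):
    """Quantize each row of x level by level.

    Each level picks the codeword nearest to the current residual,
    then subtracts it, so later levels encode what earlier ones missed.
    Returns the token indices and the final (unquantized) residual.
    """
    residual = np.asarray(x, dtype=np.float64).copy()
    tokens = []
    for cb in codebooks:  # cb has shape (codebook_size, dim)
        # Squared Euclidean distance from every residual to every codeword.
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)
        tokens.append(idx)
        residual = residual - cb[idx]
    return np.stack(tokens, axis=1), residual  # tokens: (n_frames, n_levels)


def rvq_decode(tokens, codebooks):
    """Reconstruct frames by summing the selected codeword at each level."""
    return sum(cb[tokens[:, level]] for level, cb in enumerate(codebooks))


# Hypothetical setup: 16 signal frames of dimension 4, three levels of
# 8 codewords each. Real codebooks would be learned, not sampled at random.
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 4))
codebooks = [rng.normal(size=(8, 4)) * 0.5 ** level for level in range(3)]

tokens, residual = rvq_encode(x, codebooks)
reconstruction = rvq_decode(tokens, codebooks)
```

Each frame becomes a short tuple of discrete indices, one per quantization level. A sequence of such tuples is the kind of token stream an autoregressive model can be trained on with a next-token objective.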