🤖 AI Summary
This study addresses the end-to-end generation of clinical-grade natural language descriptions of psychological symptoms from multimodal health time-series data (e.g., accelerometer, heart rate), overcoming two key bottlenecks: LLMs’ inability to directly model long-duration sensor streams and the scarcity of sensor–text paired data. We propose a patch-level temporal encoder tailored for health sensing and introduce the first large-scale sensor–psychological symptom QA dataset (>100K samples), enabling direct alignment from raw signals to clinical narratives without intermediate representations. Our framework integrates instruction-tuned LLMs, EMA-guided sensor–text alignment distillation, and cross-modal representation projection. Quantitatively, our method significantly outperforms baselines on standard NLP metrics and symptom severity prediction accuracy. Clinically, 13 psychiatrists independently validated the generated narratives for completeness, interpretability, and diagnostic utility.
📝 Abstract
Multimodal health sensing offers rich behavioral signals for assessing mental health, yet translating these numerical time-series measurements into natural language remains challenging. Current LLMs cannot natively ingest long-duration sensor streams, and paired sensor-text datasets are scarce. To address these challenges, we introduce LENS, a framework that aligns multimodal sensing data with language models to generate clinically grounded mental-health narratives. LENS first constructs a large-scale dataset by transforming Ecological Momentary Assessment (EMA) responses related to depression and anxiety symptoms into natural-language descriptions, yielding over 100,000 sensor-text QA pairs from 258 participants. To enable native time-series integration, we train a patch-level encoder that projects raw sensor signals directly into an LLM's representation space. Our results show that LENS outperforms strong baselines on standard NLP metrics and task-specific measures of symptom-severity accuracy. A user study with 13 mental-health professionals further indicates that LENS-produced narratives are comprehensive and clinically meaningful. Ultimately, our approach advances LLMs as interfaces for health sensing, providing a scalable path toward models that can reason over raw behavioral signals and support downstream clinical decision-making.