HoloLLM: Multisensory Foundation Model for Language-Grounded Human Sensing and Reasoning

📅 2025-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limited robustness of embodied agents in smart homes, where occlusion, low illumination, and privacy restrictions constrain vision-based behavior understanding and language-grounded interaction, this paper introduces HoloLLM, a multimodal large language model (MLLM) that integrates non-visual sensing modalities: LiDAR, infrared, millimeter-wave (mmWave) radar, and WiFi. Two key challenges stand in the way: the scarcity of aligned sensor-text data and the inherent heterogeneity of physical signal representations. To overcome them, the authors propose a Universal Modality-Injection Projector (UMIP), which enriches pre-aligned modality embeddings with fine-grained, text-aligned features from tailored sensor encoders via coarse-to-fine cross-attention, together with a human-VLM collaborative pipeline that curates paired textual annotations for sensing datasets. Evaluated on two newly established benchmarks, HoloLLM improves language-grounded human perception accuracy by up to 30%, significantly outperforming existing MLLMs, and enables privacy-preserving, robust, language-guided human perception and reasoning under visually degraded conditions.

📝 Abstract
Embodied agents operating in smart homes must understand human behavior through diverse sensory inputs and communicate via natural language. While Vision-Language Models (VLMs) have enabled impressive language-grounded perception, their reliance on visual data limits robustness in real-world scenarios with occlusions, poor lighting, or privacy constraints. In this paper, we introduce HoloLLM, a Multimodal Large Language Model (MLLM) that integrates uncommon but powerful sensing modalities, such as LiDAR, infrared, mmWave radar, and WiFi, to enable seamless human perception and reasoning across heterogeneous environments. We address two key challenges: (1) the scarcity of aligned modality-text data for rare sensors, and (2) the heterogeneity of their physical signal representations. To overcome these, we design a Universal Modality-Injection Projector (UMIP) that enhances pre-aligned modality embeddings with fine-grained, text-aligned features from tailored encoders via coarse-to-fine cross-attention without introducing significant alignment overhead. We further introduce a human-VLM collaborative data curation pipeline to generate paired textual annotations for sensing datasets. Extensive experiments on two newly constructed benchmarks show that HoloLLM significantly outperforms existing MLLMs, improving language-grounded human sensing accuracy by up to 30%. This work establishes a new foundation for real-world, language-informed multisensory embodied intelligence.
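To make the UMIP idea concrete, here is a minimal PyTorch sketch of a coarse-to-fine cross-attention injector, assuming pre-aligned modality tokens act as queries and fine-grained features from a tailored sensor encoder act as keys/values. The module name (CoarseToFineInjector), dimensions, and two-stage layout are illustrative assumptions, not the authors' implementation.

```python
# Sketch of a coarse-to-fine cross-attention projector in the spirit of UMIP.
# Assumed: pre-aligned modality tokens (queries) are enhanced first by a pooled
# coarse summary, then by full-resolution fine features, of the sensor stream.
import torch
import torch.nn as nn

class CoarseToFineInjector(nn.Module):
    """Enhance pre-aligned modality tokens with fine-grained sensor features."""
    def __init__(self, d_model: int = 768, n_heads: int = 8):
        super().__init__()
        # Stage 1: coarse cross-attention over a pooled (low-resolution) summary.
        self.coarse_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Stage 2: fine cross-attention over full-resolution sensor features.
        self.fine_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # Project the enhanced tokens into the LLM's input embedding space.
        self.to_llm = nn.Linear(d_model, d_model)

    def forward(self, aligned_tokens: torch.Tensor, fine_features: torch.Tensor):
        # aligned_tokens: (B, N, d) pre-aligned modality embeddings (queries)
        # fine_features:  (B, M, d) fine-grained output of a tailored encoder
        coarse = fine_features.mean(dim=1, keepdim=True)   # (B, 1, d) summary
        x, _ = self.coarse_attn(aligned_tokens, coarse, coarse)
        x = self.norm1(aligned_tokens + x)                 # residual + norm
        y, _ = self.fine_attn(x, fine_features, fine_features)
        y = self.norm2(x + y)
        return self.to_llm(y)   # tokens ready to prepend to the LLM prompt
```

One plausible reading of "without introducing significant alignment overhead" is reflected here: the queries are already aligned with text, so only this lightweight injector needs training rather than re-aligning each rare modality from scratch.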
Problem

Research questions and friction points this paper is trying to address.

Enabling robust human behavior understanding via diverse non-visual sensors
Overcoming the scarcity of aligned sensor-text data for rare sensing modalities
Handling the heterogeneous physical signal representations of uncommon sensors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates LiDAR, infrared, mmWave, WiFi sensors
Universal Modality-Injection Projector (UMIP) enriches pre-aligned embeddings via coarse-to-fine cross-attention
Human-VLM collaborative pipeline curates paired textual annotations (sketched below)
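Below is a hedged sketch of what the human-VLM collaborative annotation loop could look like. The helper names (draft_caption_with_vlm, human_review) and the JSON output format are hypothetical placeholders; the paper states only that a VLM and human annotators jointly produce paired textual annotations for the sensing datasets.

```python
# Hypothetical human-VLM collaborative curation loop: a VLM drafts an activity
# caption from the RGB clip recorded alongside the sensors, a human verifies or
# corrects it, and the verified text is paired with the co-recorded signals.
import json

def draft_caption_with_vlm(video_clip_path: str) -> str:
    """Hypothetical VLM call that drafts an activity description."""
    raise NotImplementedError("plug in your VLM client here")

def human_review(draft: str) -> str:
    """Hypothetical human-in-the-loop step: annotators correct the draft."""
    raise NotImplementedError("plug in your annotation tool here")

def curate(samples: list[dict], out_path: str) -> None:
    annotations = []
    for sample in samples:
        draft = draft_caption_with_vlm(sample["rgb_clip"])
        final = human_review(draft)  # keep only human-verified text
        # Pair the verified caption with every co-recorded sensor stream
        # (LiDAR, infrared, mmWave radar, WiFi).
        annotations.append({"sensor_files": sample["sensor_files"],
                            "caption": final})
    with open(out_path, "w") as f:
        json.dump(annotations, f, indent=2)
```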