🤖 AI Summary
Large language models (LLMs) lack fundamental perceptual grounding in the physical world, particularly when reasoning about acoustic phenomena governed by physical principles such as the Doppler effect, multipath propagation, and spatial geometric constraints. To address this, we propose ACORN, a physics-informed framework that introduces AQA-PHY, the first audio question-answering dataset generated via a physically grounded sound simulator. ACORN jointly models the magnitude and phase components of audio signals, incorporates a phase-sensitive audio encoder, and integrates physics-aware priors into a multimodal LLM architecture with explicit audio–text alignment. Evaluated on line-of-sight detection, Doppler shift estimation, and direction-of-arrival estimation, ACORN significantly outperforms existing baselines. Our results demonstrate that audition is an effective modality for endowing LLMs with foundational physical awareness. This work pioneers the research direction of *physics-perceptive audio-language modeling*.
📝 Abstract
Large Language Models (LLMs) have shown remarkable capabilities in text and multimodal processing, yet they fundamentally lack physical awareness, that is, an understanding of real-world physical phenomena. In this work, we present ACORN, a framework that teaches LLMs physical awareness through sound, focusing on fundamental physical phenomena such as the Doppler effect, the multipath effect, and spatial relationships. To overcome data scarcity, ACORN introduces a physics-based simulator that combines real-world sound sources with controlled physical channels to generate diverse training data. Using this simulator, we build AQA-PHY, a comprehensive Audio Question-Answer dataset, and propose an audio encoder that processes both magnitude and phase information. By connecting our audio encoder to state-of-the-art LLMs, we demonstrate reasonable results in both simulated and real-world tasks, such as line-of-sight detection, Doppler effect estimation, and Direction-of-Arrival estimation, paving the way for LLMs to understand the physical world.
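To make two of the physical quantities named in the abstract concrete, here is a minimal Python sketch. It is not ACORN's simulator or encoder: the function names, the 343 m/s speed of sound, and the naive DFT are all assumptions chosen for illustration. It shows the textbook Doppler relation for a moving source and a stationary listener, and why phase carries information that magnitude alone does not.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s in air at about 20 degrees C (assumed value)


def doppler_observed_frequency(f_source, v_source, c=SPEED_OF_SOUND):
    """Frequency heard by a stationary listener.

    v_source > 0 means the source moves toward the listener,
    v_source < 0 means it moves away (hypothetical sign convention).
    """
    if abs(v_source) >= c:
        raise ValueError("source speed must be below the speed of sound")
    return f_source * c / (c - v_source)


def magnitude_and_phase(frame):
    """Naive DFT of one real-valued frame.

    Returns per-bin magnitude and phase (radians) for bins 0..N/2.
    """
    n = len(frame)
    mags, phases = [], []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)
        )
        mags.append(abs(coeff))
        phases.append(cmath.phase(coeff))
    return mags, phases


# A tone at bin 2 of a 16-sample frame, and the same tone delayed by one sample.
N, K = 16, 2
tone = [math.cos(2 * math.pi * K * t / N) for t in range(N)]
delayed = [math.cos(2 * math.pi * K * (t - 1) / N) for t in range(N)]

mags_a, phases_a = magnitude_and_phase(tone)
mags_b, phases_b = magnitude_and_phase(delayed)
```

In this toy case the two frames have the same magnitude at bin 2, while the phase at that bin differs by the delay term `2 * pi * K / N`. Inter-signal delays of this kind are what direction-of-arrival methods typically rely on, which is one plausible reading of why the abstract stresses an encoder that processes phase as well as magnitude.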