AI Summary
In multi-speaker scenarios, users frequently miss critical conversational content.
Method: This paper proposes an intelligent auditory memory system for personal spatial computing. It introduces the first integration of beamforming with retrieval-augmented generation (RAG): spatial audio is captured with a microphone array and separated by speaker using directional beamforming; Whisper-based transcription and sentence encoding jointly construct a temporally aligned embedding database; and GPT-4o-mini generates spatiotemporally tagged contrastive summaries, enabling natural-language querying and interactive 3D audio playback.
Contribution/Results: We present the first end-to-end auditory memory framework supporting semantic retrieval, context-aware summarization, and spatial audio reconstruction. Evaluated in realistic multi-speaker environments, it achieves high-precision topic retrieval and interpretable traceability, significantly enhancing users' ability to semantically recall missed dialogue.
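The directional separation step described above can be illustrated with a minimal delay-and-sum beamformer. This is a generic sketch of the technique, not the paper's implementation: the array geometry, sample rate, and function name below are illustrative assumptions.

```python
import numpy as np

def delay_and_sum(signals, mic_positions, direction, fs, c=343.0):
    """Steer a linear mic array toward a plane wave arriving from
    `direction` (radians from the array axis) by time-aligning the
    channels and averaging them.

    signals:       (n_mics, n_samples) time-domain audio
    mic_positions: (n_mics,) mic coordinates along the array axis, metres
    """
    n_mics, n_samples = signals.shape
    # Arrival-time offset of each mic relative to the array origin
    delays = mic_positions * np.cos(direction) / c
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    out = np.zeros(n_samples)
    for m in range(n_mics):
        # Fractional time-advance in the frequency domain undoes the
        # propagation delay, so the channels sum coherently
        spec = np.fft.rfft(signals[m]) * np.exp(2j * np.pi * freqs * delays[m])
        out += np.fft.irfft(spec, n=n_samples)
    return out / n_mics
```

Steering toward a talker's direction reinforces that talker's signal while off-axis sources add incoherently, which is what yields the per-speaker streams fed to Whisper in the pipeline above.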
Abstract
We present Beamforming-LLM, a system that enables users to semantically recall conversations they may have missed in multi-speaker environments. The system combines spatial audio capture using a microphone array with retrieval-augmented generation (RAG) to support natural language queries such as "What did I miss when I was following the conversation on dogs?" Directional audio streams are separated using beamforming, transcribed with Whisper, and embedded into a vector database using sentence encoders. Upon receiving a user query, semantically relevant segments are retrieved, temporally aligned with non-attended segments, and summarized using a lightweight large language model (GPT-4o-mini). The result is a user-friendly interface that provides contrastive summaries, spatial context, and timestamped audio playback. This work lays the foundation for intelligent auditory memory systems and has broad applications in assistive technology, meeting summarization, and context-aware personal spatial computing.
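The retrieval step can be sketched with a toy in-memory store. The segments, timestamps, and the bag-of-words `embed` function below are illustrative stand-ins for the system's sentence-encoder embeddings and vector database, not the paper's implementation:

```python
import numpy as np
from collections import Counter

# Toy timestamped, speaker-attributed transcript store; a unit-norm
# bag-of-words vector stands in for sentence-encoder embeddings.
SEGMENTS = [
    {"t": "00:01:10", "speaker": "A", "text": "my dog loves long walks in the park"},
    {"t": "00:02:45", "speaker": "B", "text": "the quarterly budget review is next week"},
    {"t": "00:04:02", "speaker": "C", "text": "training a puppy takes patience and treats"},
]

def embed(text, vocab):
    """Unit-norm term-count vector over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    v = np.array([counts[w] for w in vocab], dtype=float)
    norm = np.linalg.norm(v)
    return v / norm if norm else v

def query(question, segments, top_k=2):
    """Rank segments by cosine similarity to the question."""
    vocab = sorted({w for s in segments for w in s["text"].lower().split()}
                   | set(question.lower().split()))
    qv = embed(question, vocab)
    return sorted(segments,
                  key=lambda s: float(qv @ embed(s["text"], vocab)),
                  reverse=True)[:top_k]
```

In a real deployment, `embed` would be a sentence encoder and `SEGMENTS` a vector database; the retrieved segments, with their timestamps and speaker directions, would then be passed to the LLM to produce the contrastive summaries and spatial playback described above.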