Titans-as-a-Layer: Test-Time Memory for Conversational Speech Emotion Recognition

📅 2026-06-07
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the challenge of modeling speakers' habitual acoustic ranges and cross-utterance emotional dynamics in conversational speech emotion recognition. To this end, the authors propose a plug-and-play Memory-as-a-Layer adapter that, without altering the architecture of large audio language models, writes dialogue history into a lightweight neural memory during inference and reads it back as residual context aligned with audio tokens. This approach constitutes the first implementation of test-time neural memory as a residual contextual mechanism for emotion recognition, substantially enhancing dialogue-level emotional understanding. Experimental results demonstrate consistent performance gains across multiple audio language models and emotion benchmark datasets, validating the effectiveness of the proposed mechanism.
📝 Abstract
Speech emotion recognition (SER) is commonly formulated as utterance-level classification, although conversational emotion depends on a speaker's usual vocal range and the emotional context established by previous utterances. Speech-language models provide strong pretrained acoustic and semantic representations, and fine-tuning can adapt them to SER labels, but this mechanism still lacks per-dialogue state. We study whether test-time neural memory can supply this missing context while leaving the large audio language model (LALM) backbone intact. Building on Titans, we introduce a plug-and-play Memory-as-a-Layer (MAL) adapter that writes dialogue history into a small neural memory and reads it back as an audio-token-aligned residual update, avoiding changes to the host model's token positions. Across evaluations on different audio LLMs and emotion recognition datasets, our design improves SER performance across different evaluation metrics, supporting test-time memory as a residual contextual mechanism for conversational SER.
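The write-then-read idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a linear memory matrix in place of a small neural network, fixed scalar hyperparameters in place of learned gates, and a Titans-style update (momentum over the gradient of a key-to-value reconstruction loss, plus weight decay as forgetting). The names `LinearMemory`, `memory_as_layer`, and `gate` are hypothetical.

```python
class LinearMemory:
    """Toy linear associative memory M (dim x dim), updated at inference time.

    Assumption: the paper's memory is a small neural network; a single
    matrix is used here only to keep the sketch short.
    """

    def __init__(self, dim, lr=0.1, momentum=0.9, decay=0.01):
        self.dim = dim
        self.lr = lr              # step size on the "surprise" gradient
        self.momentum = momentum  # how much past surprise carries over
        self.decay = decay        # forgetting rate applied to M
        self.M = [[0.0] * dim for _ in range(dim)]
        self.S = [[0.0] * dim for _ in range(dim)]  # momentum buffer

    def read(self, query):
        """Retrieve context for a query vector: M @ query."""
        return [sum(m * q for m, q in zip(row, query)) for row in self.M]

    def write(self, key, value):
        """One gradient step on ||M @ key - value||^2 with momentum and decay."""
        pred = self.read(key)
        err = [p - v for p, v in zip(pred, value)]
        for i in range(self.dim):
            for j in range(self.dim):
                grad = 2.0 * err[i] * key[j]
                self.S[i][j] = self.momentum * self.S[i][j] - self.lr * grad
                self.M[i][j] = (1.0 - self.decay) * self.M[i][j] + self.S[i][j]


def memory_as_layer(hidden_states, memory, gate=0.5):
    """Add the memory read-out to each token as a residual.

    The output has one vector per input token, so the host model's
    sequence length and token positions are unchanged.
    """
    out = []
    for h in hidden_states:
        ctx = memory.read(h)
        out.append([hi + gate * ci for hi, ci in zip(h, ctx)])
    return out


# Usage: write one (key, value) pair from an earlier utterance, then
# read it back while processing the current utterance's audio tokens.
memory = LinearMemory(dim=2)
memory.write([1.0, 0.0], [0.0, 1.0])
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
updated = memory_as_layer(tokens, memory)
```

Because the read-out is added as a residual, an empty memory leaves the hidden states untouched, which is one way such an adapter can be attached without disturbing the pretrained backbone.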
Problem

Research questions and friction points this paper is trying to address.

speech emotion recognition
conversational context
dialogue state
test-time memory
audio language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

test-time memory
conversational speech emotion recognition
Memory-as-a-Layer
audio language models
residual contextual mechanism