AI Summary
To address insufficient cross-modal interaction modeling and limited exploitation of inter-frame temporal context in audio-visual speaker diarization (AVSD), this paper proposes CASA-Net, an end-to-end framework for Task 1 of the MISP 2025 Challenge. CASA-Net introduces a co-modeling mechanism that jointly integrates cross-modal cross-attention (CA) and intra-modal self-attention (SA). It further incorporates pseudo-label iterative refinement during training, plus median filtering and overlapping-frame averaging as post-processing, to improve the robustness of temporal predictions. On the official evaluation set, CASA-Net achieves a diarization error rate (DER) of 8.18%, a 47.3% relative reduction over the 15.52% baseline, demonstrating substantially improved audio-visual synergy and speaker discrimination in multi-speaker scenarios.
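The CA + SA co-modeling idea can be illustrated with a minimal NumPy sketch: each modality attends to the other (cross-attention), the two attended streams are concatenated, and self-attention is then applied over the fused frame sequence to capture temporal context. This is a single-head, toy-dimension illustration under assumed shapes, not the paper's actual implementation (which would use learned projections and multi-head attention); `casa_fuse` and all dimensions here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: (T_q, d) queries against (T_k, d) keys.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def casa_fuse(audio, video):
    # Cross-attention (CA): each modality queries the other.
    a2v = attention(audio, video, video)  # audio attends to video frames
    v2a = attention(video, audio, audio)  # video attends to audio frames
    fused = np.concatenate([a2v, v2a], axis=-1)  # (T, 2d) fused embeddings
    # Self-attention (SA): contextual relationships among fused AV frames.
    return attention(fused, fused, fused)

rng = np.random.default_rng(0)
T, d = 50, 16  # hypothetical: 50 frames, 16-dim per-modality embeddings
out = casa_fuse(rng.standard_normal((T, d)), rng.standard_normal((T, d)))
print(out.shape)  # (50, 32)
```

In the real system the fused frame-level embeddings would feed a classifier that emits per-speaker activity probabilities per frame.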
Abstract
This paper presents the system developed for Task 1 of the Multi-modal Information-based Speech Processing (MISP) 2025 Challenge. We introduce CASA-Net, an embedding fusion method designed for end-to-end audio-visual speaker diarization (AVSD) systems. CASA-Net incorporates a cross-attention (CA) module to effectively capture cross-modal interactions in audio-visual signals and employs a self-attention (SA) module to learn contextual relationships among audio-visual frames. To further enhance performance, we adopt a training strategy that integrates pseudo-label refinement and retraining, improving the accuracy of timestamp predictions. Additionally, median filtering and overlap averaging are applied as post-processing techniques to eliminate outliers and smooth prediction labels. Our system achieved a diarization error rate (DER) of 8.18% on the evaluation set, representing a relative improvement of 47.3% over the baseline DER of 15.52%.
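The two post-processing steps named in the abstract, median filtering to remove outlier frames and overlap averaging to smooth predictions across overlapping inference windows, can be sketched as follows. This is a hedged illustration with hypothetical function names and window sizes, not the authors' code.

```python
import numpy as np

def median_filter(probs, k=5):
    """Sliding-window median over time, per speaker channel.

    probs: (T, n_speakers) frame-level activity probabilities.
    Removes isolated spurious flips (outliers) shorter than k//2 frames.
    """
    pad = k // 2
    padded = np.pad(probs, ((pad, pad), (0, 0)), mode="edge")
    windows = np.stack([padded[i:i + len(probs)] for i in range(k)])
    return np.median(windows, axis=0)

def overlap_average(chunks, total_len):
    """Average predictions from overlapping inference windows.

    chunks: list of (start_frame, probs) pairs, probs of shape (T_i, n_spk).
    Frames covered by several windows get the mean of their predictions.
    """
    n_spk = chunks[0][1].shape[1]
    acc = np.zeros((total_len, n_spk))
    cnt = np.zeros((total_len, 1))
    for start, p in chunks:
        acc[start:start + len(p)] += p
        cnt[start:start + len(p)] += 1
    return acc / np.maximum(cnt, 1)

# Two overlapping 6-frame windows over a 10-frame recording (toy values).
chunks = [(0, np.full((6, 2), 0.2)), (4, np.full((6, 2), 0.6))]
smoothed = median_filter(overlap_average(chunks, total_len=10))
```

Averaging over overlaps reduces boundary artifacts at window edges, and the median filter then suppresses any remaining single-frame speaker flips before thresholding into timestamps.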