AI Summary
To address insufficient cross-modal interaction modeling and limited exploitation of inter-frame temporal context in audio-visual speaker diarization (AVSD), this paper proposes CASA-Net, an end-to-end framework for Task 1 of the MISP 2025 Challenge. CASA-Net introduces a co-modeling mechanism that jointly integrates cross-modal cross-attention (CA) and intra-modal self-attention (SA). It further incorporates pseudo-label iterative refinement during training, plus median filtering and overlapping-frame averaging as post-processing, to improve the robustness of temporal predictions. On the official evaluation set, CASA-Net achieves a diarization error rate (DER) of 8.18%, a 47.3% relative reduction over the 15.52% baseline, demonstrating substantially improved audio-visual synergy and speaker discrimination in multi-speaker scenarios.
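The CA + SA co-modeling idea can be illustrated with a minimal NumPy sketch: each modality attends to the other (cross-attention), the two attended streams are concatenated, and self-attention is then applied over the fused frame sequence to capture temporal context. This is a single-head, toy-dimension illustration under assumed shapes, not the paper's actual implementation (which would use learned projections and multi-head attention); `casa_fuse` and all dimensions here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: (T_q, d) queries against (T_k, d) keys.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def casa_fuse(audio, video):
    # Cross-attention (CA): each modality queries the other.
    a2v = attention(audio, video, video)  # audio attends to video frames
    v2a = attention(video, audio, audio)  # video attends to audio frames
    fused = np.concatenate([a2v, v2a], axis=-1)  # (T, 2d) fused embeddings
    # Self-attention (SA): contextual relationships among fused AV frames.
    return attention(fused, fused, fused)

rng = np.random.default_rng(0)
T, d = 50, 16  # hypothetical: 50 frames, 16-dim per-modality embeddings
out = casa_fuse(rng.standard_normal((T, d)), rng.standard_normal((T, d)))
print(out.shape)  # (50, 32)
```

In the real system the fused frame-level embeddings would feed a classifier that emits per-speaker activity probabilities per frame.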
Abstract
This paper presents the system developed for Task 1 of the Multi-modal Information-based Speech Processing (MISP) 2025 Challenge. We introduce CASA-Net, an embedding fusion method designed for end-to-end audio-visual speaker diarization (AVSD) systems. CASA-Net incorporates a cross-attention (CA) module to effectively capture cross-modal interactions in audio-visual signals and employs a self-attention (SA) module to learn contextual relationships among audio-visual frames. To further enhance performance, we adopt a training strategy that integrates pseudo-label refinement and retraining, improving the accuracy of timestamp predictions. Additionally, median filtering and overlap averaging are applied as post-processing techniques to eliminate outliers and smooth prediction labels. Our system achieved a diarization error rate (DER) of 8.18% on the evaluation set, representing a relative improvement of 47.3% over the baseline DER of 15.52%.
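The two post-processing steps named in the abstract, median filtering to remove outlier frames and overlap averaging to smooth predictions across overlapping inference windows, can be sketched as follows. This is a hedged illustration with hypothetical function names and window sizes, not the authors' code.

```python
import numpy as np

def median_filter(probs, k=5):
    """Sliding-window median over time, per speaker channel.

    probs: (T, n_speakers) frame-level activity probabilities.
    Removes isolated spurious flips (outliers) shorter than k//2 frames.
    """
    pad = k // 2
    padded = np.pad(probs, ((pad, pad), (0, 0)), mode="edge")
    windows = np.stack([padded[i:i + len(probs)] for i in range(k)])
    return np.median(windows, axis=0)

def overlap_average(chunks, total_len):
    """Average predictions from overlapping inference windows.

    chunks: list of (start_frame, probs) pairs, probs of shape (T_i, n_spk).
    Frames covered by several windows get the mean of their predictions.
    """
    n_spk = chunks[0][1].shape[1]
    acc = np.zeros((total_len, n_spk))
    cnt = np.zeros((total_len, 1))
    for start, p in chunks:
        acc[start:start + len(p)] += p
        cnt[start:start + len(p)] += 1
    return acc / np.maximum(cnt, 1)

# Two overlapping 6-frame windows over a 10-frame recording (toy values).
chunks = [(0, np.full((6, 2), 0.2)), (4, np.full((6, 2), 0.6))]
smoothed = median_filter(overlap_average(chunks, total_len=10))
```

Averaging over overlaps reduces boundary artifacts at window edges, and the median filter then suppresses any remaining single-frame speaker flips before thresholding into timestamps.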