Learning Sparsity for Effective and Efficient Music Performance Question Answering

📅 2025-06-02
📈 Citations: 1
Influential: 0
🤖 AI Summary
This paper proposes Sparsify, an end-to-end sparse learning framework that addresses three key challenges in music audio-visual question answering (Music AVQA): dense representations that make critical information hard to isolate, high inter-modal redundancy, and inefficient use of training samples. Sparsify integrates sparse representation learning, multimodal feature pruning, and audio-visual joint attention distillation, complemented by a key-subset selection algorithm that retains 70–80% of full-data performance while using only about 25% of the training samples. On the Music AVQA benchmarks, Sparsify achieves state-of-the-art accuracy while cutting training time by 28.32% relative to its fully trained dense counterpart, improving both computational and data efficiency. Its core contribution is the systematic incorporation of structured sparsity into multimodal AVQA modeling, unifying representation compactness, computational efficiency, and sample-aware learning.

📝 Abstract
Music performances, characterized by dense and continuous audio as well as seamless audio-visual integration, present unique challenges for multimodal scene understanding and reasoning. Recent Music Performance Audio-Visual Question Answering (Music AVQA) datasets have been proposed to reflect these challenges, highlighting the continued need for more effective integration of audio-visual representations in complex question answering. However, existing Music AVQA methods often rely on dense and unoptimized representations, leading to inefficiencies in the isolation of key information, the reduction of redundancy, and the prioritization of critical samples. To address these challenges, we introduce Sparsify, a sparse learning framework specifically designed for Music AVQA. It integrates three sparsification strategies into an end-to-end pipeline and achieves state-of-the-art performance on the Music AVQA datasets. In addition, it reduces training time by 28.32% compared to its fully trained dense counterpart while maintaining accuracy, demonstrating clear efficiency gains. To further improve data efficiency, we propose a key-subset selection algorithm that selects and uses approximately 25% of MUSIC-AVQA v2.0 training data and retains 70-80% of full-data performance across models.
Problem

Research questions and friction points this paper is trying to address.

Addressing inefficiencies in audio-visual representation for Music AVQA
Reducing redundancy and isolating key information in dense music performance data
Improving data efficiency with sparse learning and subset selection
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end sparse learning framework (Sparsify) for Music AVQA
Three sparsification strategies integrated into a single pipeline
Key-subset selection algorithm that trains on ~25% of the data while retaining 70–80% of full-data performance
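The paper does not spell out its key-subset selection algorithm here, so as a purely illustrative stand-in, the sketch below uses a common difficulty-based heuristic: score each training sample (e.g., by its loss from a warm-up pass) and keep the highest-scoring fraction. The function name, the loss-based scoring, and the 25% default are assumptions for illustration, not the paper's actual method.

```python
import numpy as np

def select_key_subset(sample_scores, fraction=0.25):
    """Return sorted indices of the top-`fraction` samples by score.

    sample_scores: one difficulty score per training sample
    (e.g., per-sample loss from a warm-up epoch; assumed proxy).
    """
    scores = np.asarray(sample_scores, dtype=float)
    k = max(1, int(round(fraction * len(scores))))
    order = np.argsort(scores)[::-1]   # indices, highest score first
    return np.sort(order[:k])          # kept indices in dataset order

# Example: keep the hardest 25% of four samples.
subset = select_key_subset([0.1, 0.9, 0.5, 0.05], fraction=0.25)
print(subset)  # → [1]
```

The selected indices would then be used to build a reduced training set (e.g., a `torch.utils.data.Subset`), trading a small accuracy drop for a large cut in training cost, in the spirit of the 25%-of-data result reported above.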