The TMU System for the XACLE Challenge: Training Large Audio Language Models with CLAP Pseudo-Labels

📅 2026-01-31
🤖 AI Summary
This work addresses the challenge of enhancing semantic alignment between general audio and text by proposing a three-stage training framework based on a Large Audio Language Model (LALM). The approach begins with pretraining using automatically generated audio captions, followed by a second pretraining phase leveraging pseudo-labels produced by CLAP, and concludes with fine-tuning on the XACLE dataset. Notably, this is the first study to incorporate CLAP-generated pseudo-labels into LALM pretraining, substantially improving cross-modal alignment. Evaluated on the XACLE test set, the model achieves a Spearman's rank correlation coefficient (SRCC) of 0.632, significantly outperforming the baseline system (SRCC: 0.334) and securing third place in the competition.
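The summary does not say how the CLAP pseudo-labels are computed. One common way to score an audio-text pair with a contrastive model such as CLAP is the cosine similarity between the audio embedding and the text embedding, so the sketch below assumes that form. The function names and the toy embedding vectors are hypothetical and stand in for real CLAP encoder outputs; this is not the authors' implementation.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def pseudo_label(audio_emb: Sequence[float], text_emb: Sequence[float]) -> float:
    # ASSUMPTION: the pseudo-label is the raw cosine similarity of the two
    # embeddings. The source does not specify the scoring or any rescaling.
    return cosine_similarity(audio_emb, text_emb)


# Toy vectors standing in for CLAP audio/text embeddings (not real model output).
audio_emb = [0.2, 0.9, 0.1]
matching_text_emb = [0.25, 0.85, 0.05]
unrelated_text_emb = [0.9, -0.1, 0.3]

print(pseudo_label(audio_emb, matching_text_emb))
print(pseudo_label(audio_emb, unrelated_text_emb))
```

Under this assumption, unlabeled audio-text pairs receive a continuous alignment target without human annotation, which is what makes a second pretraining stage possible before fine-tuning on the XACLE labels.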

📝 Abstract
In this paper, we describe our submission to the x-to-audio alignment (XACLE) challenge. The goal is to predict the semantic alignment of a given pair of general audio and text. The proposed system is based on a large audio language model (LALM) architecture. We employ a three-stage training pipeline: automated audio captioning pretraining, pretraining with CLAP pseudo-labels, and fine-tuning on the XACLE dataset. Our experiments show that pretraining with CLAP pseudo-labels is the primary performance driver. On the XACLE test set, our system reaches an SRCC of 0.632, significantly outperforming the baseline system (0.334) and securing third place in the challenge team ranking. Code and models can be found at https://github.com/shiotalab-tmu/tmu-xacle2026
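The reported metric, SRCC, is Spearman's rank correlation coefficient: the Pearson correlation computed on the ranks of the predicted and reference scores. The following is a minimal, dependency-free sketch of that definition, with tied values given their average rank. The helper names and the example scores are illustrative only and are not challenge data or the official scoring script.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def srcc(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Spearman's rank correlation: Pearson correlation of the rank vectors."""
    if len(predicted) != len(reference) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(predicted), average_ranks(reference))


# Made-up model scores and reference ratings, for illustration only.
predicted_scores = [0.81, 0.12, 0.55, 0.40]
reference_scores = [9.0, 2.0, 4.0, 6.0]
print(srcc(predicted_scores, reference_scores))
```

Because SRCC depends only on rank order, a system is rewarded for ordering audio-text pairs the same way the reference scores do, regardless of the scale of its raw outputs.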
Problem

Research questions and friction points this paper is trying to address.

audio-text alignment
semantic alignment
XACLE challenge
large audio language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

large audio language model
CLAP pseudo-labels
three-stage training
audio-text alignment
XACLE challenge
Ayuto Tsutsumi
Tokyo Metropolitan University, Tokyo, Japan
Kohei Tanaka
Tokyo Metropolitan University, Tokyo, Japan
Sayaka Shiota
Tokyo Metropolitan University
Speech recognition, Speaker verification, Anti-spoofing, Signal processing