CLAP-ART: Automated Audio Captioning with Semantic-rich Audio Representation Tokenizer

📅 2025-06-01
🤖 AI Summary
Existing automated audio captioning (AAC) methods such as EnCLAP rely on discrete tokens generated by EnCodec, a codec optimized for waveform reconstruction rather than semantic representation. These tokens carry weak semantic information, which limits caption quality. To address this, we propose a semantic-aware discrete tokenization approach for AAC: using the CLAP pre-trained audio encoder as the feature backbone, we integrate a differentiable vector quantization (VQ) module to construct a semantically rich audio tokenizer and fine-tune it jointly, end-to-end, with a BART language model. The framework improves audio-text semantic alignment. Evaluation on two mainstream AAC benchmarks shows consistent gains over the EnCLAP baseline, indicating that semantics-driven discretization benefits AAC performance.

📝 Abstract
Automated Audio Captioning (AAC) aims to describe the semantic contexts of general sounds, including acoustic events and scenes, by leveraging effective acoustic features. To enhance performance, an earlier AAC method, EnCLAP, used discrete tokens from EnCodec as input for fine-tuning the BART language model. However, EnCodec is designed to reconstruct waveforms rather than to capture the semantic contexts of general sounds, which are what AAC should describe. To address this issue, we propose CLAP-ART, an AAC method that uses "semantic-rich and discrete" tokens as input. CLAP-ART computes these tokens from pre-trained audio representations (AR) through vector quantization. We experimentally confirmed that CLAP-ART outperforms the EnCLAP baseline on two AAC benchmarks, indicating that discrete tokens derived from semantically rich AR are beneficial for AAC.
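The vector quantization step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a single codebook and plain nearest-neighbor lookup under squared Euclidean distance; the paper may use a different VQ variant, and the codebook here is hand-picked toy data rather than learned. The names `quantize`, `squared_distance`, `codebook`, and `frames` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each frame-level feature vector to the index of its nearest codebook entry.

    The returned indices are the discrete tokens that a language model
    could consume in place of the continuous features.
    """
    tokens = []
    for frame in frames:
        nearest = min(
            range(len(codebook)),
            key=lambda k: squared_distance(frame, codebook[k]),
        )
        tokens.append(nearest)
    return tokens


# Toy stand-ins (assumptions): three 2-D code vectors and four feature
# frames, as if produced by a pre-trained audio encoder.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [-0.8, 0.7], [0.2, 0.1]]

tokens = quantize(frames, codebook)
print(tokens)
```

In this toy setup each frame becomes a single integer token. In a real system the codebook would be trained on encoder outputs, and the resulting token sequence would be embedded and passed to the captioning language model.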
Problem

Research questions and friction points this paper is trying to address.

Improving Automated Audio Captioning with semantic-rich tokens
Addressing EnCodec's limitation in capturing sound semantics
Enhancing AAC performance using vector-quantized audio representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic-rich audio representation tokenizer
Vector quantization of pre-trained audio representations
Discrete tokens for audio captioning