Audio-FLAN: A Preliminary Release

📅 2025-02-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current audio-language modeling is hindered by the split between understanding and generation tasks, largely due to the absence of a unified, large-scale, multi-task instruction dataset. To address this, the authors introduce Audio-FLAN, the first large-scale audio instruction-tuning dataset to unify understanding and generation, covering 80 diverse tasks across the speech, music, and sound domains with over 100 million instances. All tasks are cast into a single instruction format, laying the groundwork for audio-language models that handle both understanding (e.g., transcription, comprehension) and generation (e.g., speech, music, sound) in a zero-shot manner. The dataset is publicly released on HuggingFace and GitHub and is actively maintained.

📝 Abstract
Recent advancements in audio tokenization have significantly enhanced the integration of audio capabilities into large language models (LLMs). However, audio understanding and generation are often treated as distinct tasks, hindering the development of truly unified audio-language models. While instruction tuning has demonstrated remarkable success in improving generalization and zero-shot learning across text and vision, its application to audio remains largely unexplored. A major obstacle is the lack of comprehensive datasets that unify audio understanding and generation. To address this, we introduce Audio-FLAN, a large-scale instruction-tuning dataset covering 80 diverse tasks across speech, music, and sound domains, with over 100 million instances. Audio-FLAN lays the foundation for unified audio-language models that can seamlessly handle both understanding (e.g., transcription, comprehension) and generation (e.g., speech, music, sound) tasks across a wide range of audio domains in a zero-shot manner. The Audio-FLAN dataset is available on HuggingFace and GitHub and will be continuously updated.
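The abstract describes casting both understanding tasks (e.g., transcription) and generation tasks (e.g., speech synthesis) into one instruction format. The paper's actual template schema is not given here, so the sketch below uses hypothetical field names purely to illustrate how a single record structure can cover both task directions:

```python
# Minimal sketch of a unified instruction-format record. Field names
# ("task", "instruction", "input_audio", "target") are illustrative
# assumptions, not the dataset's actual schema.

def make_instance(task, instruction, input_audio, target):
    """Wrap one example in a single template usable for any audio task."""
    return {
        "task": task,                # e.g. "asr", "tts", "music_captioning"
        "instruction": instruction,  # natural-language task description
        "input_audio": input_audio,  # audio reference, or None for generation
        "target": target,            # text output, or an audio reference
    }

# Understanding direction: speech in, text out (ASR).
asr = make_instance(
    task="asr",
    instruction="Transcribe the following speech recording.",
    input_audio="clip_0001.wav",
    target="hello world",
)

# Generation direction: text in, audio out (TTS); the audio is the target.
tts = make_instance(
    task="tts",
    instruction="Read the following text aloud: hello world",
    input_audio=None,
    target="clip_0002.wav",
)
```

Because both directions share one schema, a model can be instruction-tuned on understanding and generation examples in a single training stream, which is the unification the abstract emphasizes.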
Problem

Research questions and friction points this paper is trying to address.

Lack of unified audio-language models
Shortage of large-scale audio instruction-tuning datasets
Audio understanding and generation treated as separate tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

A large-scale, multi-task instruction-tuning dataset (Audio-FLAN)
A foundation for unified audio-language models
Zero-shot generalization across diverse audio tasks
👥 Authors

Liumeng Xue
Hong Kong University of Science and Technology
Audio, Speech and Language Processing, Speech Generation

Ziya Zhou
The Hong Kong University of Science and Technology
Music Technology, Natural Language Processing

Jiahao Pan
Hong Kong University of Science and Technology
Speech Processing, Speech Enhancement, Music Generation

Zixuan Li
Assistant Professor at ICT, UCAS
Knowledge Graph, Large Language Model

Shuai Fan
Beihang University

Yinghao Ma
PhD candidate, Centre for Digital Music (C4DM), Queen Mary University of London
Music Information Retrieval, Large Language Models, Multimodal Learning, Audio Signal Processing

Sitong Cheng
The Hong Kong University of Science and Technology

Dongchao Yang
Chinese University of Hong Kong
TTS, TTA, Audio Codec, Multi-modal Audio Foundation Models

Haohan Guo
Chinese University of Hong Kong
Speech Synthesis, Voice Conversion, Speech Processing

Yujia Xiao
The Chinese University of Hong Kong
Speech

Xinsheng Wang
Hong Kong University of Science and Technology (HKUST)
Speech Synthesis, Singing Voice Synthesis, Voice Conversion

Zixuan Shen
The Hong Kong University of Science and Technology

Chuanbo Zhu
The Hong Kong University of Science and Technology

Xinshen Zhang
The Hong Kong University of Science and Technology

Tianchi Liu
Tencent, Singapore; Ph.D. @ National University of Singapore; Ex-A*STAR, Singapore
Text-to-Speech, Speech-LLM, Speaker Verification, Anti-spoofing, Deepfake Detection

Ruibin Yuan
HKUST
Artificial Intelligence, Music Generation, Music Information Retrieval, Computer Music

Zeyue Tian
Hong Kong University of Science and Technology
Music Generation, Generative AI, Multi-Modal Learning

Haohe Liu
Research Scientist at Meta AI
Audio Generation, Audio Classification, Speech Quality Enhancement, Music Source Separation

Emmanouil Benetos
Queen Mary University of London
Machine Listening, Audio Signal Processing, Music Information Retrieval, Machine Learning

Ge Zhang
M-A-P

Yike Guo
The Hong Kong University of Science and Technology

Wei Xue
The Hong Kong University of Science and Technology