Audio-FLAN: A Preliminary Release

📅 2025-02-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current audio-language modeling is hindered by the split between understanding and generation tasks, largely due to the absence of a unified, large-scale, multi-task instruction dataset. To address this, the authors introduce Audio-FLAN, the first large-scale audio instruction-tuning dataset to unify understanding and generation, covering 80 diverse tasks across the speech, music, and sound domains with over 100 million instances. All tasks are cast into a single instruction format, laying the groundwork for audio-language models that handle both understanding (e.g., transcription, comprehension) and generation (e.g., speech, music, sound) in a zero-shot manner. The dataset is publicly released on HuggingFace and GitHub and is actively maintained.

📝 Abstract
Recent advancements in audio tokenization have significantly enhanced the integration of audio capabilities into large language models (LLMs). However, audio understanding and generation are often treated as distinct tasks, hindering the development of truly unified audio-language models. While instruction tuning has demonstrated remarkable success in improving generalization and zero-shot learning across text and vision, its application to audio remains largely unexplored. A major obstacle is the lack of comprehensive datasets that unify audio understanding and generation. To address this, we introduce Audio-FLAN, a large-scale instruction-tuning dataset covering 80 diverse tasks across speech, music, and sound domains, with over 100 million instances. Audio-FLAN lays the foundation for unified audio-language models that can seamlessly handle both understanding (e.g., transcription, comprehension) and generation (e.g., speech, music, sound) tasks across a wide range of audio domains in a zero-shot manner. The Audio-FLAN dataset is available on HuggingFace and GitHub and will be continuously updated.
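The abstract describes casting both understanding tasks (e.g., transcription) and generation tasks (e.g., speech synthesis) into one instruction format. The paper's actual template schema is not given here, so the sketch below uses hypothetical field names purely to illustrate how a single record structure can cover both task directions:

```python
# Minimal sketch of a unified instruction-format record. Field names
# ("task", "instruction", "input_audio", "target") are illustrative
# assumptions, not the dataset's actual schema.

def make_instance(task, instruction, input_audio, target):
    """Wrap one example in a single template usable for any audio task."""
    return {
        "task": task,                # e.g. "asr", "tts", "music_captioning"
        "instruction": instruction,  # natural-language task description
        "input_audio": input_audio,  # audio reference, or None for generation
        "target": target,            # text output, or an audio reference
    }

# Understanding direction: speech in, text out (ASR).
asr = make_instance(
    task="asr",
    instruction="Transcribe the following speech recording.",
    input_audio="clip_0001.wav",
    target="hello world",
)

# Generation direction: text in, audio out (TTS); the audio is the target.
tts = make_instance(
    task="tts",
    instruction="Read the following text aloud: hello world",
    input_audio=None,
    target="clip_0002.wav",
)
```

Because both directions share one schema, a model can be instruction-tuned on understanding and generation examples in a single training stream, which is the unification the abstract emphasizes.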
Problem

Research questions and friction points this paper is trying to address.

Lack of unified audio-language models
Shortage of large-scale audio instruction-tuning datasets
Audio understanding and generation treated as separate tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

A large-scale, multi-task instruction-tuning dataset (Audio-FLAN)
A foundation for unified audio-language models
Zero-shot generalization across diverse audio tasks
👥 Authors

Liumeng Xue
Hong Kong University of Science and Technology
Audio, Speech and Language Processing, Speech Generation

Ziya Zhou
The Hong Kong University of Science and Technology
Music Technology, Natural Language Processing

Jiahao Pan
Hong Kong University of Science and Technology
Speech Processing, Speech Enhancement, Music Generation

Zixuan Li
Assistant Professor at ICT, UCAS
Knowledge Graph, Large Language Model

Shuai Fan
Beihang University

Yinghao Ma
PhD candidate, Centre for Digital Music (C4DM), Queen Mary University of London
Music Information Retrieval, Large Language Models, Multimodal Learning, Audio Signal Processing

Sitong Cheng
The Hong Kong University of Science and Technology

Dongchao Yang
Chinese University of Hong Kong
TTS, TTA, Audio Codec, Multi-modal Audio Foundation Models

Haohan Guo
Chinese University of Hong Kong
Speech Synthesis, Voice Conversion, Speech Processing

Yujia Xiao
The Chinese University of Hong Kong
Speech

Xinsheng Wang
Hong Kong University of Science and Technology (HKUST)
Speech Synthesis, Singing Voice Synthesis, Voice Conversion

Zixuan Shen
The Hong Kong University of Science and Technology

Chuanbo Zhu
The Hong Kong University of Science and Technology

Xinshen Zhang
The Hong Kong University of Science and Technology

Tianchi Liu
Tencent, Singapore; Ph.D. @ National University of Singapore; Ex-A*STAR, Singapore
Text-to-Speech, Speech-LLM, Speaker Verification, Anti-spoofing, Deepfake Detection

Ruibin Yuan
HKUST
Artificial Intelligence, Music Generation, Music Information Retrieval, Computer Music

Zeyue Tian
Hong Kong University of Science and Technology
Music Generation, Generative AI, Multi-Modal Learning

Haohe Liu
Research Scientist at Meta AI
Audio Generation, Audio Classification, Speech Quality Enhancement, Music Source Separation

Emmanouil Benetos
Queen Mary University of London
Machine Listening, Audio Signal Processing, Music Information Retrieval, Machine Learning

Ge Zhang
M-A-P

Yike Guo
The Hong Kong University of Science and Technology

Wei Xue
The Hong Kong University of Science and Technology