🤖 AI Summary
Translating song lyrics for animated musicals requires jointly optimizing semantic fidelity, syllabic timing, poetic style, and audiovisual synchronization, which poses significant challenges. To address this, the authors introduce MAVL, the first multilingual audiovisual benchmark for singable lyrics translation, featuring aligned text, audio, and video modalities. Building on it, they propose SylAVL-CoT, an LLM-based method that leverages audio-video cues and enforces syllable-level constraints via chain-of-thought (CoT) decoding. Experimental results demonstrate that SylAVL-CoT substantially outperforms text-only baselines in both singability and contextual accuracy, validating the value of multimodal, multilingual modeling for animated song translation.
📝 Abstract
Lyrics translation requires both accurate semantic transfer and preservation of musical rhythm, syllabic structure, and poetic style. In animated musicals, the challenge intensifies due to the need for alignment with visual and auditory cues. We introduce the Multilingual Audio-Video Lyrics Benchmark for Animated Song Translation (MAVL), the first multilingual, multimodal benchmark for singable lyrics translation. By integrating text, audio, and video, MAVL enables richer and more expressive translations than text-only approaches. Building on this, we propose the Syllable-Constrained Audio-Video LLM with Chain-of-Thought (SylAVL-CoT), which leverages audio-video cues and enforces syllabic constraints to produce natural-sounding lyrics. Experimental results demonstrate that SylAVL-CoT significantly outperforms text-based models in singability and contextual accuracy, underscoring the value of multimodal, multilingual approaches for lyrics translation.
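The core constraint behind singability is matching each translated line's syllable count to the melody's note count. The paper does not specify its implementation, but the idea can be sketched minimally as follows, using a simple vowel-group heuristic for English syllable counting (the function names, heuristic, and `tolerance` parameter are illustrative assumptions, not the authors' method):

```python
import re

def count_syllables(word: str) -> int:
    """Rough English syllable count via a vowel-group heuristic (assumption,
    not the paper's method)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # A trailing silent 'e' usually does not add a syllable ("time" -> 1).
    if word.endswith("e") and count > 1 and not word.endswith(("le", "ee")):
        count -= 1
    return max(count, 1)

def line_syllables(line: str) -> int:
    """Total syllables across the words of a lyric line."""
    return sum(count_syllables(w) for w in re.findall(r"[a-zA-Z']+", line))

def singable(translation: str, target_syllables: int, tolerance: int = 0) -> bool:
    """Accept a candidate line only if its syllable count matches the
    melody's note count within the given tolerance."""
    return abs(line_syllables(translation) - target_syllables) <= tolerance
```

A constrained decoder could call a check like `singable` on each candidate line during chain-of-thought generation, rejecting or revising lines whose syllable counts drift from the melody.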