LTX-2: Efficient Joint Audio-Visual Foundation Model

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 6
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing text-to-video diffusion models: they produce silent video, without audio that matches the visual content semantically, emotionally, and atmospherically. The authors propose LTX-2, a unified audio-visual generative foundation model built on an asymmetric dual-stream Transformer with a 14B-parameter video stream and a 5B-parameter audio stream. The two streams are coupled through bidirectional audio-video cross-attention with temporal positional embeddings and cross-modality AdaLN for shared timestep conditioning, and a modality-aware classifier-free guidance (modality-CFG) mechanism improves audiovisual alignment and controllability. With a multilingual text encoder for broader prompt understanding, the model reports state-of-the-art audiovisual quality and prompt adherence among open-source systems, with results comparable to proprietary models at a fraction of their computational cost and inference time.
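
To make the modality-aware CFG idea concrete, here is a minimal sketch of one plausible form: the fully conditioned prediction for a stream is pushed away from a prediction made without the text prompt and, separately, away from a prediction made without the other modality. This page does not give the paper's exact formula, so the function name `modality_cfg`, its arguments, and the two scale parameters are assumptions for illustration only.

```python
import numpy as np

def modality_cfg(full, no_text, no_cross, text_scale, cross_scale):
    """Combine three denoiser predictions for one modality stream.

    full      -- prediction conditioned on text and on the other modality
    no_text   -- prediction with the text prompt dropped
    no_cross  -- prediction with the other modality dropped
    The two scales are hypothetical knobs; scales of 0 return `full` unchanged.
    """
    full, no_text, no_cross = (
        np.asarray(a, dtype=float) for a in (full, no_text, no_cross)
    )
    text_push = full - no_text     # direction that strengthens prompt adherence
    cross_push = full - no_cross   # direction that strengthens audio-video agreement
    return full + text_scale * text_push + cross_scale * cross_push

# Toy values standing in for real model outputs.
full = np.array([1.0, 2.0])
no_text = np.array([0.0, 0.0])
no_cross = np.array([1.0, 1.0])
guided = modality_cfg(full, no_text, no_cross, text_scale=2.0, cross_scale=3.0)
print(guided)
```

Keeping the two scales separate is what would let a user trade prompt adherence against audio-video agreement independently, which is how the summary's "controllability" claim could be read under this assumption.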

📝 Abstract
Recent text-to-video diffusion models can generate compelling video sequences, yet they remain silent -- missing the semantic, emotional, and atmospheric cues that audio provides. We introduce LTX-2, an open-source foundational model capable of generating high-quality, temporally synchronized audiovisual content in a unified manner. LTX-2 consists of an asymmetric dual-stream transformer with a 14B-parameter video stream and a 5B-parameter audio stream, coupled through bidirectional audio-video cross-attention layers with temporal positional embeddings and cross-modality AdaLN for shared timestep conditioning. This architecture enables efficient training and inference of a unified audiovisual model while allocating more capacity for video generation than audio generation. We employ a multilingual text encoder for broader prompt understanding and introduce a modality-aware classifier-free guidance (modality-CFG) mechanism for improved audiovisual alignment and controllability. Beyond generating speech, LTX-2 produces rich, coherent audio tracks that follow the characters, environment, style, and emotion of each scene -- complete with natural background and foley elements. In our evaluations, the model achieves state-of-the-art audiovisual quality and prompt adherence among open-source systems, while delivering results comparable to proprietary models at a fraction of their computational cost and inference time. All model weights and code are publicly released.
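
The coupling described in the abstract can be illustrated with a small NumPy sketch of bidirectional cross-attention between two token streams of different widths: video tokens read from audio tokens, and audio tokens read from video tokens. This is a single-head toy under stated assumptions, not the LTX-2 implementation. The widths, token counts, random weights, and the helper names `cross_attend` and `proj` are all hypothetical, and the temporal positional embeddings and cross-modality AdaLN are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, context, w_q, w_k, w_v, w_o):
    """Single-head cross-attention: `queries` tokens read from `context` tokens."""
    q = queries @ w_q                                   # (n_q, d_attn)
    k = context @ w_k                                   # (n_c, d_attn)
    v = context @ w_v                                   # (n_c, d_attn)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (n_q, n_c), rows sum to 1
    return (weights @ v) @ w_o                          # project back to the query width

rng = np.random.default_rng(0)

def proj(d_in, d_out):
    """Random projection standing in for a learned weight matrix."""
    return rng.normal(size=(d_in, d_out)) / np.sqrt(d_in)

# Hypothetical sizes: a wider video stream and a narrower audio stream.
d_video, d_audio, d_attn = 64, 32, 16
video = rng.normal(size=(12, d_video))   # 12 video tokens
audio = rng.normal(size=(20, d_audio))   # 20 audio tokens

# Video tokens attend to audio tokens (residual update of the video stream).
video_out = video + cross_attend(
    video, audio,
    proj(d_video, d_attn), proj(d_audio, d_attn),
    proj(d_audio, d_attn), proj(d_attn, d_video),
)

# Audio tokens attend to video tokens (residual update of the audio stream).
audio_out = audio + cross_attend(
    audio, video,
    proj(d_audio, d_attn), proj(d_video, d_attn),
    proj(d_video, d_attn), proj(d_attn, d_audio),
)

print(video_out.shape, audio_out.shape)
```

Each direction has its own projections into a shared attention width, so the two streams can keep different hidden sizes, which is one way an asymmetric design could give the video stream more capacity than the audio stream.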
Problem

Research questions and friction points this paper is trying to address.

text-to-video
audiovisual generation
diffusion models
multimodal alignment
foundation model
Innovation

Methods, ideas, or system contributions that make the work stand out.

audiovisual generation
asymmetric dual-stream transformer
cross-modal attention
modality-aware CFG
text-to-audiovisual diffusion
Authors

Yoav HaCohen
PhD, Hebrew University, Lightricks
Multimodal Generative AI, Computational Photography, Computer Vision

Benny Brazowski
Lightricks

Nisan Chiprut
Lightricks
GenAI

Yaki Bitterman
Lightricks

Andrew Kvochko
Lightricks

Avishai Berkowitz
Lightricks

Daniel Shalem
Lightricks

Daphna Lifschitz
Lightricks

Dudu Moshe
Lightricks

Eitan Porat
Lightricks

Eitan Richardson
Researcher, Lightricks Ltd
Deep Learning, Computer Vision, Generative AI

Guy Shiran
Lightricks

Itay Chachy
Lightricks

Jonathan Chetboun
Lightricks

Michael Finkelson
Lightricks

Michael Kupchick
Lightricks

Nir Zabari
Researcher
Deep Learning, Computer Vision, Image Processing

N. Guetta
Lightricks

Noa Kotler
Lightricks

Ofir Bibi
Lightricks, Hebrew University of Jerusalem
Machine Learning, Deep Learning, Artificial Intelligence, Statistical Signal Processing

Ori Gordon
Lightricks

Poriya Panet
Lightricks

Roi Benita
Lightricks

Shahar Armon
Lightricks

Victor Kulikov
Lightricks

Yaron Inger
Lightricks

Y. Shiftan
Lightricks

Zeev Melumian
Lightricks

Zeev Farbman
Lightricks