PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation

📅 2026-01-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing discrete video VAEs suffer from poor cross-modal alignment and weak zero-shot transfer because they rely on single-scale codebooks, limited vocabularies, and shallow language supervision. To address these limitations, this work proposes PyraTok, a language-aligned pyramidal tokenizer that discretizes video at multiple scales. Building on a pretrained video VAE, PyraTok adds a Language-aligned Pyramidal Quantization (LaPQ) module that uses a shared large binary codebook to learn semantically structured discrete latents across multiple spatiotemporal resolutions. The model jointly optimizes multi-scale text-guided quantization and a global autoregressive objective over the token hierarchy. Across ten benchmarks, PyraTok achieves state-of-the-art video reconstruction, consistently improves text-to-video generation quality, and sets new zero-shot state-of-the-art results on video segmentation, temporal action localization, and video understanding, while scaling to resolutions up to 4K/8K.

📝 Abstract
Discrete video VAEs underpin modern text-to-video generation and video understanding systems, yet existing tokenizers typically learn visual codebooks at a single scale with limited vocabularies and shallow language supervision, leading to poor cross-modal alignment and zero-shot transfer. We introduce PyraTok, a language-aligned pyramidal tokenizer that learns semantically structured discrete latents across multiple spatiotemporal resolutions. PyraTok builds on a pretrained video VAE and a novel Language-aligned Pyramidal Quantization (LaPQ) module that discretizes encoder features at several depths using a shared large binary codebook, yielding compact yet expressive video token sequences. To tightly couple visual tokens with language, PyraTok jointly optimizes multi-scale text-guided quantization and a global autoregressive objective over the token hierarchy. Across ten benchmarks, PyraTok delivers state-of-the-art (SOTA) video reconstruction, consistently improves text-to-video quality, and sets new SOTA zero-shot performance on video segmentation, temporal action localization, and video understanding, scaling robustly to resolutions up to 4K/8K.
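
To make the idea of one binary codebook shared across pyramid levels concrete, here is a minimal pure-Python sketch. It is not the paper's LaPQ implementation. It assumes sign-based binarization (each channel becomes one bit, so a d-channel feature indexes an implicit codebook of 2^d entries) and uses average pooling as a stand-in for features taken at different encoder depths. The function names and the `factors` parameter are hypothetical, and the text-guided alignment and autoregressive objective are not modeled.

```python
from typing import List, Sequence


def binary_quantize(vec: Sequence[float]) -> List[int]:
    """Assumed binarization rule: one bit per channel, 1 if the value is >= 0."""
    return [1 if x >= 0 else 0 for x in vec]


def bits_to_index(bits: Sequence[int]) -> int:
    """Read the bit pattern as an integer token id (first channel is the MSB)."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return index


def avg_pool(features: Sequence[Sequence[float]], factor: int) -> List[List[float]]:
    """Average consecutive groups of `factor` vectors.

    Stand-in for a coarser spatiotemporal resolution; a real tokenizer
    would take features from different encoder depths instead.
    """
    pooled = []
    for start in range(0, len(features) - factor + 1, factor):
        group = features[start:start + factor]
        dim = len(group[0])
        pooled.append([sum(v[d] for v in group) / factor for d in range(dim)])
    return pooled


def pyramid_tokenize(
    features: Sequence[Sequence[float]],
    factors: Sequence[int] = (4, 2, 1),
) -> List[List[int]]:
    """Return one token list per level, coarse to fine.

    Every level uses the same implicit 2**dim binary codebook, so token
    ids are comparable across scales.
    """
    return [
        [bits_to_index(binary_quantize(v)) for v in avg_pool(features, f)]
        for f in factors
    ]


# Toy input: 4 positions, 3 channels each (codebook size 2**3 = 8).
features = [
    [0.5, -1.0, 2.0],
    [1.5, -0.5, -1.0],
    [-2.0, 1.0, 0.5],
    [-1.0, 3.0, 0.5],
]
print(pyramid_tokenize(features))  # [[3], [5, 3], [5, 4, 3, 3]]
```

In this toy run the coarsest level yields one token, the middle level two, and the finest level four, all drawn from the same eight-entry vocabulary. Because the codebook is implicit in the bit pattern, its size grows exponentially with the channel count without storing any embedding table, which is one plausible reading of how a binary codebook can be both large and compact.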
Problem

Research questions and friction points this paper is trying to address.

discrete video VAE
visual codebook
cross-modal alignment
zero-shot transfer
language supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pyramidal Tokenizer
Language-Aligned Quantization
Discrete Video VAE
Multi-scale Video Representation
Zero-shot Video Understanding
Authors
Onkar Susladkar
University of Illinois Urbana-Champaign
Tushar Prakash
Independent Researcher
Adheesh Juvekar
University of Illinois Urbana-Champaign
Kiet A. Nguyen
Air Force Research Laboratory
deep learning, quantum chemistry, two-photon absorption
Dong-Hwan Jang
University of Illinois Urbana-Champaign
I. Dhillon
UT Austin, Google
Ismini Lourentzou
Assistant Professor, University of Illinois Urbana-Champaign
Machine Learning, Natural Language Processing, Computer Vision