Adaptive Tokenisation Via Temporal Redundancy Masking And Latent Inpainting

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the trade-off between redundant computation and reconstruction quality in adaptive video tokenisation by proposing a parameter-free token allocation mechanism. The method measures per-position temporal L1 differences in the latent space of a frozen continuous video tokeniser and applies a fixed threshold to identify redundant spatial positions, so the token budget follows the content of each input rather than a preset rate. A lightweight Latent Inpainting Transformer (LIT) with factorised spatiotemporal attention reconstructs the dropped positions. Inference requires only a single encoder pass and one LIT forward pass. The approach achieves competitive reconstruction fidelity on TokenBench and DAVIS, with inference approximately 31× faster than ElasticTok-CV and about 2× faster than InfoTok.

📝 Abstract

Adaptive video tokenisation seeks to dynamically allocate token budgets based on the underlying visual complexity of a sequence. Current continuous-regime approaches achieve this via iterative binary search or trained neural regressors, while discrete methods often require a full-rate decoder pass to estimate information content. We demonstrate that such computational overheads are not strictly necessary. We show that the latent space of a frozen continuous video tokeniser inherently encodes temporal redundancy that can be exploited directly: spatial positions whose latent representations change minimally between consecutive frames carry near-zero additional information. We introduce a parameter-free adaptive token allocation mechanism that applies a fixed threshold to per-position temporal-L1 differences, identifying and dropping redundant latent positions. Consequently, the compression rate emerges naturally from the input content rather than being enforced top-down: static scenes are compressed aggressively, while highly dynamic sequences retain more tokens. To reconstruct the dropped positions, we propose the Latent Inpainting Transformer (LIT), a lightweight factorised spatiotemporal attention architecture. The resulting inference pipeline is highly efficient, requiring only a single encoder pass and one LIT forward pass, eliminating the need for auxiliary routing networks. Evaluations on TokenBench and DAVIS, the standard benchmarks used by recent tokenisers~\cite{infotok, agarwal2025cosmos}, indicate that our framework yields meaningful, content-driven token allocation while maintaining competitive reconstruction fidelity, and delivers a $31\times$ inference-time speedup over the continuous adaptive baseline (ElasticTok-CV) and an $\approx2\times$ speedup over the discrete information-theoretic baseline (InfoTok).
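The thresholding step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`temporal_redundancy_mask`, `keep_ratio`), the nested-list latent layout `[frame][position][channel]`, the averaging of the L1 difference over channels, and the choice to always keep the first frame are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import List

# Hypothetical layout: latents[t][p][c] = channel c of flattened spatial
# position p in latent frame t. Real tokenisers use tensors; nested lists
# keep this sketch dependency-free.
Latents = List[List[List[float]]]


def temporal_redundancy_mask(latents: Latents, threshold: float) -> List[List[bool]]:
    """Return keep[t][p], True where position p of frame t is retained.

    A position is treated as redundant (dropped) when the mean absolute
    difference to the same position in the previous frame is at or below
    `threshold`. Averaging over channels is an assumption of this sketch.
    """
    if not latents:
        return []
    num_positions = len(latents[0])
    # Assumption: the first frame has no predecessor, so keep all of it.
    keep = [[True] * num_positions]
    for t in range(1, len(latents)):
        row = []
        for p in range(num_positions):
            cur, prev = latents[t][p], latents[t - 1][p]
            l1 = sum(abs(a - b) for a, b in zip(cur, prev)) / len(cur)
            row.append(l1 > threshold)
        keep.append(row)
    return keep


def keep_ratio(keep: List[List[bool]]) -> float:
    """Fraction of latent positions retained across all frames."""
    total = sum(len(row) for row in keep)
    kept = sum(sum(row) for row in keep)
    return kept / total if total else 0.0


# Three identical frames: only the first frame's positions survive.
static_clip = [[[0.0, 0.0], [1.0, 1.0]] for _ in range(3)]
# Every position changes between frames: nothing is dropped.
dynamic_clip = [[[float(t), float(t)], [float(-t), 0.0]] for t in range(3)]

static_keep = temporal_redundancy_mask(static_clip, threshold=0.1)
dynamic_keep = temporal_redundancy_mask(dynamic_clip, threshold=0.1)
```

With the same fixed threshold, the static clip retains a third of its positions and the dynamic clip retains all of them, which is the content-driven behaviour the abstract describes: the compression rate falls out of the input rather than being set in advance. The dropped positions would then be filled in by the inpainting model, which this sketch does not cover.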

Problem

Research questions and friction points this paper is trying to address.

adaptive tokenisation

temporal redundancy

video compression

computational overhead

token allocation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Tokenisation

Temporal Redundancy Masking

Latent Inpainting