HiLo-Token: Input-Adaptive High-Low Frequency Token Compression for Efficient Image Editing

📅 2026-06-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work targets the high computational cost and inference latency of Diffusion Transformers (DiTs) in image editing, evaluated across a wide range of mask ratios. The authors propose HiLo-Token, an input-adaptive token compression framework based on spatial frequency: all tokens inside a dilated version of the user-specified edit mask are kept to preserve locality and context, while outside the mask only high-frequency detail tokens are kept and low-frequency global structure is represented by tokens from a 16× downsampled image. On an A100-80GB GPU, the method achieves DiT speedups of 3.13×, 2.59×, and 1.67× for small, medium, and large mask ratios, respectively, with no reported regression in generation quality.
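A rough back-of-the-envelope estimate can make the trend in these speedups easier to see: the larger the edit mask, the more tokens must be kept in full, so less compression is available. The Python sketch below is only an illustration. The function name `kept_token_fraction`, the `dilation_growth` and `hf_keep` values, and the reading of "16× downsampled" as 16× per side are all assumptions, not details taken from the paper.

```python
def kept_token_fraction(mask_ratio, dilation_growth=1.5, hf_keep=0.2, lowres_factor=16):
    """Estimate the fraction of the full token grid that survives compression.

    All defaults are hypothetical placeholders, not values from the paper.
    """
    # Tokens inside the dilated edit mask are all kept (capped at the full grid).
    edited = min(1.0, mask_ratio * dilation_growth)
    # Outside the dilated mask, only a fraction of tokens is kept as high-frequency detail.
    high_freq = hf_keep * (1.0 - edited)
    # Extra tokens from the downsampled image; assumes the factor applies per side,
    # so the token count shrinks by lowres_factor squared.
    low_freq = 1.0 / (lowres_factor ** 2)
    return edited + high_freq + low_freq


# Average mask ratios reported in the abstract for small, medium, and large masks.
for ratio in (0.0638, 0.1592, 0.3536):
    print(f"mask ratio {ratio:.4f} -> kept fraction {kept_token_fraction(ratio):.3f}")
```

Under these assumptions the kept fraction grows with the mask ratio, which is consistent with the smaller speedups reported for larger masks. The estimate says nothing about the actual speedup values, which depend on the model and hardware.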

📝 Abstract

Creative image editing tools, such as Photoshop's Remove or Generative Fill buttons, are central to everyday customer use and account for a major share of traffic in Photoshop and Lightroom. However, current generative AI models face significant latency challenges, which become even more pronounced when transitioning from convolution-based U-Nets to Diffusion Transformers (DiTs). In our evaluation on hundreds of representative image editing samples spanning a wide range of mask ratios, the DiT module alone accounts for an average of 73% of the total model latency, even after being distilled from 50 timesteps down to 8 timesteps. To tackle this challenge, we propose HiLo-Token, an input-adaptive token compression framework that allocates more token budget to high-frequency, rich-context regions while assigning fewer tokens to low-frequency areas. Specifically, for the editing region specified by the user mask, we retain all tokens within a dilated mask to preserve strong locality and contextual relevance. Outside the editing region, we introduce a simple yet effective high-frequency token selection strategy based on spatial frequency to capture important local details, while using tokens from a 16x downsampled image to represent low-frequency components and preserve the blurry but global structure. Extensive experiments on production-level evaluation data validate the effectiveness of the proposed method, achieving 3.13x, 2.59x, and 1.67x DiT speedups on A100-80GB for image editing tasks across small, medium, and large mask ratio categories with average ratios of 6.38%, 15.92%, and 35.36%, respectively, without any regression in generation quality.
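The three token groups described in the abstract (all tokens in a dilated mask, high-frequency tokens outside it, and tokens from a downsampled image) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it works on a grayscale pixel array rather than DiT latents, uses per-patch variance as a stand-in high-frequency score (the abstract does not specify the scoring rule), and the names `select_tokens`, `hf_keep`, and `dilation` are hypothetical.

```python
import numpy as np


def block_reduce(arr, size, fn):
    """Apply fn over non-overlapping size x size blocks of a 2-D array."""
    gh, gw = arr.shape[0] // size, arr.shape[1] // size
    blocks = arr[: gh * size, : gw * size].reshape(gh, size, gw, size)
    return fn(blocks, axis=(1, 3))


def dilate(mask, radius):
    """Binary dilation with a square window, implemented with array shifts."""
    h, w = mask.shape
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    out = mask.copy()
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy : dy + h, dx : dx + w]
    return out


def select_tokens(gray, pixel_mask, patch=16, dilation=1, hf_keep=0.25, lowres_factor=16):
    """Return (edit tokens, high-frequency tokens, low-resolution image).

    gray: 2-D float array; pixel_mask: 2-D bool array of the same shape.
    All default values are illustrative assumptions.
    """
    # A token counts as edited if any pixel of its patch lies inside the user mask.
    token_mask = block_reduce(pixel_mask, patch, np.any)
    # Keep every token inside the dilated mask.
    edit = dilate(token_mask, dilation)

    # Assumed high-frequency score: intensity variance within each patch.
    score = block_reduce(gray, patch, np.var)
    outside = ~edit
    k = int(round(hf_keep * int(outside.sum())))
    hf = np.zeros(edit.size, dtype=bool)
    if k > 0:
        # Exclude tokens inside the dilated mask, then take the k highest scores.
        masked_scores = np.where(outside, score, -np.inf).ravel()
        hf[np.argpartition(masked_scores, -k)[-k:]] = True
    hf = hf.reshape(edit.shape)

    # Low-frequency branch: average-pool the image by lowres_factor per side.
    lowres = block_reduce(gray, lowres_factor, np.mean)
    return edit, hf, lowres
```

In a real pipeline the low-resolution image would be tokenized as well and its tokens concatenated with the kept edit and high-frequency tokens before the DiT blocks; how positions are encoded for the mixed-resolution sequence is not covered by this sketch.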

Problem

Research questions and friction points this paper is trying to address.

latency

image editing

Diffusion Transformers

token compression

computational efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

HiLo-Token

token compression

Diffusion Transformer