OmniTryOn: Video Try-On Anything at Once!

📅 2026-06-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video virtual try-on methods support only single-garment transfer and rely on external priors such as garment masks, so they struggle to maintain spatiotemporal consistency and visual quality when handling multiple garments at once. This work introduces the Try-On Anything task and presents OmniTryOn, an end-to-end generative framework that operates without external priors. A First Frame Wearable Cache supplies the wearable objects through the initial video frame, enabling multi-item transfer in a single inference pass; a Spatiotemporally Consistent RoPE (STC-RoPE) positional encoding preserves human motion and background dynamics; and a Gradual Try-On (GTO) training strategy progressively improves multi-object synthesis. The work also releases TryAny-Bench, a benchmark comprising a paired video dataset and a tailored evaluation protocol. Experiments on TryAny-Bench show that the approach significantly outperforms both specialized video try-on models and general-purpose video editing baselines.

📝 Abstract

Although video virtual try-on (VVT) has achieved significant progress, existing methods still exhibit two fundamental limitations: first, they are restricted to single-garment transfer, rendering simultaneous multi-object try-on highly impractical; second, their heavy reliance on explicit external priors (e.g., garment masks) inevitably destroys crucial physical dynamics and degrades visual quality. To bridge this gap, this paper proposes the novel Try-On Anything task, which aims to simultaneously transfer diverse wearable objects onto a person in a video in a single inference pass. To support and standardize this paradigm, we introduce TryAny-Bench, a comprehensive benchmark encompassing a paired video dataset alongside a tailored evaluation protocol. Furthermore, we present OmniTryOn, an external-prior-free generative framework designed to tackle this task. Specifically, OmniTryOn employs a First Frame Wearable Cache strategy, which directly provides diverse wearable objects for the generation process through the initial video frame. To maintain consistency, we propose the Spatiotemporally Consistent RoPE (STC-RoPE), which inherently establishes robust spatiotemporal anchors to strictly preserve complex human motions and background dynamics. Optimized by the proposed Gradual Try-On (GTO) training strategy, our model progressively masters robust multi-object synthesis. Extensive experiments on TryAny-Bench demonstrate that OmniTryOn significantly outperforms existing specialized video virtual try-on models and general video editing baselines, establishing a powerful new standard for the Try-On Anything task. Our dataset, code, and models are available at https://github.com/xcltql666/OminTryOn.
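For readers unfamiliar with rotary position embeddings, the following is a minimal sketch of plain axis-wise RoPE over video tokens indexed by (frame, row, column). It is not the paper's STC-RoPE, whose design the abstract does not specify; the function names `rope_angles`, `apply_rope`, and `video_rope`, the equal three-way channel split, and the base of 10000 are all assumptions made for illustration. The sketch only shows the property that makes RoPE useful as a positional anchor: attention scores depend on relative offsets between tokens, not on their absolute positions.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Rotation angles of shape (len(positions), dim // 2) for one axis."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(positions, inv_freq)

def apply_rope(x, angles):
    """Rotate consecutive feature pairs of x (N, dim) by angles (N, dim // 2)."""
    x1, x2 = x[:, 0::2], x[:, 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def video_rope(x, t_idx, h_idx, w_idx):
    """Axis-wise RoPE (illustrative): split channels into three equal chunks
    and rotate each chunk by the token's frame, row, and column index."""
    d = x.shape[1] // 3  # assumes the channel count is divisible by 6
    parts = []
    for k, idx in enumerate((t_idx, h_idx, w_idx)):
        chunk = x[:, k * d:(k + 1) * d]
        parts.append(apply_rope(chunk, rope_angles(idx, d)))
    return np.concatenate(parts, axis=1)

# Usage: 4 tokens with 12 channels, taken from a hypothetical 2x1x2 video grid.
tokens = np.random.default_rng(0).normal(size=(4, 12))
t = np.array([0, 0, 1, 1])
h = np.array([0, 0, 0, 0])
w = np.array([0, 1, 0, 1])
encoded = video_rope(tokens, t, h, w)
```

Because each feature pair is only rotated, token norms are unchanged, and the dot product between an encoded query and an encoded key depends only on the difference of their (frame, row, column) indices.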

Problem

Research questions and friction points this paper is trying to address.

video virtual try-on

multi-object try-on

external priors

physical dynamics

visual quality

Innovation

Methods, ideas, or system contributions that make the work stand out.

Try-On Anything

external-prior-free

Spatiotemporally Consistent RoPE