OmniTryOn: Video Try-On Anything at Once!

Bowen Ping; Changliang Xia; Chengyou Jia; Minnan Luo; Xin Shen; Zhuohang Dang

arxiv: 2606.08514 · v1 · pith:VV7MCWW6new · submitted 2026-06-07 · 💻 cs.CV

Pith reviewed 2026-06-27 18:59 UTC · model grok-4.3

classification 💻 cs.CV

keywords: video virtual try-on · multi-object synthesis · Try-On Anything · TryAny-Bench · STC-RoPE · First Frame Wearable Cache · generative video model · video editing

The pith

OmniTryOn enables simultaneous transfer of diverse wearable objects onto video subjects in one pass without external priors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines a Try-On Anything task that requires transferring multiple wearable objects onto a person across an entire video in a single forward pass. It supplies the paired TryAny-Bench dataset and protocol so that progress on this task can be measured uniformly. The proposed OmniTryOn model avoids garment masks by caching objects directly from the first frame and anchors motion with a modified positional encoding, then trains the model in stages to build multi-object capability. Existing single-garment and mask-dependent methods are shown to be insufficient for this setting. A reader would care because the approach removes a major practical barrier to realistic video editing of clothing and accessories.

Core claim

OmniTryOn is an external-prior-free generative framework that employs the First Frame Wearable Cache strategy to supply diverse wearable objects directly from the initial video frame and the Spatiotemporally Consistent RoPE (STC-RoPE) to establish robust spatiotemporal anchors preserving complex human motions and background dynamics, optimized by the Gradual Try-On (GTO) training strategy to master multi-object synthesis.

What carries the argument

The First Frame Wearable Cache strategy, which supplies objects from the first frame, together with Spatiotemporally Consistent RoPE (STC-RoPE), which maintains spatiotemporal consistency.
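As its name suggests, STC-RoPE is a variant of rotary position embedding (RoPE), but the abstract does not say how the encoding is modified. The sketch below therefore shows only plain RoPE applied separately along the time, height and width axes, to illustrate the relative-position property that a spatiotemporal anchor of this kind would build on. The function names `rope_rotate`, `rope_3d` and `dot`, and the equal three-way split of the feature vector, are assumptions made for illustration, not the authors' implementation.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Standard RoPE: rotate each consecutive feature pair by an angle proportional to `pos`."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even feature dimension")
    out = []
    for i in range(0, d, 2):
        # Pair index is i / 2, so the frequency base^(-2 * (i / 2) / d) equals base^(-i / d).
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def rope_3d(vec, t, h, w):
    """Axis-wise RoPE (assumed layout): one third of the features each for time, height, width."""
    d = len(vec)
    if d % 6 != 0:
        raise ValueError("feature dimension must split into three even-sized chunks")
    k = d // 3
    return (rope_rotate(vec[:k], t)
            + rope_rotate(vec[k:2 * k], h)
            + rope_rotate(vec[2 * k:], w))

def dot(a, b):
    """Plain inner product, standing in for one attention logit."""
    return sum(x * y for x, y in zip(a, b))
```

Because each pair is rotated, the inner product of a rotated query and a rotated key depends only on the offsets between their (t, h, w) positions, not on the absolute positions. Tokens that keep the same relative placement therefore receive the same positional bias in attention, wherever they sit in the clip.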

If this is right

Simultaneous multi-object try-on becomes feasible in a single inference without separate mask preparation.
Physical dynamics remain intact because explicit external priors are not required.
The TryAny-Bench benchmark provides a standardized way to compare methods on this new task.
Gradual Try-On training progressively improves the model's ability to synthesize multiple objects together.
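The abstract says only that Gradual Try-On training lets the model progressively master multi-object synthesis; it gives no stage boundaries or schedules. As a rough illustration of the general idea, the sketch below implements a generic linear curriculum over the number of wearable objects per training sample. The function names, the linear ramp and the cap of four objects are all hypothetical choices, not details from the paper.

```python
import random

def max_objects_for_step(step, total_steps, max_objects=4):
    """Hypothetical linear curriculum: the allowed object count grows from 1 to max_objects."""
    if total_steps <= 0 or max_objects < 1:
        raise ValueError("total_steps must be positive and max_objects at least 1")
    # Clamp training progress to [0, 1] so out-of-range steps stay well defined.
    frac = min(max(step / total_steps, 0.0), 1.0)
    return 1 + int(frac * (max_objects - 1))

def sample_object_count(step, total_steps, max_objects=4, rng=random):
    """Draw how many wearable objects the next training sample should contain."""
    return rng.randint(1, max_objects_for_step(step, total_steps, max_objects))
```

Early steps would see only single-object samples, and samples with more objects become possible as training advances. A staged schedule with discrete phases would be an equally plausible reading of "gradual".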

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same first-frame caching idea could be tested on other video object-insertion problems such as accessory or background element transfer.
The benchmark may push future models to handle longer sequences or higher object counts while keeping the no-prior constraint.
Real-time applications could become viable if the inference cost of the cache and RoPE modifications is further reduced.

Load-bearing premise

That caching objects from the first frame and applying STC-RoPE can deliver diverse items while preserving motion and background dynamics without any external masks or priors.

What would settle it

Test videos containing rapid pose changes or overlapping wearables in which the output shows drifting object identity, broken motion continuity, or visible artifacts relative to the input sequence.
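One simple way to quantify drifting object identity in such a test is to compare a per-frame feature vector for the wearable object against the first frame. The sketch below assumes those features already exist (for example, from an image encoder applied to the object region); the function names and the 0.2 threshold are illustrative assumptions, and none of this is part of the TryAny-Bench protocol.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        raise ValueError("zero-norm feature vector")
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def identity_drift(frame_features):
    """Drift per later frame: 1 - cosine similarity to the first frame's feature."""
    ref = frame_features[0]
    return [1.0 - cosine(ref, f) for f in frame_features[1:]]

def first_drifting_frame(frame_features, threshold=0.2):
    """Index of the first frame whose drift exceeds the (assumed) threshold, else None."""
    for index, drift in enumerate(identity_drift(frame_features), start=1):
        if drift > threshold:
            return index
    return None
```

A drift curve that stays flat under rapid pose changes would support the premise; a curve that climbs over time would point to the identity drift described above.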

Figures

Figures reproduced from arXiv: 2606.08514 by Bowen Ping, Changliang Xia, Chengyou Jia, Minnan Luo, Xin Shen, Zhuohang Dang.

**Figure 1.** OmniTryOn successfully achieves Try-On Anything by simultaneously transferring diverse wearable objects in [PITH_FULL_IMAGE:figures/full_fig_p001_1.png]

**Figure 2.** Visual comparison of inputs and synthesis quality. [PITH_FULL_IMAGE:figures/full_fig_p002_2.png]

**Figure 3.** Overview of the TryAny-Bench data construction pipeline and evaluation protocol. Left: The automated data con [PITH_FULL_IMAGE:figures/full_fig_p003_3.png]

**Figure 4.** Overview of the OmniTryOn framework. To achieve Try-On Anything at once, the encoded wearable object image [PITH_FULL_IMAGE:figures/full_fig_p005_4.png]

**Figure 5.** Comprehensive VQA-based evaluation results. Our [PITH_FULL_IMAGE:figures/full_fig_p007_5.png]

**Figure 6.** Qualitative comparison between OmniTryOn and state-of-the-art baselines. [PITH_FULL_IMAGE:figures/full_fig_p008_6.png]

Original abstract

Although video virtual try-on (VVT) has achieved significant progress, existing methods still exhibit two fundamental limitations: first, they are restricted to single-garment transfer, rendering simultaneous multi-object try-on highly impractical; second, their heavy reliance on explicit external priors (e.g., garment masks) inevitably destroys crucial physical dynamics and degrades visual quality. To bridge this gap, this paper proposes the novel Try-On Anything task, which aims to simultaneously transfer diverse wearable objects onto a person in a video in a single inference pass. To support and standardize this paradigm, we introduce TryAny-Bench, a comprehensive benchmark encompassing a paired video dataset alongside a tailored evaluation protocol. Furthermore, we present OmniTryOn, an external-prior-free generative framework designed to tackle this task. Specifically, OmniTryOn employs a First Frame Wearable Cache strategy, which directly provides diverse wearable objects for the generation process through the initial video frame. To maintain consistency, we propose the Spatiotemporally Consistent RoPE (STC-RoPE), which inherently establishes robust spatiotemporal anchors to strictly preserve complex human motions and background dynamics. Optimized by the proposed Gradual Try-On (GTO) training strategy, our model progressively masters robust multi-object synthesis. Extensive experiments on TryAny-Bench demonstrate that OmniTryOn significantly outperforms existing specialized video virtual try-on models and general video editing baselines, establishing a powerful new standard for the Try-On Anything task. Our dataset, code, and models are available at https://github.com/xcltql666/OminTryOn.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper defines a new Try-On Anything task for multi-object video try-on without masks, releases TryAny-Bench, and describes a prior-free model, but the outperformance claims rest on experiments with no metrics shown.

The main thing to know is that this work shifts video virtual try-on from single-garment, mask-dependent setups to a multi-object version that runs in one pass with no external priors. It defines the Try-On Anything task, introduces the TryAny-Bench benchmark with paired videos and an evaluation protocol, and presents OmniTryOn built around a First Frame Wearable Cache, STC-RoPE for spatiotemporal consistency, and Gradual Try-On training.

What is new is the task framing itself and the decision to drop masks entirely. The cache pulls wearable objects straight from the first frame, which removes the need for separate garment inputs. STC-RoPE adds explicit space-time anchors to the positional encoding, and the gradual training ramps up from simpler to more complex cases. Releasing the dataset, code, and models is a practical step that lets others test the benchmark directly.

The approach lines up with the stated problems in prior work. Avoiding masks should help keep physical dynamics and visual quality intact, and handling multiple objects at once is a clear extension of the single-garment line.

The soft spot is the evidence. The abstract states that OmniTryOn outperforms both specialized video try-on models and general video editing baselines on TryAny-Bench and sets a new standard, yet it gives no numbers, no ablation results, no dataset statistics, and no error bars. Without those details the performance claim cannot be checked, and it is unclear whether the cache plus STC-RoPE combination actually maintains consistency across diverse objects and complex motions. That is the part that needs the full results tables.

This is for people working on generative video models, especially those focused on editing or virtual try-on applications. A reader who wants a new benchmark or ideas for prior-free generation would get direct use from the released materials.

It deserves a serious referee because the task and method are coherent and address documented limitations in the area. The experiments should be reviewed in detail before any stronger claims are accepted.

Referee Report

1 major / 1 minor

Summary. The paper introduces the Try-On Anything task for simultaneous multi-object wearable transfer in a single video inference pass without external priors such as garment masks. It defines TryAny-Bench as a paired video dataset with a tailored evaluation protocol, and proposes OmniTryOn, which uses a First Frame Wearable Cache to supply objects, Spatiotemporally Consistent RoPE (STC-RoPE) for motion and background preservation, and Gradual Try-On (GTO) training. The central claim is that OmniTryOn significantly outperforms specialized video virtual try-on models and general video editing baselines on TryAny-Bench.

Significance. If the results hold, the work is significant for defining a new multi-object video try-on paradigm that avoids external priors known to degrade physical dynamics. The open release of the benchmark dataset, code, and models is a clear strength that supports reproducibility and future work. The approach could set a new standard if the proposed components demonstrably maintain spatiotemporal consistency across diverse objects.

major comments (1)

Abstract: the claim that OmniTryOn 'significantly outperforms' existing models is presented without any quantitative metrics, tables, error bars, dataset statistics, or ablation results, which is load-bearing for the central experimental claim and prevents verification of the data-to-claim link.

minor comments (1)

Abstract, GitHub link: the URL contains a typographical error ('OminTryOn' instead of 'OmniTryOn').

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address the concern below and will make the requested revision to strengthen the presentation of our central claim.

Referee: [—] Abstract: the claim that OmniTryOn 'significantly outperforms' existing models is presented without any quantitative metrics, tables, error bars, dataset statistics, or ablation results, which is load-bearing for the central experimental claim and prevents verification of the data-to-claim link.

Authors: We agree that the abstract's claim would be more verifiable if supported by key quantitative results. While the full manuscript provides detailed tables, error bars, dataset statistics, and ablations in the experiments section, we will revise the abstract to concisely include representative metrics demonstrating outperformance on TryAny-Bench (e.g., improvements over baselines in the primary evaluation metrics). This will make the abstract self-contained without exceeding typical length constraints.

revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper introduces the Try-On Anything task, TryAny-Bench benchmark, and OmniTryOn framework (First Frame Wearable Cache + STC-RoPE + GTO) as new contributions without any self-definitional equations, fitted inputs renamed as predictions, or load-bearing self-citations. No derivation chain reduces to its own inputs by construction; performance claims rest on external comparison to prior specialized models on the newly defined benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are described.


[1] [1]

Yuxuan Bian, Xin Chen, Zenan Li, Tiancheng Zhi, Shen Sang, Linjie Luo, and Qiang Xu. 2025. Video-As-Prompt: Unified Semantic Control for Video Genera- tion.arXiv preprint arXiv:2510.20888(2025)

work page arXiv 2025

[2] [2]

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. 2023. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. 2023. Align your latents: High-resolution video synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 22563–22575. 8

2023

[4] [4]

Ryan Burgert, Yuancheng Xu, Wenqi Xian, Oliver Pilarski, Pascal Clausen, Ming- ming He, Li Ma, Yitong Deng, Lingxiao Li, Mohsen Mousavi, et al. 2025. Go-with- the-flow: Motion-controllable video diffusion models using real-time warped noise. InProceedings of the Computer Vision and Pattern Recognition Conference. 13–23

2025

[5] [5]

Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? a new model and the kinetics dataset. Inproceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6299–6308

2017

[6] [6]

Jingxi Chen, Zongxia Li, Zhichao Liu, Guangyao Shi, Xiyang Wu, Fuxiao Liu, Cornelia Fermuller, Brandon Y Feng, and Yiannis Aloimonos. 2025. First Frame Is the Place to Go for Video Content Customization.arXiv preprint arXiv:2511.15700 (2025)

work page arXiv 2025

[7] [7]

Yiyang Chen, Xuanhua He, Xiujun Ma, and Yue Ma. 2025. Contextflow: Training- free video object editing via adaptive context enrichment.arXiv preprint arXiv:2509.17818(2025)

work page arXiv 2025

[8] [8]

Zheng Chong, Wenqing Zhang, Shiyue Zhang, Jun Zheng, Xiao Dong, Haoxiang Li, Yiling Wu, Dongmei Jiang, and Xiaodan Liang. 2025. Catv2ton: Taming diffu- sion transformers for vision-based virtual try-on with temporal concatenation. arXiv preprint arXiv:2501.11325(2025)

work page arXiv 2025

[9] [9]

Haoye Dong, Xiaodan Liang, Xiaohui Shen, Bowen Wu, Bing-Cheng Chen, and Jian Yin. 2019. Fw-gan: Flow-navigated warping gan for video virtual try-on. In Proceedings of the IEEE/CVF international conference on computer vision. 1161– 1170

2019

[10] [10]

Zixun Fang, Wei Zhai, Aimin Su, Hongliang Song, Kai Zhu, Mao Wang, Yu Chen, Zhiheng Liu, Yang Cao, and Zheng-Jun Zha. 2024. Vivid: Video virtual try-on using diffusion models.arXiv preprint arXiv:2405.11794(2024)

work page arXiv 2024

[11] [11]

Qihang Ge, Wei Sun, Yu Zhang, Yunhao Li, Zhongpeng Ji, Fengyu Sun, Shangling Jui, Xiongkuo Min, and Guangtao Zhai. 2025. Lmm-vqa: Advancing video quality assessment with large multimodal models.IEEE Transactions on Circuits and Systems for Video Technology(2025)

2025

[12] [12]

Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. 2024. Factorizing text-to-video generation by explicit image conditioning. InEuropean Conference on Computer Vision. Springer, 205–224

2024

[13] [13]

Google DeepMind. 2026. Gemini 3 Flash Preview. https://ai.google.dev/gemini- api/docs/models/gemini-3-flash-preview. Accessed: 2026-03-22

2026

[14] [14]

Google DeepMind. 2026. Gemini 3.1 Flash Image (Nano Banana 2). https:// aistudio.google.com/models/gemini-3-1-flash-image. Accessed: 2026-03-13

2026

[15] [15]

Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. 2023. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning.arXiv preprint arXiv:2307.04725(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[16] [16]

Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. 2024. Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[17] [17]

Qingdong He, Xueqin Chen, Yanjie Pan, Peng Tang, Pengcheng Xu, Zhenye Gan, Chengjie Wang, Xiaobin Hu, Jiangning Zhang, and Yabiao Wang. 2025. The devil is in the details: Enhancing Video Virtual Try-On via Keyframe-Driven Details Injection.arXiv preprint arXiv:2512.20340(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[18]

Xuanhua He, Quande Liu, Zixuan Ye, Weicai Ye, Qiulin Wang, Xintao Wang, Qifeng Chen, Pengfei Wan, Di Zhang, and Kun Gai. 2025. Fulldit2: Efficient in-context conditioning for video diffusion transformers. arXiv preprint arXiv:2506.04213 (2025)

[19]

Lianghua Huang, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Huanzhang Dou, Chen Liang, Yutong Feng, Yu Liu, and Jingren Zhou. 2024. In-context lora for diffusion transformers. arXiv preprint arXiv:2410.23775 (2024)

[20]

Yuyang Huang, Yabo Chen, Li Ding, Xiaopeng Zhang, Wenrui Dai, Junni Zou, Hongkai Xiong, and Qi Tian. 2025. Im-zero: Instance-level motion controllable video generation in a zero-shot manner. In Proceedings of the Computer Vision and Pattern Recognition Conference. 7265–7275

[21]

Ziheng Jia, Zicheng Zhang, Jiaying Qian, Haoning Wu, Wei Sun, Chunyi Li, Xiaohong Liu, Weisi Lin, Guangtao Zhai, and Xiongkuo Min. 2025. Vqa2: visual question answering for video quality assessment. In Proceedings of the 33rd ACM International Conference on Multimedia. 6751–6760

[22]

Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. 2025. Vace: All-in-one video creation and editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 17191–17202

[24]

Johanna Karras, Aleksander Holynski, Ting-Chun Wang, and Ira Kemelmacher-Shlizerman. 2023. Dreampose: Fashion video synthesis with stable diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 22680–22690

[25]

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. 2024. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603 (2024)

[26]

Guojun Lei, Chi Wang, Rong Zhang, Yikai Wang, Hong Li, and Weiwei Xu. 2025. Animateanything: Consistent and controllable animation for video generation. In Proceedings of the Computer Vision and Pattern Recognition Conference. 27946–27956

[27]

Guangyuan Li, Siming Zheng, Hao Zhang, Jinwei Chen, Junsheng Luan, Binkai Ou, Lei Zhao, Bo Li, and Peng-Tao Jiang. 2025. MagicTryOn: Harnessing Diffusion Transformer for Garment-Preserving Video Virtual Try-on. arXiv preprint arXiv:2505.21325 (2025)

[28]

Peike Li, Yunqiu Xu, Yunchao Wei, and Yi Yang. 2020. Self-correction for human parsing. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 6 (2020), 3260–3271

[29]

Siqi Li, Zhengkai Jiang, Jiawei Zhou, Zhihong Liu, Xiaowei Chi, and Haoqian Wang. 2025. Realvvt: Towards photorealistic video virtual try-on via spatio-temporal consistency. arXiv preprint arXiv:2501.08682 (2025)

[30]

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. 2022. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022)

[32]

Xingchao Liu, Xiwen Zhang, Jianzhu Ma, Jian Peng, et al. 2023. Instaflow: One step is enough for high-quality diffusion-based text-to-image generation. In The Twelfth International Conference on Learning Representations

[33]

Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)

[34]

Zhenglin Pan. 2025. AniLines - Anime Lineart Extractor. https://github.com/zhenglinpan/AniLines-Anime-Lineart-Extractor

[35]

William Peebles and Saining Xie. 2023. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision. 4195–4205

[36]

Yanyun Pu, Kehan Li, Zeyi Huang, Zhijie Zhong, and Kaixiang Yang. 2025. MVQA-68K: A Multi-dimensional and Causally-annotated Dataset with Quality Interpretability for Video Assessment. In Proceedings of the 33rd ACM International Conference on Multimedia. 11189–11198

[37]

Zhefan Rao, Haoxuan Che, Ziwen Hu, Bin Zou, Yaofang Liu, Xuanhua He, Chong-Hou Choi, Yuyang He, Haoyu Chen, Jingran Su, Yanheng Li, Meng Chu, Chenyang Lei, Guanhua Zhao, Zhaoqing Li, Xichen Zhang, Anping Li, Lin Liu, Dandan Tu, and Rui Liu. 2026. Capybara: A Unified Visual Creation Model

[38]

Yixuan Ren, Yang Zhou, Jimei Yang, Jing Shi, Difan Liu, Feng Liu, Mingi Kwon, and Abhinav Shrivastava. 2024. Customize-a-video: One-shot motion customization of text-to-video diffusion models. In European Conference on Computer Vision. Springer, 332–349

[39]

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention. Springer, 234–241

[40]

Henry Ruhs et al. 2023. FaceFusion: Next generation face swapper and enhancer. https://github.com/facefusion/facefusion

[41]

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. 2024. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing 568 (2024), 127063

[42]

Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, and Xinchao Wang. 2025. Ominicontrol: Minimal and universal control for diffusion transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 14940–14950

[44]

Qwen Team. 2025. Qwen3 Technical Report. arXiv:2505.09388 [cs.CL] https://arxiv.org/abs/2505.09388

[45]

Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. 2018. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717 (2018)

[46]

Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, et al. 2025. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)

[47]

Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. 2004. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13, 4 (2004), 600–612

[48]

Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos. 2025. Video models are zero-shot learners and reasoners. arXiv preprint arXiv:2509.20328 (2025)

[49]

Shaojin Wu, Mengqi Huang, Wenxu Wu, Yufeng Cheng, Fei Ding, and Qian He. 2025. Less-to-more generalization: Unlocking more controllability by in-context generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 18682–18692

[51]

Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. 2019. Detectron2. https://github.com/facebookresearch/detectron2

[53]

Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. 2017. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1492–1500

[54]

Zhengze Xu, Mengting Chen, Zhao Wang, Linyu Xing, Zhonghua Zhai, Nong Sang, Jinsong Lan, Shuai Xiao, and Changxin Gao. 2024. Tunnel try-on: Excavating spatial-temporal tunnels for high-quality virtual try-on in videos. In Proceedings of the 32nd ACM International Conference on Multimedia. 3199–3208

[55]

Xiangpeng Yang, Ji Xie, Yiyuan Yang, Yan Huang, Min Xu, and Qiang Wu. 2025. Unified Video Editing with Temporal Reasoner. arXiv preprint arXiv:2512.07469 (2025)

[56]

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. 2024. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072 (2024)

[57]

Zixuan Ye, Xuanhua He, Quande Liu, Qiulin Wang, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Qifeng Chen, and Wenhan Luo. [n. d.]. Unified In-Context Video Editing. In The Fourteenth International Conference on Learning Representations

[58]

Jianhao Zeng, Yancheng Bai, Ruidong Chen, Xuanpu Zhang, Lei Sun, Dongyang Jin, Ryan Xu, Nannan Zhang, Dan Song, and Xiangxiang Chu. 2025. Eevee: Towards Close-up High-resolution Video-based Virtual Try-on. arXiv preprint arXiv:2511.18957 (2025)

[59]

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. 2018. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition. 586–595

[61]

Zhongwei Zhang, Fuchen Long, Wei Li, Zhaofan Qiu, Wu Liu, Ting Yao, and Tao Mei. 2025. Region-Constraint In-Context Generation for Instructional Video Editing. arXiv preprint arXiv:2512.17650 (2025)

[62]

Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. 2024. Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404 (2024)

[63]

Tongchun Zuo, Zaiyu Huang, Shuliang Ning, Ente Lin, Chao Liang, Zerong Zheng, Jianwen Jiang, Yuan Zhang, Mingyuan Gao, and Xin Dong. 2025. Dreamvvt: Mastering realistic video virtual try-on in the wild via a stage-wise diffusion transformer framework. arXiv preprint arXiv:2508.02807 (2025)