FitVTON: Fit-aware Virtual Try-On via Body-Garment Size Control

Ao Shen; Chenhang He; Lei Zhang; Yiqun Ning

arxiv: 2606.12012 · v1 · pith:PJMXRTVLnew · submitted 2026-06-10 · 💻 cs.CV

FitVTON: Fit-aware Virtual Try-On via Body-Garment Size Control

Yiqun Ning , Ao Shen , Chenhang He , Lei Zhang This is my paper

Pith reviewed 2026-06-27 09:51 UTC · model grok-4.3

classification 💻 cs.CV

keywords virtual try-ondiffusion modelsgarment fittingbody shape controlimage synthesismask predictiontexture rectification

0 comments

The pith

Encoding body-garment size in text prompts allows virtual try-on to match real fitting across body shapes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current diffusion-based virtual try-on methods treat the task as 2D inpainting and often produce images that look plausible but ignore how garments actually sit on different bodies. FitVTON adds explicit size control by encoding body and garment measurements into structured text prompts and trains on triplets generated from a parameterized garment model. Auxiliary prediction heads generate masks for the garment and exposed body, while a texture rectification stage cleans up appearance artifacts from the simulated data. Evaluation uses a new real-world dataset called FittingEffect3K scored with vision-language models.

Core claim

FitVTON encodes garment-body size through structured text prompts and learns from simulated try-on triplets from parameterized garment model. Auxiliary heads predict masks for garment and exposed body, and a texture rectification stage improves realistic appearance from simulated data, resulting in authentic fitting fidelity with significant sizing accuracy and shape preservation.

What carries the argument

Structured text prompts that encode body-garment size relations, trained on simulated try-on triplets with auxiliary mask heads and texture rectification.

If this is right

Generates try-on images with higher sizing accuracy on varied body shapes than inpainting baselines.
Preserves original garment silhouettes more faithfully while keeping competitive visual quality.
Supports a VLM-based scoring protocol and new dataset for quantitative fit evaluation.
Allows fit control without requiring explicit 3D garment deformation at inference time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Text-based size encoding may extend to controlling other physical properties such as drape or stretch in generated clothing images.
The simulation-to-real pipeline could reduce the need for large paired real-world try-on datasets in future systems.
Direct integration of user-provided measurements into prompts might enable personalized fitting without additional model retraining.

Load-bearing premise

That simulated try-on triplets generated from a parameterized garment model, combined with VLM-based scoring on the new FittingEffect3K dataset, provide a reliable proxy for real-world garment fit across diverse bodies.

What would settle it

Human raters viewing side-by-side photos of the same garments actually worn by people of different sizes consistently prefer outputs from prior methods over FitVTON on fit realism.

Figures

Figures reproduced from arXiv: 2606.12012 by Ao Shen, Chenhang He, Lei Zhang, Yiqun Ning.

**Figure 2.** Figure 2: Overview of FitVTON. (Top) Given a person image, a reference garment, and a GarmentBody Size prompt, FitVTON performs fit-aware virtual try-on with FLUX.1 Kontext backbone, modality-specific LoRA adapters, and dual-branch garment/body mask supervision. (Bottom) Twostage training strategy: Stage I learns prompt-driven fit priors from synthetic simulations, while Stage II rectifies real-image textures by u… view at source ↗

**Figure 3.** Figure 3: Left: Samples of the FittingEffect3K dataset, a real-world benchmark for evaluating fitaware virtual try-on across diverse body shapes, poses, and garments. Right: Our fit-oriented VLM evaluation protocol. Given a real try-on reference and a generated result, the evaluator scores four fit dimensions (GB, T/L, SC, LF; 1–5) with brief evidence and reports the averaged fit score. 4 Experiment We leave all th… view at source ↗

**Figure 4.** Figure 4: Qualitative results on FittingEffect3K. ground-truth reference and generated results from 6 methods are randomly shuffled, and each participant selects the result that best matches the reference in terms of garment fit. Specifically, we recruit 20 participants to evaluate 100 comparison cases, resulting in 2000 selections. For image quality, we use FID [15] and KID [2] to measure distribution-level simila… view at source ↗

**Figure 5.** Figure 5: Qualitative comparison on public benchmarks. Left: VITON-HD. Right: DressCode. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Ablation results of the dual-branch mask supervision (left) and texture rectification (right). [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

**Figure 7.** Figure 7: Wearing-style control. The updated simulation engine supports one-piece garments, tucked-in upper–lower outfits, and untucked upper–lower outfits, covering common outfit configurations absent from a one-piece-only GarmentCode setting. Wearing-style control for upper–lower outfits. The original GarmentCode setting can represent an outfit as a single stitched garment, which is sufficient for one-piece drapi… view at source ↗

**Figure 8.** Figure 8: Effect of wrist/ankle stop constraints. Without limb-end stop constraints, sleeve cuffs or pant hems may slide past the hand or foot. With the proposed constraints, garment ends stay near the wrist or ankle while preserving natural local folds. lower waistband region; for untucked outfits, the upper garment is simulated outside the waistband so that it drapes over the lower garment [PITH_FULL_IMAGE:figure… view at source ↗

**Figure 9.** Figure 9: Representative female body-shape prototypes. We use 16 SMPL-X body shapes that span four height categories and four body-size categories, providing a compact set of representative bodies for fitting-aware garment simulation [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗

**Figure 10.** Figure 10: Supplementary qualitative results of prompt-based fitting tests on upper-body (left) [PITH_FULL_IMAGE:figures/full_fig_p018_10.png] view at source ↗

**Figure 11.** Figure 11: Supplementary qualitative ablation results. (a) Dual-branch mask supervision improves [PITH_FULL_IMAGE:figures/full_fig_p019_11.png] view at source ↗

**Figure 12.** Figure 12: Additional fit-oriented comparison with baseline methods. We provide more virtual try-on examples comparing FitVTON with representative baseline methods. These examples complement the main-paper results by showing garment appearance preservation, body-structure consistency, and fitting behavior across additional samples. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_12.png] view at source ↗

**Figure 13.** Figure 13: Additional qualitative comparison with baseline methods. We provide more virtual try-on examples comparing FitVTON with representative baseline methods. These examples complement the main-paper results by showing garment appearance preservation, body-structure consistency, and texture realism. 21 [PITH_FULL_IMAGE:figures/full_fig_p021_13.png] view at source ↗

read the original abstract

While diffusion-based virtual try-on has achieved impressive visual realism, most methods treat the task as 2D inpainting, prioritizing texture preservation over physical plausibility. Consequently, they often produce plausible-looking images that fail to reflect authentic garment fit across diverse body shapes. We present FitVTON, a Fit-aware virtual try-on model on different bodies in the wild. FitVTON encodes garment-body size through structured text prompts, and learn from simulated try-on triplets from parameterized garment model. To improve the fitting effects over garment silhouettes, we introduce two auxiliary head to predict the masks for both the garment and the exposed body. We further introduce a texture rectification stage to improve realistic appearance from simulated data. To evaluate the fitting fidelity, we curate a real-world dataset, FittingEffect3K, combining VLM-based scoring protocol. Both subjective and quantitive experiments show that FitVTON demonstrate authentic fitting fidelity, with significant sizing accuracy and shape preservation over state-of-the-art methods while maintaining competitive image quality. Project Page: https://zenoning.github.io/FitVTON/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FitVTON adds text-prompt size encoding and auxiliary mask heads to diffusion try-on, but its fit claims depend on unvalidated simulated triplets and VLM scoring on a new dataset.

read the letter

The main thing here is that FitVTON tries to move virtual try-on past pure texture inpainting by encoding body-garment size differences through structured text prompts, training on simulated triplets from a parameterized garment model, and adding two auxiliary heads that predict garment and exposed-body masks. A final texture rectification stage cleans up appearance artifacts from the simulation. That combination is the concrete addition over standard diffusion try-on pipelines.

The auxiliary mask heads and the explicit size prompt look like practical engineering choices that could help preserve shape and silhouette better than inpainting alone. Training from simulated data is a reasonable workaround when real paired try-on data is scarce, and curating FittingEffect3K for evaluation shows they are thinking about real-world fit rather than just visual plausibility.

The soft spot is the evaluation. The paper claims significant sizing accuracy and shape preservation, yet the stress-test concern holds: there is no reported check that the parameterized simulation reproduces actual drape, tension, or occlusion across body shapes, and the VLM scoring protocol on FittingEffect3K is not compared to expert measurements or wear tests. Without those anchors, the quantitative gains over SOTA rest on proxies whose reliability is not demonstrated. If the full paper has ablations and error breakdowns that address this, the contribution strengthens; otherwise the central claim stays tentative.

This paper is for researchers working on fashion-specific generative models and e-commerce try-on systems. A reader already building diffusion pipelines for clothing would get usable ideas from the prompt structure and auxiliary heads. It deserves peer review because the problem is real and the proposed mechanisms are straightforward to test, even if the current evidence for physical fidelity needs tightening.

Referee Report

2 major / 2 minor

Summary. The paper introduces FitVTON, a diffusion-based virtual try-on model that encodes body-garment size relationships via structured text prompts, trains on simulated try-on triplets generated from a parameterized garment model, adds two auxiliary heads to predict garment and exposed-body masks, and applies a texture rectification stage. It curates the real-world FittingEffect3K dataset and introduces a VLM-based scoring protocol; both subjective and quantitative experiments are claimed to show superior fitting fidelity, sizing accuracy, and shape preservation relative to prior SOTA methods while preserving competitive image quality.

Significance. If the simulation-to-real gap and VLM scoring protocol can be shown to align with physical garment fit, the size-control mechanism and auxiliary mask heads would address a recognized limitation of 2D inpainting try-on methods. The work would then be relevant to e-commerce applications requiring plausible fit across body shapes. However, the significance is currently limited by the absence of any reported validation that the simulated triplets reproduce drape, tension, or occlusion behavior and that VLM scores correlate with actual wear measurements.

major comments (2)

[Evaluation] Evaluation section: the claim of 'significant sizing accuracy and authentic fitting fidelity' rests on VLM-based scoring of the newly curated FittingEffect3K dataset, yet no comparison is reported between VLM scores and either expert human fit ratings or direct physical measurements (e.g., garment circumference vs. body circumference at key landmarks). Without such calibration, the quantitative gains cannot be taken as evidence that the method improves physical plausibility.
[Method] Method section (simulation pipeline): the training data are generated from a parameterized garment model, but the manuscript provides neither equations describing the garment parameterization nor any quantitative comparison (e.g., drape error, tension maps) between the simulated triplets and real photographs of the same garments on the same bodies. This gap directly undermines the central assertion that the learned model captures authentic fit.

minor comments (2)

[Abstract] Abstract: 'quantitive' is a typo for 'quantitative'; 'demonstrate' should agree with the singular subject 'FitVTON'.
[Abstract] The abstract states that both subjective and quantitative experiments were performed, but the manuscript does not list the exact metrics, baselines, or number of participants for the subjective study.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, indicating planned revisions where appropriate. Our responses focus on clarifying existing elements of the work and acknowledging genuine limitations without overstating the current evidence.

read point-by-point responses

Referee: [Evaluation] Evaluation section: the claim of 'significant sizing accuracy and authentic fitting fidelity' rests on VLM-based scoring of the newly curated FittingEffect3K dataset, yet no comparison is reported between VLM scores and either expert human fit ratings or direct physical measurements (e.g., garment circumference vs. body circumference at key landmarks). Without such calibration, the quantitative gains cannot be taken as evidence that the method improves physical plausibility.

Authors: We acknowledge that the manuscript does not include a direct calibration study correlating VLM scores with expert human fit ratings or physical measurements such as circumference comparisons. The quantitative results rely on the VLM protocol, which is supplemented by subjective human evaluations reported in the experiments. We agree this calibration would provide stronger support for physical plausibility claims. In the revision we will expand the evaluation section to include more explicit alignment between VLM scores and the subjective study outcomes, and we will add a dedicated limitations paragraph noting the absence of physical measurement validation. revision: partial
Referee: [Method] Method section (simulation pipeline): the training data are generated from a parameterized garment model, but the manuscript provides neither equations describing the garment parameterization nor any quantitative comparison (e.g., drape error, tension maps) between the simulated triplets and real photographs of the same garments on the same bodies. This gap directly undermines the central assertion that the learned model captures authentic fit.

Authors: We agree that the current manuscript lacks explicit equations for the garment parameterization and does not report quantitative metrics such as drape error or tension maps comparing simulations to real photographs. The parameterization is based on standard techniques but was not detailed in the main text. We will add the relevant equations and a more complete description of the simulation process to the revised method section. For the quantitative validation gap, we relied on visual fidelity and downstream performance on real data rather than direct simulation-to-real error metrics; we will note this choice and its implications in the revision. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper describes a diffusion-based virtual try-on architecture that encodes size via text prompts, trains on triplets from an external parameterized garment simulator, adds auxiliary mask heads, and applies a texture rectification stage. Evaluation uses a separately curated real-world dataset (FittingEffect3K) with VLM scoring. None of the load-bearing steps reduce by definition or self-citation to the target outputs; the simulation source and dataset curation are presented as independent inputs rather than tautological re-labelings of the model's predictions. No self-definitional equations, fitted-input-as-prediction patterns, or uniqueness theorems imported from the authors' prior work appear in the derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no equations, parameters, or background assumptions can be extracted.

pith-pipeline@v0.9.1-grok · 5722 in / 1001 out tokens · 16855 ms · 2026-06-27T09:51:56.425188+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

45 extracted references · 10 canonical work pages · 3 internal anchors

[1]

Chatgarment: Garment estimation, generation and editing via large language models

Siyuan Bian, Chenghao Xu, Yuliang Xiu, Artur Grigorev, Zhen Liu, Cewu Lu, Michael J Black, and Yao Feng. Chatgarment: Garment estimation, generation and editing via large language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 2924–2934, 2025

2025
[2]

Demystifying MMD GANs

Mikołaj Bi ´nkowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying mmd gans.arXiv preprint arXiv:1801.01401, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[3]

Flux.https://github.com/black-forest-labs/flux, 2024

Black Forest Labs. Flux.https://github.com/black-forest-labs/flux, 2024. Ac- cessed: 2025-01

2024
[4]

Viton-hd: High-resolution virtual try-on via misalignment-aware normalization

Seunghwan Choi, Sunghyun Park, Minsoo Lee, and Jaegul Choo. Viton-hd: High-resolution virtual try-on via misalignment-aware normalization. InProceedings of the IEEE/CVF confer- ence on computer vision and pattern recognition, pages 14131–14140, 2021

2021
[5]

Improving diffusion models for authentic virtual try-on in the wild

Yisol Choi, Sangkyung Kwak, Kyungmin Lee, Hyungwon Choi, and Jinwoo Shin. Improving diffusion models for authentic virtual try-on in the wild. InEuropean Conference on Computer Vision, pages 206–235. Springer, 2024

2024
[6]

Catvton: Concatenation is all you need for virtual try-on with diffusion models.arXiv preprint arXiv:2407.15886, 2024

Zheng Chong, Xiao Dong, Haoxiang Li, Shiyue Zhang, Wenqing Zhang, Xujie Zhang, Han- qing Zhao, Dongmei Jiang, and Xiaodan Liang. Catvton: Concatenation is all you need for virtual try-on with diffusion models.arXiv preprint arXiv:2407.15886, 2024

work page arXiv 2024
[7]

Accurate 3d body shape regression using metric and semantic attributes

Vasileios Choutas, Lea Müller, Chun-Hao P Huang, Siyu Tang, Dimitrios Tzionas, and Michael J Black. Accurate 3d body shape regression using metric and semantic attributes. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2718–2728, 2022

2022
[8]

Clo3d.https://clo3d.com/en/, 2022

CLO3D. Clo3d.https://clo3d.com/en/, 2022

2022
[9]

Omnitry: Virtual try-on anything without masks.arXiv preprint arXiv:2508.13632, 2025

Yutong Feng, Linlin Zhang, Hengyuan Cao, Yiming Chen, Xiaoduan Feng, Jian Cao, Yux- iong Wu, and Bin Wang. Omnitry: Virtual try-on anything without masks.arXiv preprint arXiv:2508.13632, 2025

work page arXiv 2025
[10]

Deepfashion2: A versatile benchmark for detection, pose estimation, segmentation and re-identification of clothing images

Yuying Ge, Ruimao Zhang, Xiaogang Wang, Xiaoou Tang, and Ping Luo. Deepfashion2: A versatile benchmark for detection, pose estimation, segmentation and re-identification of clothing images. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5337–5345, 2019

2019
[11]

Parser-free virtual try-on via distilling appearance flows

Yuying Ge, Yibing Song, Ruimao Zhang, Chongjian Ge, Wei Liu, and Ping Luo. Parser-free virtual try-on via distilling appearance flows. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8485–8493, 2021

2021
[12]

Nano banana pro (gemini 3).https://blog.google, 2025

Google. Nano banana pro (gemini 3).https://blog.google, 2025. Large Language and Image Model

2025
[13]

Any2anytryon: Leveraging adaptive position embeddings for versatile virtual clothing tasks

Hailong Guo, Bohan Zeng, Yiren Song, Wentao Zhang, Jiaming Liu, and Chuang Zhang. Any2anytryon: Leveraging adaptive position embeddings for versatile virtual clothing tasks. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 19085– 19096, 2025

2025
[14]

Viton: An image-based virtual try-on network

Xintong Han, Zuxuan Wu, Zhe Wu, Ruichi Yu, and Larry S Davis. Viton: An image-based virtual try-on network. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 7543–7552, 2018

2018
[15]

Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017

2017
[16]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022. 10

work page internal anchor Pith review Pith/arXiv arXiv 2022
[17]

Fitdit: Advancing the authentic garment details for high-fidelity virtual try-on.arXiv preprint arXiv:2411.10499, 2024

Boyuan Jiang, Xiaobin Hu, Donghao Luo, Qingdong He, Chengming Xu, Jinlong Peng, Jiangn- ing Zhang, Chengjie Wang, Yunsheng Wu, and Yanwei Fu. Fitdit: Advancing the authentic garment details for high-fidelity virtual try-on.arXiv preprint arXiv:2411.10499, 2024

work page arXiv 2024
[18]

Stableviton: Learn- ing semantic correspondence with latent diffusion model for virtual try-on

Jeongho Kim, Guojung Gu, Minho Park, Sunghyun Park, and Jaegul Choo. Stableviton: Learn- ing semantic correspondence with latent diffusion model for virtual try-on. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8176–8185, 2024

2024
[19]

Promptdresser: Improving the quality and controllability of virtual try-on via generative textual prompt and prompt-aware mask

Jeongho Kim, Hoiyeong Jin, Sunghyun Park, and Jaegul Choo. Promptdresser: Improving the quality and controllability of virtual try-on via generative textual prompt and prompt-aware mask. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 16026–16036, 2025

2025
[20]

Garmentcode: Programming parametric sewing patterns.ACM Transactions on Graphics (TOG), 42(6):1–15, 2023

Maria Korosteleva and Olga Sorkine-Hornung. Garmentcode: Programming parametric sewing patterns.ACM Transactions on Graphics (TOG), 42(6):1–15, 2023

2023
[21]

Garmentcodedata: A dataset of 3d made-to-measure garments with sewing patterns

Maria Korosteleva, Timur Levent Kesdogan, Fabian Kemper, Stephan Wenninger, Jasmin Koller, Yuhan Zhang, Mario Botsch, and Olga Sorkine-Hornung. Garmentcodedata: A dataset of 3d made-to-measure garments with sewing patterns. InEuropean Conference on Computer Vision, pages 110–127. Springer, 2024

2024
[22]

FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas Müller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, and Luke Smith. Flux.1 kontext: Flow matching for in-context image ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[23]

Tryongan: Body-aware try-on via layered interpolation.ACM Transactions on Graphics (TOG), 40(4): 1–10, 2021

Kathleen M Lewis, Srivatsan Varadharajan, and Ira Kemelmacher-Shlizerman. Tryongan: Body-aware try-on via layered interpolation.ACM Transactions on Graphics (TOG), 40(4): 1–10, 2021

2021
[24]

Warpdiffusion: Efficient diffusion model for high-fidelity virtual try-on.arXiv preprint arXiv:2312.03667, 2023

Xiu Li, Michael Kampffmeyer, Xin Dong, Zhenyu Xie, Feida Zhu, Haoye Dong, Xiaodan Liang, et al. Warpdiffusion: Efficient diffusion model for high-fidelity virtual try-on.arXiv preprint arXiv:2312.03667, 2023

work page arXiv 2023
[25]

Anyfit: Controllable virtual try-on for any combination of attire across any scenario.Advances in Neural Information Processing Systems, 37:83164–83196, 2024

Yuhan Li, Hao Zhou, Wenxiang Shang, Ran Lin, Xuanhong Chen, and Bingbing Ni. Anyfit: Controllable virtual try-on for any combination of attire across any scenario.Advances in Neural Information Processing Systems, 37:83164–83196, 2024

2024
[26]

Smpl: A skinned multi-person linear model

Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. Smpl: A skinned multi-person linear model. InSeminal Graphics Papers: Pushing the Bound- aries, Volume 2, pages 851–866. 2023

2023
[27]

Warp: A high-performance python framework for gpu simulation and graphics

Miles Macklin. Warp: A high-performance python framework for gpu simulation and graphics. https://github.com/nvidia/warp, 2022. NVIDIA GPU Technology Conference (GTC)

2022
[28]

Dress code: High-resolution multi-category virtual try-on

Davide Morelli, Matteo Fincato, Marcella Cornia, Federico Landi, Fabio Cesari, and Rita Cucchiara. Dress code: High-resolution multi-category virtual try-on. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2231–2235, 2022

2022
[29]

Ladi-vton: Latent diffusion textual-inversion enhanced virtual try-on

Davide Morelli, Alberto Baldrati, Giuseppe Cartella, Marcella Cornia, Marco Bertini, and Rita Cucchiara. Ladi-vton: Latent diffusion textual-inversion enhanced virtual try-on. In Proceedings of the 31st ACM international conference on multimedia, pages 8580–8589, 2023

2023
[30]

Expressive body capture: 3d hands, face, and body from a single image

Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed AA Osman, Dimitrios Tzionas, and Michael J Black. Expressive body capture: 3d hands, face, and body from a single image. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10975–10985, 2019

2019
[31]

Style3d.https://www.style3d.com, 2022

Style3D. Style3d.https://www.style3d.com, 2022. 11

2022
[32]

Gen- eralised dice overlap as a deep learning loss function for highly unbalanced segmentations

Carole H Sudre, Wenqi Li, Tom Vercauteren, Sebastien Ourselin, and M Jorge Cardoso. Gen- eralised dice overlap as a deep learning loss function for highly unbalanced segmentations. InDeep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pages 240–248. Springer, 2017

2017
[33]

Incorporating visual corre- spondence into diffusion model for virtual try-on.arXiv preprint arXiv:2505.16977, 2025

Siqi Wan, Jingwen Chen, Yingwei Pan, Ting Yao, and Tao Mei. Incorporating visual corre- spondence into diffusion model for virtual try-on.arXiv preprint arXiv:2505.16977, 2025

work page arXiv 2025
[34]

Jco-mvton: Jointly controllable multi-modal diffusion transformer for mask-free virtual try-on

Aowen Wang, Wei Li, Hao Luo, Mengxing Ao, Chenyu Zhu, Xinyang Li, and Fan Wang. Jco-mvton: Jointly controllable multi-modal diffusion transformer for mask-free virtual try-on. arXiv preprint arXiv:2508.17614, 2025

work page arXiv 2025
[35]

To- ward characteristic-preserving image-based virtual try-on network

Bochao Wang, Huabin Zheng, Xiaodan Liang, Yimin Chen, Liang Lin, and Meng Yang. To- ward characteristic-preserving image-based virtual try-on network. InProceedings of the Eu- ropean conference on computer vision (ECCV), pages 589–604, 2018

2018
[36]

Gp-vton: Towards general purpose virtual try-on via collaborative local-flow global-parsing learning

Zhenyu Xie, Zaiyu Huang, Xin Dong, Fuwei Zhao, Haoye Dong, Xijin Zhang, Feida Zhu, and Xiaodan Liang. Gp-vton: Towards general purpose virtual try-on via collaborative local-flow global-parsing learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 23550–23559, 2023

2023
[37]

Ootdiffusion: Outfitting fusion based latent diffusion for controllable virtual try-on

Yuhao Xu, Tao Gu, Weifeng Chen, and Arlene Chen. Ootdiffusion: Outfitting fusion based latent diffusion for controllable virtual try-on. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 8996–9004, 2025

2025
[38]

Towards photo-realistic virtual try-on by adaptively generating-preserving image content

Han Yang, Ruimao Zhang, Xiaobao Guo, Wei Liu, Wangmeng Zuo, and Ping Luo. Towards photo-realistic virtual try-on by adaptively generating-preserving image content. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7850– 7859, 2020

2020
[39]

Texture- preserving diffusion models for high-fidelity virtual try-on

Xu Yang, Changxing Ding, Zhibin Hong, Junhao Huang, Jin Tao, and Xiangmin Xu. Texture- preserving diffusion models for high-fidelity virtual try-on. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7017–7026, 2024

2024
[40]

D 4-vton: Dynamic semantics disentangling for differential diffusion based virtual try-on

Zhaotong Yang, Zicheng Jiang, Xinzhe Li, Huiyu Zhou, Junyu Dong, Huaidong Zhang, and Yong Du. D 4-vton: Dynamic semantics disentangling for differential diffusion based virtual try-on. InEuropean Conference on Computer Vision, pages 36–52. Springer, 2024

2024
[41]

The unrea- sonable effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unrea- sonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018

2018
[42]

Boow-vton: Boosting in-the-wild virtual try-on via mask-free pseudo data training

Xuanpu Zhang, Dan Song, Pengxin Zhan, Tianyu Chang, Jianhao Zeng, Qingguo Chen, Wei- hua Luo, and An-An Liu. Boow-vton: Boosting in-the-wild virtual try-on via mask-free pseudo data training. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 26399–26408, 2025

2025
[43]

Mmtryon: Multi-modal multi-reference control for high-quality fashion generation

Xujie Zhang, Ente Lin, Xiu Li, Yuxuan Luo, Michael Kampffmeyer, Xin Dong, and Xiaodan Liang. Mmtryon: Multi-modal multi-reference control for high-quality fashion generation. arXiv preprint arXiv:2405.00448, 2024

work page arXiv 2024
[44]

Design2garmentcode: Turning design concepts to tangible garments through program synthesis

Feng Zhou, Ruiyang Liu, Chen Liu, Gaofeng He, Yong-Lu Li, Xiaogang Jin, and Huamin Wang. Design2garmentcode: Turning design concepts to tangible garments through program synthesis. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 23712–23722, 2025

2025
[45]

Tryondiffusion: A tale of two unets

Luyang Zhu, Dawei Yang, Tyler Zhu, Fitsum Reda, William Chan, Chitwan Saharia, Moham- mad Norouzi, and Ira Kemelmacher-Shlizerman. Tryondiffusion: A tale of two unets. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4606–4615, 2023. 12 Appendix In this appendix, we provide supplementary materials for FitVTON. A....

2023

[1] [1]

Chatgarment: Garment estimation, generation and editing via large language models

Siyuan Bian, Chenghao Xu, Yuliang Xiu, Artur Grigorev, Zhen Liu, Cewu Lu, Michael J Black, and Yao Feng. Chatgarment: Garment estimation, generation and editing via large language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 2924–2934, 2025

2025

[2] [2]

Demystifying MMD GANs

Mikołaj Bi ´nkowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying mmd gans.arXiv preprint arXiv:1801.01401, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[3] [3]

Flux.https://github.com/black-forest-labs/flux, 2024

Black Forest Labs. Flux.https://github.com/black-forest-labs/flux, 2024. Ac- cessed: 2025-01

2024

[4] [4]

Viton-hd: High-resolution virtual try-on via misalignment-aware normalization

Seunghwan Choi, Sunghyun Park, Minsoo Lee, and Jaegul Choo. Viton-hd: High-resolution virtual try-on via misalignment-aware normalization. InProceedings of the IEEE/CVF confer- ence on computer vision and pattern recognition, pages 14131–14140, 2021

2021

[5] [5]

Improving diffusion models for authentic virtual try-on in the wild

Yisol Choi, Sangkyung Kwak, Kyungmin Lee, Hyungwon Choi, and Jinwoo Shin. Improving diffusion models for authentic virtual try-on in the wild. InEuropean Conference on Computer Vision, pages 206–235. Springer, 2024

2024

[6] [6]

Catvton: Concatenation is all you need for virtual try-on with diffusion models.arXiv preprint arXiv:2407.15886, 2024

Zheng Chong, Xiao Dong, Haoxiang Li, Shiyue Zhang, Wenqing Zhang, Xujie Zhang, Han- qing Zhao, Dongmei Jiang, and Xiaodan Liang. Catvton: Concatenation is all you need for virtual try-on with diffusion models.arXiv preprint arXiv:2407.15886, 2024

work page arXiv 2024

[7] [7]

Accurate 3d body shape regression using metric and semantic attributes

Vasileios Choutas, Lea Müller, Chun-Hao P Huang, Siyu Tang, Dimitrios Tzionas, and Michael J Black. Accurate 3d body shape regression using metric and semantic attributes. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2718–2728, 2022

2022

[8] [8]

Clo3d.https://clo3d.com/en/, 2022

CLO3D. Clo3d.https://clo3d.com/en/, 2022

2022

[9] [9]

Omnitry: Virtual try-on anything without masks.arXiv preprint arXiv:2508.13632, 2025

Yutong Feng, Linlin Zhang, Hengyuan Cao, Yiming Chen, Xiaoduan Feng, Jian Cao, Yux- iong Wu, and Bin Wang. Omnitry: Virtual try-on anything without masks.arXiv preprint arXiv:2508.13632, 2025

work page arXiv 2025

[10] [10]

Deepfashion2: A versatile benchmark for detection, pose estimation, segmentation and re-identification of clothing images

Yuying Ge, Ruimao Zhang, Xiaogang Wang, Xiaoou Tang, and Ping Luo. Deepfashion2: A versatile benchmark for detection, pose estimation, segmentation and re-identification of clothing images. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5337–5345, 2019

2019

[11] [11]

Parser-free virtual try-on via distilling appearance flows

Yuying Ge, Yibing Song, Ruimao Zhang, Chongjian Ge, Wei Liu, and Ping Luo. Parser-free virtual try-on via distilling appearance flows. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8485–8493, 2021

2021

[12] [12]

Nano banana pro (gemini 3).https://blog.google, 2025

Google. Nano banana pro (gemini 3).https://blog.google, 2025. Large Language and Image Model

2025

[13] [13]

Any2anytryon: Leveraging adaptive position embeddings for versatile virtual clothing tasks

Hailong Guo, Bohan Zeng, Yiren Song, Wentao Zhang, Jiaming Liu, and Chuang Zhang. Any2anytryon: Leveraging adaptive position embeddings for versatile virtual clothing tasks. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 19085– 19096, 2025

2025

[14] [14]

Viton: An image-based virtual try-on network

Xintong Han, Zuxuan Wu, Zhe Wu, Ruichi Yu, and Larry S Davis. Viton: An image-based virtual try-on network. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 7543–7552, 2018

2018

[15] [15]

Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017

2017

[16] [16]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022. 10

work page internal anchor Pith review Pith/arXiv arXiv 2022

[17] [17]

Fitdit: Advancing the authentic garment details for high-fidelity virtual try-on.arXiv preprint arXiv:2411.10499, 2024

Boyuan Jiang, Xiaobin Hu, Donghao Luo, Qingdong He, Chengming Xu, Jinlong Peng, Jiangn- ing Zhang, Chengjie Wang, Yunsheng Wu, and Yanwei Fu. Fitdit: Advancing the authentic garment details for high-fidelity virtual try-on.arXiv preprint arXiv:2411.10499, 2024

work page arXiv 2024

[18] [18]

Stableviton: Learn- ing semantic correspondence with latent diffusion model for virtual try-on

Jeongho Kim, Guojung Gu, Minho Park, Sunghyun Park, and Jaegul Choo. Stableviton: Learn- ing semantic correspondence with latent diffusion model for virtual try-on. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8176–8185, 2024

2024

[19] [19]

Promptdresser: Improving the quality and controllability of virtual try-on via generative textual prompt and prompt-aware mask

Jeongho Kim, Hoiyeong Jin, Sunghyun Park, and Jaegul Choo. Promptdresser: Improving the quality and controllability of virtual try-on via generative textual prompt and prompt-aware mask. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 16026–16036, 2025

2025

[20] [20]

Garmentcode: Programming parametric sewing patterns.ACM Transactions on Graphics (TOG), 42(6):1–15, 2023

Maria Korosteleva and Olga Sorkine-Hornung. Garmentcode: Programming parametric sewing patterns.ACM Transactions on Graphics (TOG), 42(6):1–15, 2023

2023

[21] [21]

Garmentcodedata: A dataset of 3d made-to-measure garments with sewing patterns

Maria Korosteleva, Timur Levent Kesdogan, Fabian Kemper, Stephan Wenninger, Jasmin Koller, Yuhan Zhang, Mario Botsch, and Olga Sorkine-Hornung. Garmentcodedata: A dataset of 3d made-to-measure garments with sewing patterns. InEuropean Conference on Computer Vision, pages 110–127. Springer, 2024

2024

[22] [22]

FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas Müller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, and Luke Smith. Flux.1 kontext: Flow matching for in-context image ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[23] [23]

Tryongan: Body-aware try-on via layered interpolation.ACM Transactions on Graphics (TOG), 40(4): 1–10, 2021

Kathleen M Lewis, Srivatsan Varadharajan, and Ira Kemelmacher-Shlizerman. Tryongan: Body-aware try-on via layered interpolation.ACM Transactions on Graphics (TOG), 40(4): 1–10, 2021

2021

[24] [24]

Warpdiffusion: Efficient diffusion model for high-fidelity virtual try-on.arXiv preprint arXiv:2312.03667, 2023

Xiu Li, Michael Kampffmeyer, Xin Dong, Zhenyu Xie, Feida Zhu, Haoye Dong, Xiaodan Liang, et al. Warpdiffusion: Efficient diffusion model for high-fidelity virtual try-on.arXiv preprint arXiv:2312.03667, 2023

work page arXiv 2023

[25] [25]

Anyfit: Controllable virtual try-on for any combination of attire across any scenario.Advances in Neural Information Processing Systems, 37:83164–83196, 2024

Yuhan Li, Hao Zhou, Wenxiang Shang, Ran Lin, Xuanhong Chen, and Bingbing Ni. Anyfit: Controllable virtual try-on for any combination of attire across any scenario.Advances in Neural Information Processing Systems, 37:83164–83196, 2024

2024

[26] [26]

Smpl: A skinned multi-person linear model

Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. Smpl: A skinned multi-person linear model. InSeminal Graphics Papers: Pushing the Bound- aries, Volume 2, pages 851–866. 2023

2023

[27] [27]

Warp: A high-performance python framework for gpu simulation and graphics

Miles Macklin. Warp: A high-performance python framework for gpu simulation and graphics. https://github.com/nvidia/warp, 2022. NVIDIA GPU Technology Conference (GTC)

2022

[28] [28]

Dress code: High-resolution multi-category virtual try-on

Davide Morelli, Matteo Fincato, Marcella Cornia, Federico Landi, Fabio Cesari, and Rita Cucchiara. Dress code: High-resolution multi-category virtual try-on. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2231–2235, 2022

2022

[29] [29]

Ladi-vton: Latent diffusion textual-inversion enhanced virtual try-on

Davide Morelli, Alberto Baldrati, Giuseppe Cartella, Marcella Cornia, Marco Bertini, and Rita Cucchiara. Ladi-vton: Latent diffusion textual-inversion enhanced virtual try-on. In Proceedings of the 31st ACM international conference on multimedia, pages 8580–8589, 2023

2023

[30] [30]

Expressive body capture: 3d hands, face, and body from a single image

Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed AA Osman, Dimitrios Tzionas, and Michael J Black. Expressive body capture: 3d hands, face, and body from a single image. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10975–10985, 2019

2019

[31] [31]

Style3d.https://www.style3d.com, 2022

Style3D. Style3d.https://www.style3d.com, 2022. 11

2022

[32] [32]

Gen- eralised dice overlap as a deep learning loss function for highly unbalanced segmentations

Carole H Sudre, Wenqi Li, Tom Vercauteren, Sebastien Ourselin, and M Jorge Cardoso. Gen- eralised dice overlap as a deep learning loss function for highly unbalanced segmentations. InDeep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pages 240–248. Springer, 2017

2017

[33] [33]

Incorporating visual corre- spondence into diffusion model for virtual try-on.arXiv preprint arXiv:2505.16977, 2025

Siqi Wan, Jingwen Chen, Yingwei Pan, Ting Yao, and Tao Mei. Incorporating visual corre- spondence into diffusion model for virtual try-on.arXiv preprint arXiv:2505.16977, 2025

work page arXiv 2025

[34] [34]

Jco-mvton: Jointly controllable multi-modal diffusion transformer for mask-free virtual try-on

Aowen Wang, Wei Li, Hao Luo, Mengxing Ao, Chenyu Zhu, Xinyang Li, and Fan Wang. Jco-mvton: Jointly controllable multi-modal diffusion transformer for mask-free virtual try-on. arXiv preprint arXiv:2508.17614, 2025

work page arXiv 2025

[35] [35]

To- ward characteristic-preserving image-based virtual try-on network

Bochao Wang, Huabin Zheng, Xiaodan Liang, Yimin Chen, Liang Lin, and Meng Yang. To- ward characteristic-preserving image-based virtual try-on network. InProceedings of the Eu- ropean conference on computer vision (ECCV), pages 589–604, 2018

2018

[36] [36]

Gp-vton: Towards general purpose virtual try-on via collaborative local-flow global-parsing learning

Zhenyu Xie, Zaiyu Huang, Xin Dong, Fuwei Zhao, Haoye Dong, Xijin Zhang, Feida Zhu, and Xiaodan Liang. Gp-vton: Towards general purpose virtual try-on via collaborative local-flow global-parsing learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 23550–23559, 2023

2023

[37] [37]

Ootdiffusion: Outfitting fusion based latent diffusion for controllable virtual try-on

Yuhao Xu, Tao Gu, Weifeng Chen, and Arlene Chen. Ootdiffusion: Outfitting fusion based latent diffusion for controllable virtual try-on. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 8996–9004, 2025

2025

[38] [38]

Towards photo-realistic virtual try-on by adaptively generating-preserving image content

Han Yang, Ruimao Zhang, Xiaobao Guo, Wei Liu, Wangmeng Zuo, and Ping Luo. Towards photo-realistic virtual try-on by adaptively generating-preserving image content. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7850– 7859, 2020

2020

[39] [39]

Texture- preserving diffusion models for high-fidelity virtual try-on

Xu Yang, Changxing Ding, Zhibin Hong, Junhao Huang, Jin Tao, and Xiangmin Xu. Texture- preserving diffusion models for high-fidelity virtual try-on. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7017–7026, 2024

2024

[40] [40]

D 4-vton: Dynamic semantics disentangling for differential diffusion based virtual try-on

Zhaotong Yang, Zicheng Jiang, Xinzhe Li, Huiyu Zhou, Junyu Dong, Huaidong Zhang, and Yong Du. D 4-vton: Dynamic semantics disentangling for differential diffusion based virtual try-on. InEuropean Conference on Computer Vision, pages 36–52. Springer, 2024

2024

[41] [41]

The unrea- sonable effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unrea- sonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018

2018

[42] [42]

Boow-vton: Boosting in-the-wild virtual try-on via mask-free pseudo data training

Xuanpu Zhang, Dan Song, Pengxin Zhan, Tianyu Chang, Jianhao Zeng, Qingguo Chen, Wei- hua Luo, and An-An Liu. Boow-vton: Boosting in-the-wild virtual try-on via mask-free pseudo data training. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 26399–26408, 2025

2025

[43] [43]

Mmtryon: Multi-modal multi-reference control for high-quality fashion generation

Xujie Zhang, Ente Lin, Xiu Li, Yuxuan Luo, Michael Kampffmeyer, Xin Dong, and Xiaodan Liang. Mmtryon: Multi-modal multi-reference control for high-quality fashion generation. arXiv preprint arXiv:2405.00448, 2024

work page arXiv 2024

[44] [44]

Design2garmentcode: Turning design concepts to tangible garments through program synthesis

Feng Zhou, Ruiyang Liu, Chen Liu, Gaofeng He, Yong-Lu Li, Xiaogang Jin, and Huamin Wang. Design2garmentcode: Turning design concepts to tangible garments through program synthesis. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 23712–23722, 2025

2025

[45] [45]

Tryondiffusion: A tale of two unets

Luyang Zhu, Dawei Yang, Tyler Zhu, Fitsum Reda, William Chan, Chitwan Saharia, Moham- mad Norouzi, and Ira Kemelmacher-Shlizerman. Tryondiffusion: A tale of two unets. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4606–4615, 2023. 12 Appendix In this appendix, we provide supplementary materials for FitVTON. A....

2023