pith. machine review for the scientific record.

arxiv: 2604.07210 · v1 · submitted 2026-04-08 · 💻 cs.CV

Recognition: 2 theorem links

VersaVogue: Visual Expert Orchestration and Preference Alignment for Unified Fashion Synthesis

Cong Wang, Fei Shen, Jian Yu, Jinhui Tang, Si Shen, Xiaoyu Du, Yi Xin

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:55 UTC · model grok-4.3

classification 💻 cs.CV
keywords fashion image synthesis · diffusion models · mixture of experts · preference optimization · garment generation · virtual dressing · controllable generation · unified framework

The pith

A trait-routing attention module with mixture-of-experts and an automated preference pipeline together unify garment generation and virtual dressing inside one diffusion model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Prior diffusion approaches handle garment creation and virtual try-on as isolated tasks, often producing entangled attributes when multiple conditions are supplied. VersaVogue introduces a single framework that routes visual traits dynamically to specialized experts and layers while building its own preference data from content, text, and quality signals. The resulting model delivers higher visual fidelity, tighter semantic alignment, and finer user control across both design and showcase stages of fashion workflows. If the routing and alignment steps hold, practitioners could replace separate pipelines with one controllable generator.

Core claim

VersaVogue jointly supports garment generation and virtual dressing by replacing static feature concatenation with a trait-routing attention module that uses mixture-of-experts routing to send each condition to the most compatible expert and generative layer, thereby disentangling texture, shape, and color. An automated multi-perspective preference optimization pipeline then assembles reliable preference pairs from fidelity, alignment, and perceptual evaluators and applies direct preference optimization, removing the need for human labels or task-specific reward models.
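For reference, the DPO step invoked here is the standard objective from the cited Rafailov et al. work, not anything VersaVogue-specific; a sketch, with π_θ the trained model, π_ref the frozen reference, β the KL weight, and σ the logistic function:

```latex
% Standard RLHF objective (learned reward r_\phi, KL-anchored to a reference):
\max_{\pi_\theta}\;
\mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi_\theta(\cdot\mid x)}\!\left[r_\phi(x,y)\right]
\;-\;\beta\,\mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta(y\mid x)\,\big\|\,\pi_{\mathrm{ref}}(y\mid x)\right]

% DPO removes r_\phi by optimizing directly on preference pairs (y_w \succ y_l):
\mathcal{L}_{\mathrm{DPO}}(\theta)
= -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[
\log\sigma\!\Big(
\beta\log\tfrac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
-\beta\log\tfrac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}
\Big)\right]
```

The MPO pipeline's contribution is thus confined to how the pairs (y_w, y_l) are assembled; the optimization itself is off-the-shelf.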

What carries the argument

The trait-routing attention (TA) module, a mixture-of-experts mechanism that dynamically assigns condition features to the most compatible experts and generative layers for disentangled attribute injection.
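The paper's TA module internals are not reproduced here; purely as an illustration of the top-1 mixture-of-experts routing pattern it builds on, a minimal numpy sketch (all names, shapes, and the gating scheme are hypothetical, not the authors' implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, n_experts, n_tokens = 16, 4, 6

# Hypothetical parameters: one gating projection and one linear "expert" each.
W_gate = rng.normal(size=(d, n_experts))
experts = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_experts)]

def trait_route(cond_tokens):
    """Send each condition token to its top-1 expert, scaled by the gate,
    so different traits (texture, shape, color) can land on different experts."""
    logits = cond_tokens @ W_gate              # (n_tokens, n_experts)
    gate = softmax(logits)                     # soft routing weights, rows sum to 1
    top1 = gate.argmax(axis=-1)                # hard expert assignment per token
    out = np.zeros_like(cond_tokens)
    for t, e in enumerate(top1):
        out[t] = gate[t, e] * (cond_tokens[t] @ experts[e])
    return out, gate, top1

tokens = rng.normal(size=(n_tokens, d))
out, gate, top1 = trait_route(tokens)
```

The same gating weights could additionally select which generative layer receives each routed feature, which is the "and generative layers" half of the paper's claim.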

If this is right

  • One model replaces two separate systems for the design and showcase stages of the fashion lifecycle.
  • Dynamic expert routing reduces attribute entanglement that arises from simple concatenation of heterogeneous conditions.
  • Preference pairs built from multiple automated evaluators allow direct preference optimization without human annotation or custom reward models.
  • Benchmarks show consistent gains in visual fidelity, semantic consistency, and fine-grained controllability for both garment generation and virtual dressing.
  • The framework handles multi-source conditions such as text, reference images, and masks within a single generative process.
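The automated preference-pair construction described above can be sketched with a unanimous-agreement filter: a pair is kept only when every evaluator prefers the same candidate by some margin. The evaluator names, scores, and margin below are hypothetical, not the paper's actual MPO pipeline:

```python
def reliable_pairs(scores, margin=0.05):
    """scores: one dict per candidate with keys 'fidelity', 'alignment',
    'quality' (hypothetical evaluator names). A (winner, loser) pair is
    kept only when every evaluator prefers the winner by >= margin."""
    keys = ("fidelity", "alignment", "quality")
    pairs = []
    n = len(scores)
    for w in range(n):
        for l in range(n):
            if w != l and all(scores[w][k] - scores[l][k] >= margin for k in keys):
                pairs.append((w, l))
    return pairs

candidates = [
    {"fidelity": 0.9, "alignment": 0.8, "quality": 0.85},
    {"fidelity": 0.6, "alignment": 0.7, "quality": 0.65},
    {"fidelity": 0.8, "alignment": 0.5, "quality": 0.9},  # evaluators disagree
]
pairs = reliable_pairs(candidates)  # only the unanimous pair (0, 1) survives
```

Filtering for unanimity is what makes the pairs "reliable" in the paper's sense: disagreement between evaluators (candidate 2 above) yields no training signal rather than a noisy one.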

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same routing-plus-preference pattern could be tested on other multi-condition image tasks such as interior layout or product visualization.
  • If the MPO pipeline generalizes, it offers a template for reward-free alignment in additional diffusion domains.
  • Real-time fashion applications could combine the unified model with live user inputs to shorten the path from sketch to virtual model.

Load-bearing premise

The automated evaluators of content fidelity, textual alignment, and perceptual quality produce unbiased preference pairs that correctly reflect human preference.

What would settle it

A large-scale human preference study on multi-condition fashion prompts in which raters consistently choose outputs from prior separate models over VersaVogue outputs would falsify the claimed gains in fidelity and controllability.

Figures

Figures reproduced from arXiv: 2604.07210 by Cong Wang, Fei Shen, Jian Yu, Jinhui Tang, Si Shen, Xiaoyu Du, Yi Xin.

Figure 1: Differences in workflows and input conditions be… [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2: Architectural overview of VersaVogue. (a) First, distinct conditions are processed through isolated self-attention layers… [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3: Qualitative comparison with baseline methods on (a) garment generation and (b) multi-garment virtual dressing… [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4: Qualitative comparison with SOTA methods on… [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 6: User study results. Higher scores indicate better… [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7: Dynamics of routing entropy across SDXL layers… [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 8: Real-world applications of VersaVogue, bridging… [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗
Figure 9: Ablation study on the number of experts… [PITH_FULL_IMAGE:figures/full_fig_p011_9.png] view at source ↗
Figure 10: Analysis of CLIP-I (left) and FID (right) scores… [PITH_FULL_IMAGE:figures/full_fig_p011_10.png] view at source ↗
Figure 11: More qualitative comparisons I. view at source ↗
Figure 12: More qualitative comparisons II. view at source ↗
Figure 13: More qualitative comparisons III… [PITH_FULL_IMAGE:figures/full_fig_p014_13.png] view at source ↗
Figure 14: Examples from the preference dataset synthesized… [PITH_FULL_IMAGE:figures/full_fig_p018_14.png] view at source ↗
read the original abstract

Diffusion models have driven remarkable advancements in fashion image generation, yet prior works usually treat garment generation and virtual dressing as separate problems, limiting their flexibility in real-world fashion workflows. Moreover, fashion image synthesis under multi-source heterogeneous conditions remains challenging, as existing methods typically rely on simple feature concatenation or static layer-wise injection, which often causes attribute entanglement and semantic interference. To address these issues, we propose VersaVogue, a unified framework for multi-condition controllable fashion synthesis that jointly supports garment generation and virtual dressing, corresponding to the design and showcase stages of the fashion lifecycle. Specifically, we introduce a trait-routing attention (TA) module that leverages a mixture-of-experts mechanism to dynamically route condition features to the most compatible experts and generative layers, enabling disentangled injection of visual attributes such as texture, shape, and color. To further improve realism and controllability, we develop an automated multi-perspective preference optimization (MPO) pipeline that constructs preference data without human annotation or task-specific reward models. By combining evaluators of content fidelity, textual alignment, and perceptual quality, MPO identifies reliable preference pairs, which are then used to optimize the model via direct preference optimization (DPO). Extensive experiments on both garment generation and virtual dressing benchmarks demonstrate that VersaVogue consistently outperforms existing methods in visual fidelity, semantic consistency, and fine-grained controllability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces VersaVogue, a unified framework for multi-condition controllable fashion synthesis that jointly addresses garment generation and virtual dressing. It proposes a trait-routing attention (TA) module leveraging mixture-of-experts to dynamically route condition features for disentangled injection of attributes such as texture, shape, and color. An automated multi-perspective preference optimization (MPO) pipeline constructs preference pairs using three evaluators (content fidelity, textual alignment, perceptual quality) without human annotation, which are then used to optimize the model via direct preference optimization (DPO). The central claim is that VersaVogue consistently outperforms prior methods in visual fidelity, semantic consistency, and fine-grained controllability on garment generation and virtual dressing benchmarks.

Significance. If the results hold, the work could advance unified fashion synthesis by reducing task fragmentation and attribute entanglement through dynamic routing, while offering a scalable alternative to human-annotated preference data via automated evaluators. The MPO approach, if validated, would be particularly impactful for alignment in generative vision models.

major comments (2)
  1. [§3.3] The MPO pipeline constructs preference pairs exclusively from three automated evaluators (content fidelity, textual alignment, perceptual quality) and applies them directly in DPO. No correlation analysis, human study, or inter-evaluator agreement metrics are reported to establish that these rankings align with human perception of fashion attributes (e.g., texture fidelity or fine-grained controllability). This is load-bearing for the claimed gains in semantic consistency and visual fidelity, as misalignment would cause DPO to optimize toward evaluator artifacts rather than the intended objectives.
  2. [Results section] The manuscript asserts extensive experiments showing consistent outperformance, yet provides no ablation tables isolating the TA module versus the MPO pipeline, no quantitative effect sizes or statistical significance tests on benchmark metrics, and no error analysis or failure-case breakdowns. Without these, the strongest claim of superiority across visual fidelity, semantic consistency, and controllability cannot be rigorously evaluated.
minor comments (1)
  1. [§3.2] The notation for the trait-routing attention module could be clarified with an explicit equation for the expert routing weights and layer selection to aid reproducibility.
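The agreement statistics the referee asks for are standard and cheap to compute. A minimal sketch on synthetic evaluator scores, with Pearson correlation via numpy and a hand-rolled Cohen's kappa on binarized preferences (the data and the zero-threshold binarization are illustrative only):

```python
import numpy as np

def cohens_kappa(a, b):
    """Cohen's kappa for two binary raters (0/1 or boolean labels)."""
    a, b = np.asarray(a), np.asarray(b)
    po = (a == b).mean()                                  # observed agreement
    pe = a.mean() * b.mean() + (1 - a.mean()) * (1 - b.mean())  # chance agreement
    return (po - pe) / (1 - pe) if pe < 1 else 1.0

rng = np.random.default_rng(1)
# Hypothetical evaluator scores on 500 held-out samples; the alignment
# scores are correlated with fidelity by construction.
fidelity = rng.normal(size=500)
alignment = 0.7 * fidelity + 0.3 * rng.normal(size=500)

r = np.corrcoef(fidelity, alignment)[0, 1]          # Pearson correlation
kappa = cohens_kappa(fidelity > 0, alignment > 0)   # agreement on binarized prefs
```

High pairwise correlation between evaluators is necessary but not sufficient here: all three could agree with each other while jointly diverging from human preference, which is why the human study in the rebuttal matters.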

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback on our manuscript. The comments highlight important aspects of validation for the MPO pipeline and the rigor of our experimental analysis. We address each point below and commit to revisions that strengthen the paper without altering its core contributions.

read point-by-point responses
  1. Referee: [§3.3] The MPO pipeline constructs preference pairs exclusively from three automated evaluators (content fidelity, textual alignment, perceptual quality) and applies them directly in DPO. No correlation analysis, human study, or inter-evaluator agreement metrics are reported to establish that these rankings align with human perception of fashion attributes (e.g., texture fidelity or fine-grained controllability). This is load-bearing for the claimed gains in semantic consistency and visual fidelity, as misalignment would cause DPO to optimize toward evaluator artifacts rather than the intended objectives.

    Authors: We agree that explicit validation of the automated evaluators against human perception is necessary to support the claims. The current manuscript relies on established metrics for each evaluator (e.g., CLIP-based textual alignment, perceptual quality via LPIPS/FID variants, and content fidelity via reconstruction metrics), but does not report inter-evaluator agreement or human correlation. In the revision, we will add: (1) pairwise Pearson correlations and Cohen's kappa between the three evaluators on a held-out set of 500 samples; (2) a small-scale human study (20 participants, 200 image pairs) rating preference alignment on texture fidelity, shape controllability, and overall quality, with results reported in a new subsection of §3.3. This will quantify how well the MPO rankings match human judgments and mitigate concerns about evaluator artifacts. revision: yes

  2. Referee: [Results section] The manuscript asserts extensive experiments showing consistent outperformance, yet provides no ablation tables isolating the TA module versus the MPO pipeline, no quantitative effect sizes or statistical significance tests on benchmark metrics, and no error analysis or failure-case breakdowns. Without these, the strongest claim of superiority across visual fidelity, semantic consistency, and controllability cannot be rigorously evaluated.

    Authors: The referee correctly notes that the current results section lacks component-wise ablations, statistical tests, and failure analysis. While the manuscript includes overall benchmark comparisons, it does not isolate the Trait-Routing Attention (TA) module from the MPO pipeline or provide effect sizes. We will revise the Results section to include: (1) ablation tables showing performance with/without TA and with/without MPO on both garment generation and virtual dressing tasks; (2) statistical significance via paired t-tests and Cohen's d effect sizes on primary metrics (FID, CLIP score, controllability accuracy); (3) a dedicated error analysis subsection with quantitative breakdown of failure modes (e.g., by attribute type: texture vs. shape) and qualitative examples of remaining limitations. These additions will allow rigorous evaluation of each component's contribution. revision: yes
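The paired t-tests and Cohen's d the authors commit to are straightforward on per-prompt metric scores. A sketch in plain numpy; the FID numbers below are made up for illustration and are not results from the paper:

```python
import numpy as np

def paired_stats(a, b):
    """Paired t statistic and Cohen's d for per-prompt metric scores
    from two models evaluated on the same prompts."""
    diff = np.asarray(a) - np.asarray(b)
    n = diff.size
    sd = diff.std(ddof=1)
    t = diff.mean() / (sd / np.sqrt(n))   # paired t statistic
    cohen_d = diff.mean() / sd            # standardized effect size
    return t, cohen_d

# Hypothetical per-prompt FID (lower is better): baseline vs. new model.
baseline_fid = np.array([18.2, 17.9, 19.1, 18.5, 18.8, 19.4])
model_fid    = np.array([16.9, 17.1, 17.8, 17.0, 17.5, 18.0])
t, d = paired_stats(baseline_fid, model_fid)
```

Reporting d alongside p-values is the substantive part of the commitment: with large benchmark sets, tiny and practically irrelevant metric gaps can still be statistically significant.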

Circularity Check

0 steps flagged

No significant circularity detected in derivation or claims.

full rationale

The paper's core contributions—the trait-routing attention module using mixture-of-experts for condition injection and the MPO pipeline that constructs preference pairs via three automated evaluators before applying standard DPO—are presented as independent methodological innovations. The outperformance claims rest on benchmark experiments rather than any reduction of results to fitted parameters, self-definitions, or self-citation chains by construction. No equations or steps equate the final performance metrics to the input evaluators or routing logic; the MPO reliability is an empirical assumption validated externally via experiments, not a tautology. This keeps the derivation self-contained against the stated benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on standard diffusion-model assumptions and DPO; the new elements are the routing module and the automatic preference construction, which introduce no additional free parameters and no invented entities beyond what the abstract already lists.

axioms (2)
  • domain assumption Diffusion models can be conditioned on heterogeneous visual and textual inputs
    Invoked in the opening motivation for unified multi-condition synthesis.
  • ad hoc to paper Mixture-of-experts routing can achieve disentangled attribute injection
    Core premise of the trait-routing attention module.

pith-pipeline@v0.9.0 · 5555 in / 1322 out tokens · 43924 ms · 2026-05-10T18:55:51.796555+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

58 extracted references · 19 canonical work pages · 9 internal anchors

  1. [1]

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv preprint arXiv:2308.12966 (2023)

  2. [2]

    Ralph Allan Bradley and Milton E Terry. 1952. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika 39, 3/4 (1952), 324–345

  3. [3]

    Weifeng Chen, Tao Gu, Yuhao Xu, and Arlene Chen. 2024. Magic Clothing: Controllable garment-driven image synthesis. In Proceedings of the 32nd ACM International Conference on Multimedia. 6939–6948

  4. [4]

    Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, and Hengshuang Zhao. 2024. AnyDoor: Zero-shot object-level image customization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6593–6602

  6. [6]

    Seunghwan Choi, Sunghyun Park, Minsoo Lee, and Jaegul Choo. 2021. VITON-HD: High-Resolution Virtual Try-On via Misalignment-Aware Normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

  7. [7]

    Zheng Chong, Yanwei Lei, Shiyue Zhang, Zhuandi He, Zhen Wang, Xujie Zhang, Xiao Dong, Yiling Wu, Dongmei Jiang, and Xiaodan Liang. 2025. FastFit: Accelerating Multi-Reference Virtual Try-On via Cacheable Diffusion Models. arXiv:2508.20586 [cs.CV] https://arxiv.org/abs/2508.20586

  8. [8]

    William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research 23, 120 (2022), 1–39

  9. [9]

    Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, and Junshi Huang. 2024. Scaling diffusion transformers to 16 billion parameters. arXiv preprint arXiv:2407.11633 (2024)

  11. [11]

    Yarden Frenkel, Yael Vinker, Ariel Shamir, and Daniel Cohen-Or. 2024. Implicit style-content separation using B-LoRA. In European Conference on Computer Vision. Springer, 181–198

  12. [12]

    Masato Fujitake. 2024. RL-LOGO: Deep reinforcement learning localization for logo recognition. In ICASSP 2024 – 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2830–2834

  13. [13]

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2017. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems 30 (2017)

  14. [14]

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33 (2020), 6840–6851

  15. [15]

    Jonathan Ho and Tim Salimans. 2022. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022)

  16. [16]

    Wenyi Hong, Weihan Wang, Ming Ding, Wenmeng Yu, Qingsong Lv, Yan Wang, Yean Cheng, Shiyu Huang, Junhui Ji, Zhao Xue, et al. 2024. CogVLM2: Visual Language Models for Image and Video Understanding. arXiv:2408.16500 [cs.CV]

  17. [17]

    Jeongho Kim, Guojung Gu, Minho Park, Sunghyun Park, and Jaegul Choo. 2024. StableVITON: Learning semantic correspondence with latent diffusion model for virtual try-on. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8176–8185

  18. [18]

    Dongxu Li, Junnan Li, and Steven Hoi. 2023. BLIP-Diffusion: Pre-trained subject representation for controllable text-to-image generation and editing. Advances in Neural Information Processing Systems 36 (2023), 30146–30166

  19. [19]

    Xinghui Li, Qichao Sun, Pengze Zhang, Fulong Ye, Zhichao Liao, Wanquan Feng, Songtao Zhao, and Qian He. 2025. AnyDressing: Customizable multi-garment virtual dressing via latent diffusion models. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 23723–23733

  20. [20]

    Ente Lin, Xujie Zhang, Fuwei Zhao, Yuxuan Luo, Xin Dong, Long Zeng, and Xiaodan Liang. 2025. DreamFit: Garment-centric human generation via a lightweight anything-dressing encoder. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 5218–5226

  21. [21]

    Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. 2023. Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378 (2023)

  22. [22]

    Davide Morelli, Matteo Fincato, Marcella Cornia, Federico Landi, Fabio Cesari, and Rita Cucchiara. 2022. Dress Code: High-Resolution Multi-Category Virtual Try-On. In Proceedings of the European Conference on Computer Vision

  23. [23]

    Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, and Ying Shan. 2024. T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 4296–4304

  24. [24]

    William Peebles and Saining Xie. 2023. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 4195–4205

  25. [25]

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. 2023. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023)

  26. [26]

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. PMLR, 8748–8763

  28. [28]

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems 36 (2023), 53728–53741

  29. [29]

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10684–10695

  30. [30]

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538 (2017)

  31. [31]

    Fei Shen, Xin Jiang, Xin He, Hu Ye, Cong Wang, Xiaoyu Du, Zechao Li, and Jinhui Tang. 2025. IMAGDressing-v1: Customizable virtual dressing. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 6795–6804

  32. [32]

    Fei Shen, Jian Yu, Cong Wang, Xin Jiang, Xiaoyu Du, and Jinhui Tang. 2025. IMAGGarment-1: Fine-grained garment generation for controllable fashion design. arXiv preprint arXiv:2504.13176 (2025)

  33. [33]

    Jiaming Song, Chenlin Meng, and Stefano Ermon. 2022. Denoising Diffusion Implicit Models. arXiv:2010.02502 [cs.LG] https://arxiv.org/abs/2010.02502

  34. [34]

    Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. 2024. Diffusion model alignment using direct preference optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8228–8238

  36. [36]

    Haoxuan Wang, Jinlong Peng, Qingdong He, Hao Yang, Ying Jin, Jiafu Wu, Xiaobin Hu, Yanjie Pan, Zhenye Gan, Mingmin Chi, et al. 2025. UniCombine: Unified multi-conditional combination with diffusion transformer. arXiv preprint arXiv:2503.09277 (2025)

  37. [37]

    Haofan Wang, Matteo Spinelli, Qixun Wang, Xu Bai, Zekui Qin, and Anthony Chen. 2024. InstantStyle: Free lunch towards style-preserving in text-to-image generation. arXiv preprint arXiv:2404.02733 (2024)

  38. [38]

    Rui Wang, Hailong Guo, Jiaming Liu, Huaxia Li, Haibo Zhao, Xu Tang, Yao Hu, Hao Tang, and Peipei Li. 2024. StableGarment: Garment-centric generation via stable diffusion. arXiv preprint arXiv:2403.10783 (2024)

  39. [39]

    Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. 2004. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing 13, 4 (2004), 600–612

  40. [40]

    Yuhao Xu, Tao Gu, Weifeng Chen, and Arlene Chen. 2025. OOTDiffusion: Outfitting fusion based latent diffusion for controllable virtual try-on. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 8996–9004

  41. [41]

    Zeyue Xue, Guanglu Song, Qiushan Guo, Boxiao Liu, Zhuofan Zong, Yu Liu, and Ping Luo. 2023. RAPHAEL: Text-to-image generation via large mixture of diffusion paths. Advances in Neural Information Processing Systems 36 (2023), 41693–41706

  42. [42]

    Kai Yang, Jian Tao, Jiafei Lyu, Chunjiang Ge, Jiaxin Chen, Weihan Shen, Xiaolong Zhu, and Xiu Li. 2024. Using human feedback to fine-tune diffusion models without any reward model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8941–8951

  43. [43]

    Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. 2023. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721 (2023)

  44. [44]

    Mingzhe Yu, Yunshan Ma, Lei Wu, Changshuo Wang, Xue Li, and Lei Meng. 2025. FashionDPO: Fine-tune Fashion Outfit Generation Model using Direct Preference Optimization. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. 212–222

  46. [46]

    Kai Zeng, Zhou Wang, Anmin Zhang, Zhaohui Wang, and Wenjun Zhang. 2014. A color structural similarity index for image quality assessment. In 2014 IEEE International Conference on Image Processing (ICIP). IEEE, 660–664. doi:10.1109/ICIP.2014.7025894

  47. [47]

    Cheng Zhang, Dong Gong, Jiumei He, Yu Zhu, Jinqiu Sun, and Yanning Zhang. 2024. UIR-LoRA: Achieving Universal Image Restoration through Multiple Low-Rank Adaptation. arXiv preprint arXiv:2409.20197 (2024)

  49. [49]

    Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M. Ni, and Heung-Yeung Shum. 2022. DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection. arXiv:2203.03605 [cs.CV]

  50. [50]

    Jinghao Zhang, Wen Qian, Hao Luo, Fan Wang, and Feng Zhao. 2024. AnyLogo: Symbiotic Subject-Driven Diffusion System with Gemini Status. arXiv preprint arXiv:2409.17740 (2024)

  51. [51]

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. 2023. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 3836–3847

  52. [52]

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. 2018. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 586–595

  54. [54]

    Xujie Zhang, Yu Sha, Michael C Kampffmeyer, Zhenyu Xie, Zequn Jie, Chengwen Huang, Jianqing Peng, and Xiaodan Liang. 2022. ARMANI: Part-level garment-text alignment for unified cross-modal fashion design. In Proceedings of the 30th ACM International Conference on Multimedia. 4525–4535

  55. [55]

    Xujie Zhang, Binbin Yang, Michael C Kampffmeyer, Wenqing Zhang, Shiyue Zhang, Guansong Lu, Liang Lin, Hang Xu, and Xiaodan Liang. 2023. DiffCloth: Diffusion based garment synthesis and manipulation via structural cross-modal semantic alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 23154–23163

  56. [56]

    Shihao Zhao, Dongdong Chen, Yen-Chun Chen, Jianmin Bao, Shaozhe Hao, Lu Yuan, and Kwan-Yee K Wong. 2023. Uni-ControlNet: All-in-one control to text-to-image diffusion models. Advances in Neural Information Processing Systems 36 (2023), 11127–11150

  57. [57]

    Mingkang Zhu, Xi Chen, Zhongdao Wang, Hengshuang Zhao, and Jiaya Jia. 2024. LogoSticker: Inserting logos into diffusion models for customized generation. In European Conference on Computer Vision. Springer, 363–378

  58. [58]

    Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. 2019. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593 (2019)