arxiv: 2607.02220 · v1 · pith:QRDI7QYPnew · submitted 2026-07-02 · 💻 cs.CV

DetailAnywhere: Fashion Detail Generation via Cross-Modal Feature Alignment Distillation

Pith reviewed 2026-07-03 15:54 UTC · model grok-4.3

classification 💻 cs.CV
keywords fashion detail generation · cross-modal feature alignment · diffusion transformer · DINOv3 distillation · focus conditioning · consistency reward model · virtual try-on

The pith

DetailAnywhere generates photorealistic fashion details from focus markers on garment images via cross-modal distillation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Fashion Detail Generation as a new task where a model must produce a close-up view of a specific garment region, such as a collar or cuff, indicated only by a focus marker on a reference image. It releases FDBench, a benchmark of over 40,000 human-verified reference-detail pairs across 41 categories, to measure performance on this setting. The core technique, Cross-Modal Feature Alignment Distillation, uses a fine-tuned DINOv3 teacher to align the branches of a Multimodal Diffusion Transformer in a shared semantic space. A consistency reward model further guides reinforcement learning to maintain garment identity. The resulting DetailAnywhere model outperforms prior open-source methods on standard metrics and in human evaluations.

Core claim

Cross-Modal Feature Alignment Distillation aligns both branches of a Multimodal Diffusion Transformer in a shared semantic space using a fine-tuned DINOv3 teacher, enabling the model to bridge the correspondence between a focus marker and the corresponding photorealistic close-up while preserving garment identity without any precise prompt.

What carries the argument

Cross-Modal Feature Alignment Distillation (CFAD), which performs dual-branch distillation from a DINOv3 teacher into a Multimodal Diffusion Transformer.
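
A minimal sketch of what a dual-branch feature-alignment loss could look like, in plain Python. The source describes CFAD only at a high level, so everything here is an assumption for illustration: the cosine-based objective, the function names (`cosine`, `alignment_loss`, `dual_branch_loss`), the weights, and the idea that each branch has its own teacher target. It is not the paper's actual loss.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def alignment_loss(student_tokens, teacher_tokens):
    """Mean (1 - cosine) over paired token features; near 0 when aligned."""
    if len(student_tokens) != len(teacher_tokens):
        raise ValueError("student and teacher need the same number of tokens")
    per_token = [1.0 - cosine(s, t) for s, t in zip(student_tokens, teacher_tokens)]
    return sum(per_token) / len(per_token)

def dual_branch_loss(image_branch, text_branch, teacher_image, teacher_text,
                     w_image=1.0, w_text=1.0):
    """Hypothetical combined loss: both student branches are pulled
    toward (frozen) teacher features in one shared space."""
    return (w_image * alignment_loss(image_branch, teacher_image)
            + w_text * alignment_loss(text_branch, teacher_text))

# Toy features: two tokens, two dimensions each.
teacher = [[1.0, 0.0], [0.0, 1.0]]
aligned = dual_branch_loss(teacher, teacher, teacher, teacher)
misaligned = dual_branch_loss([[0.0, 1.0], [1.0, 0.0]], teacher, teacher, teacher)
```

In this toy setup `aligned` is close to zero, while `misaligned` is close to one because only the image branch is orthogonal to its target. In practice such a term would be added to the generator's usual training objective.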

If this is right

  • E-commerce platforms can let users request and receive close-up views of specific garment features directly from product photos.
  • Generated details remain consistent with the reference image's identity even without text descriptions of the detail.
  • The consistency reward model provides a way to optimize diffusion outputs for paired image quality along multiple axes.
  • Non-template detail synthesis becomes feasible for apparel visualization tasks.
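
To make the multi-axis reward idea concrete, here is a small sketch of collapsing three per-pair scores into one scalar and centering rewards within a group of samples. The source does not name the three axes or the aggregation rule, so the field names (`axis_1` to `axis_3`), the weighted mean, and the mean-centering step are all placeholders for illustration, not the paper's reward model.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class PairScores:
    """Scores for one (reference, generated detail) pair; axis names are placeholders."""
    axis_1: float
    axis_2: float
    axis_3: float

def scalar_reward(scores, weights=(1.0, 1.0, 1.0)):
    """Weighted mean of the three axis scores (assumed aggregation)."""
    values = (scores.axis_1, scores.axis_2, scores.axis_3)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def centered_rewards(rewards):
    """Subtract the group mean so above-average samples get positive signal."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]

# Toy group of three generated candidates for the same reference image.
group = [PairScores(0.9, 0.6, 0.3), PairScores(0.3, 0.3, 0.3), PairScores(0.9, 0.9, 0.9)]
rewards = [scalar_reward(s) for s in group]
advantages = centered_rewards(rewards)
```

Mean-centering is a common way to turn raw rewards into a learning signal, and it is used here only as a stand-in for whatever the reinforcement learning stage actually does.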

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same alignment strategy might apply to detail generation in other domains such as product design or interior visualization.
  • FDBench could serve as a testbed for studying focus-conditioned generation beyond fashion.
  • The dual-branch distillation might combine with other teacher models to handle additional conditioning signals.

Load-bearing premise

A fine-tuned DINOv3 teacher can align both branches of a Multimodal Diffusion Transformer in a shared semantic space to bridge the correspondence between a focus marker and a photorealistic close-up while preserving garment identity without precise prompts.

What would settle it

An experiment in which DetailAnywhere fails to outperform state-of-the-art open-source methods on FDBench quantitative metrics or in human evaluations would show the distillation approach does not solve the semantic gap.

Figures

Figures reproduced from arXiv: 2607.02220 by Boheng Zhang, Dewen Fan, Fan Yang, Honglie Wang, Huaiqing Wang, Huan Ouyang, Jia Sun, Jiyuan Wang, Junlong Wu, Pengcheng Wei, Qianqian Gan, Tingting Gao, Yimin Zhou, Yongrui Heng, Zijun Li.

Figure 1: Fashion Detail Generation. Given a reference garment image with a bounding box indicating the target region, our model generates a high-fidelity, identity-consistent close-up of the indicated area.
Figure 2: Overview of DetailAnywhere. Our framework consists of two stages: (1) CFAD trains a view-bridging teacher to align reference and detail representations, then distills this alignment into both branches of the generator; (2) A consistency reward model fine-tunes the generator via Negative-aware FineTuning to further improve identity preservation and detail fidelity.
Figure 3: Qualitative comparison. Side-by-side visualization of detail generation results from JoyAI-Image-Edit [77], Qwen-Image-Edit (PE) [14], FireRed-Image-Edit (PE) [76], Seedream 5.0 Lite [80], and DetailAnywhere (Ours). Our method produces identity-consistent, photorealistic close-ups with faithful texture and structural preservation.
Figure 4: Separation ratio across MMDiT layers. Ratio of the separation score (w/ CFAD / w/o CFAD) for image and text branches. Both branches peak at the alignment target layer 16 (l*), with the text branch reaching 58.7×.
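
The per-layer ratio plotted in Figure 4 is simple arithmetic: the separation score with CFAD divided by the score without it, layer by layer. The sketch below shows that computation on made-up toy numbers. The source does not define the separation score itself, and the helper names and input values are hypothetical, not the paper's measurements.

```python
def separation_ratio(with_cfad, without_cfad):
    """Per-layer ratio of separation scores (w/ CFAD divided by w/o CFAD)."""
    return {layer: with_cfad[layer] / without_cfad[layer] for layer in with_cfad}

def peak_layer(ratios):
    """Layer index at which the ratio is largest."""
    return max(ratios, key=ratios.get)

# Toy scores keyed by layer index; values are illustrative only.
with_cfad = {8: 4.0, 16: 12.0, 24: 6.0}
without_cfad = {8: 2.0, 16: 1.0, 24: 3.0}

ratios = separation_ratio(with_cfad, without_cfad)
best = peak_layer(ratios)
```

With these toy inputs the ratio is largest at layer 16, which mirrors the shape of the curve the caption describes without reproducing its values.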
Figure 5: Additional Fashion Detail Visualizations.
Abstract

Diffusion-based generative AI has achieved remarkable success in e-commerce applications such as virtual try-on, poster generation, and product background synthesis. However, when making online purchasing decisions for apparel, consumers also desire the freedom to examine specific detail regions of interest, such as collars, cuffs, and fabric textures, yet existing methods have not explicitly studied this setting. We therefore formalize a new, non-template task: Fashion Detail Generation with focus conditioning, and release FDBench, the first benchmark comprising 40K+ human-verified reference-detail pairs across 41 different categories. This task poses a unique semantic gap challenge: the model must bridge the correspondence between a focus marker on a product reference image and a photorealistic close-up view of the indicated region, while faithfully preserving the garment's identity, without any precise prompt. To bridge this gap, we propose Cross-modal Feature Alignment Distillation (CFAD), which leverages a fine-tuned DINOv3 teacher to align both branches of a Multimodal Diffusion Transformer in a shared semantic space via dual-branch distillation. To further improve consistency between generated details and reference images, we introduce a consistency reward model that jointly scores image pairs along three quality axes and optimizes generation via reinforcement learning. Experiments show that our model DetailAnywhere significantly outperforms all state-of-the-art open-source methods across all metrics and human evaluations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The paper formalizes a new task called Fashion Detail Generation with focus conditioning for e-commerce apparel, releases the FDBench benchmark with 40K+ human-verified reference-detail pairs across 41 categories, proposes Cross-modal Feature Alignment Distillation (CFAD) that uses a fine-tuned DINOv3 teacher to align both branches of a Multimodal Diffusion Transformer in a shared semantic space, introduces a consistency reward model scoring image pairs on three axes and optimized via reinforcement learning, and claims that the resulting DetailAnywhere model significantly outperforms all state-of-the-art open-source methods across all metrics and human evaluations.

Significance. If the outperformance claims hold under rigorous validation, the work would be significant for advancing controllable generative models in fashion e-commerce by addressing an unstudied detail-region task without requiring precise prompts, while providing the first dedicated benchmark; the distillation and RL consistency components represent a targeted technical approach to the semantic gap problem.

minor comments (1)
  1. The abstract states outperformance 'across all metrics' but does not name the specific metrics, baselines, or quantitative values, which hinders immediate assessment of the central empirical claim.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their review and for acknowledging the task formalization, FDBench benchmark, and the proposed CFAD distillation plus RL consistency components. We note that the report lists no specific major comments, only an overall summary and a conditional significance assessment. We address the recommendation of 'uncertain' by confirming that all experimental claims in the manuscript are supported by the provided metrics, human evaluations, and open-source baselines as described.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The provided information consists solely of the abstract, which describes the task, benchmark, and proposed method at a high level without any equations, derivations, or specific mathematical steps. No load-bearing claims reduce to self-definitions or fitted inputs by construction, as no such details are available for inspection. The central claim of outperformance is presented as an experimental result rather than a derived prediction from inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5820 in / 1158 out tokens · 30385 ms · 2026-07-03T15:54:53.009832+00:00 · methodology

Reference graph

Works this paper leans on

101 extracted references · 29 canonical work pages · 15 internal anchors

  1. [1]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS), 2020

  2. [2]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

  3. [3]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorber, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In International Conference on Machine Learning (ICML), 2024

  4. [4]

    Viton-hd: High-resolution virtual try-on via misalignment-aware normalization

    Seunghwan Choi, Sunghyun Park, Minsoo Lee, and Jaegul Choo. Viton-hd: High-resolution virtual try-on via misalignment-aware normalization. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021

  5. [5]

    Tryondiffusion: A tale of two unets

    Luyang Zhu, Dawei Yang, Tyler Zhu, Fitsum Reda, William Chan, Chitwan Saharia, Mohammad Norouzi, and Ira Kemelmacher-Shlizerman. Tryondiffusion: A tale of two unets. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

  6. [6]

    Improving diffusion models for authentic virtual try-on in the wild

    Yisol Choi, Sangkyung Kwak, Kyungmin Lee, Hyungwon Choi, and Jinwoo Shin. Improving diffusion models for authentic virtual try-on in the wild. In European Conference on Computer Vision (ECCV), 2024

  7. [7]

    Postermaker: Towards high-quality product poster generation with accurate text rendering

    Yifan Gao, Zihang Lin, Chuanbin Liu, Min Zhou, Tiezheng Ge, Bo Zheng, and Hongtao Xie. Postermaker: Towards high-quality product poster generation with accurate text rendering. arXiv preprint arXiv:2504.06632, 2025

  8. [8]

    Posteromni: Generalized artistic poster creation via task distillation and unified reward feedback

    Sixiang Chen, Jianyu Lai, Jialin Gao, Hengyu Shi, Zhongying Liu, Tian Ye, Junfeng Luo, Xiaoming Wei, and Lei Zhu. Posteromni: Generalized artistic poster creation via task distillation and unified reward feedback. arXiv preprint arXiv:2602.12127, 2026

  9. [9]

    Multimodal garment designer: Human-centric latent diffusion models for fashion image editing

    Alberto Baldrati, Davide Morelli, Giuseppe Cartella, Marcella Cornia, Marco Bertini, and Rita Cucchiara. Multimodal garment designer: Human-centric latent diffusion models for fashion image editing. In IEEE/CVF International Conference on Computer Vision (ICCV), 2023

  10. [10]

    Tim Brooks, Aleksander Holynski, and Alexei A. Efros. Instructpix2pix: Learning to follow image editing instructions. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

  11. [11]

    Enhancing virtual try-on with synthetic pairs and error-aware noise scheduling

    Nannan Li, Kevin J Shih, and Bryan A Plummer. Enhancing virtual try-on with synthetic pairs and error-aware noise scheduling. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21238–21247, 2025

  12. [12]

    Deepfashion: Powering robust clothes recognition and retrieval with rich annotations

    Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016

  13. [13]

    E-comiq-zh: A human-aligned dataset and benchmark for fine-grained evaluation of e-commerce posters with chain-of-thought

    Meiqi Sun, Mingyu Li, and Junxiong Zhu. E-comiq-zh: A human-aligned dataset and benchmark for fine-grained evaluation of e-commerce posters with chain-of-thought. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

  14. [14]

    Qwen-Image Technical Report

    Qwen Team. Qwen-image technical report. arXiv preprint arXiv:2508.02324, 2025

  15. [15]

    FLUX.2: Frontier visual intelligence

    Black Forest Labs. FLUX.2: Frontier visual intelligence. https://bfl.ai/blog/flux-2, 2025

  16. [16]

    IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

    Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023

  17. [17]

    Multimodal representation alignment for image generation: Text-image interleaved control is easier than you think

    Liang Chen, Shuai Bai, Wenhao Chai, Weichu Xie, Haozhe Zhao, Leon Vinci, Junyang Lin, and Baobao Chang. Multimodal representation alignment for image generation: Text-image interleaved control is easier than you think. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6146–6156, 2025

  18. [18]

    Aligning text to image in diffusion models is easier than you think

    Jaa-Yeon Lee, Byunghee Cha, Jeongsol Kim, and Jong Chul Ye. Aligning text to image in diffusion models is easier than you think. arXiv preprint arXiv:2503.08250, 2025

  19. [19]

    Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models

    Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. ACM Transactions on Graphics (TOG), 42(4):1–10, 2023

  20. [20]

    Representation alignment for generation: Training diffusion transformers is easier than you think

    Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. In International Conference on Learning Representations (ICLR), 2025

  21. [21]

    Dinov2: Learning robust visual features without supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. Transactions on Machine Learning Research (TMLR), 2024

  22. [22]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In IEEE/CVF International Conference on Computer Vision (ICCV), 2021

  23. [23]

    Distilling the knowledge in a neural network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning Workshop, 2015

  24. [24]

    DiffusionNFT: Online Diffusion Reinforcement with Forward Process

    Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, and Ming-Yu Liu. Diffusionnft: Online diffusion reinforcement with forward process. arXiv preprint arXiv:2509.16117, 2025

  25. [25]

    Diffusion models beat gans on image synthesis

    Prafulla Dhariwal and Alex Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems (NeurIPS), 2021

  26. [26]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022

  27. [27]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), 2021

  28. [28]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In IEEE/CVF International Conference on Computer Vision (ICCV), 2023

  29. [29]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. International Conference on Learning Representations (ICLR), 2023

  30. [30]

    Edit in 2D, Verify in 3D: Reinforcement Learning for Multi-view Consistent Scene Editing

    Jiyuan Wang, Chunyu Lin, Lei Sun, Zhi Cao, Yuyang Yin, Lang Nie, Zhenlong Yuan, Xiangxiang Chu, Yunchao Wei, Kang Liao, et al. Geometry-guided reinforcement learning for multi-view consistent 3d scene editing. arXiv preprint arXiv:2603.03143, 2026

  31. [31]

    Learning to generate stylized handwritten text via a unified representation of style, content, and noise

    Honglie Wang, Yan-Ming Zhang, Wangzi Yao, Fei Yin, and Cheng-Lin Liu. Learning to generate stylized handwritten text via a unified representation of style, content, and noise. In The Fourteenth International Conference on Learning Representations, 2026

  32. [32]

    Template-guided cascaded diffusion for stylized handwritten chinese text-line generation

    Honglie Wang, Minsi Ren, Yan-Ming Zhang, Fei Yin, and Cheng-Lin Liu. Template-guided cascaded diffusion for stylized handwritten chinese text-line generation. In International Conference on Document Analysis and Recognition, pages 149–166. Springer, 2025

  33. [33]

    Sculpting features from noise: Reward-guided hierarchical diffusion for task-optimal feature transformation

    Nanxu Gong, Zijun Li, Sixun Dong, Haoyue Bai, Wangyang Ying, Xinyuan Wang, and Yanjie Fu. Sculpting features from noise: Reward-guided hierarchical diffusion for task-optimal feature transformation. Advances in Neural Information Processing Systems, 38:23452–23474, 2026

  34. [34]

    From editor to dense geometry estimator

    JiYuan Wang, Chunyu Lin, Lei Sun, Rongying Liu, Lang Nie, Mingxing Li, Kang Liao, Xiangxiang Chu, and Yao Zhao. From editor to dense geometry estimator. arXiv preprint arXiv:2509.04338, 2025

  35. [35]

    Symmcompletion: High-fidelity and high-consistency point cloud completion with symmetry guidance

    Hongyu Yan, Zijun Li, Kunming Luo, Li Lu, and Ping Tan. Symmcompletion: High-fidelity and high-consistency point cloud completion with symmetry guidance. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 9094–9102, 2025

  36. [36]

    Diffpc: Diffusion-based high perceptual fidelity image compression with semantic refinement

    Yichong Xia, Yimin Zhou, Jinpeng Wang, Baoyi An, Haoqian Wang, Yaowei Wang, and Bin Chen. Diffpc: Diffusion-based high perceptual fidelity image compression with semantic refinement. In The Thirteenth International Conference on Learning Representations, 2025

  37. [37]

    Adding conditional control to text-to-image diffusion models

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023

  38. [38]

    Sdedit: Guided image synthesis and editing with stochastic differential equations

    Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations (ICLR), 2022

  39. [39]

    Diffpcn: Latent diffusion model based on multi-view depth images for point cloud completion

    Zijun Li, Hongyu Yan, Shijie Li, Kunming Luo, Li Lu, Xulei Yang, and Weisi Lin. Diffpcn: Latent diffusion model based on multi-view depth images for point cloud completion. arXiv preprint arXiv:2509.23723, 2025

  40. [40]

    Jasmine: Harnessing diffusion prior for self-supervised depth estimation

    Jiyuan Wang, Chunyu Lin, Cheng Guan, Lang Nie, Jing He, Haodong Li, Kang Liao, and Yao Zhao. Jasmine: Harnessing diffusion prior for self-supervised depth estimation. arXiv preprint arXiv:2503.15905, 2025

  41. [41]

    Posemaster: Generating 3d characters in arbitrary poses from a single image

    Hongyu Yan, Kunming Luo, Weiyu Li, Yixun Liang, Shengming Li, Jingwei Huang, Chunchao Guo, and Ping Tan. Posemaster: Generating 3d characters in arbitrary poses from a single image. arXiv preprint arXiv:2506.21076, 2025

  42. [42]

    Digging into contrastive learning for robust depth estimation with diffusion models

    Jiyuan Wang, Chunyu Lin, Lang Nie, Kang Liao, Shuwei Shao, and Yao Zhao. Digging into contrastive learning for robust depth estimation with diffusion models. In Proceedings of the 32nd ACM International Conference on Multimedia, MM ’24, page 4129–4137. ACM, October 2024. doi: 10.1145/3664647.3681168. URL http://dx.doi.org/10.1145/3664647.3681168

  43. [43]

    Weatherdepth: Curriculum contrastive learning for self-supervised depth estimation under adverse weather conditions

    Jiyuan Wang, Chunyu Lin, Lang Nie, Shujun Huang, Yao Zhao, Xing Pan, and Rui Ai. Weatherdepth: Curriculum contrastive learning for self-supervised depth estimation under adverse weather conditions. In 2024 IEEE International Conference on Robotics and Automation (ICRA), page 4976–4982. IEEE, May 2024. doi: 10.1109/icra57147.2024.10611100. URL http://dx.d...

  44. [44]

    Denoising diffusion autoencoders are unified self-supervised learners

    Weilai Xiang, Hongyu Yang, Di Huang, and Yunhong Wang. Denoising diffusion autoencoders are unified self-supervised learners. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023

  45. [45]

    Emergent correspondence from image diffusion

    Luming Tang, Menglin Jia, Qianqian Wang, Cheng Perng Phoo, and Bharath Hariharan. Emergent correspondence from image diffusion. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  46. [46]

    Cross modal distillation for supervision transfer

    Saurabh Gupta, Judy Hoffman, and Jitendra Malik. Cross modal distillation for supervision transfer. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016

  47. [47]

    Stableviton: Learning semantic correspondence with latent diffusion model for virtual try-on

    Jeongho Kim, Gyojung Gu, Minho Park, Sunghyun Park, and Jaegul Choo. Stableviton: Learning semantic correspondence with latent diffusion model for virtual try-on. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  48. [48]

    Catvton: Concatenation is all you need for virtual try-on with diffusion models

    Zheng Chong, Xiao Dong, Haoxiang Li, Shiyue Zhang, Wenqing Zhang, Xujie Zhang, Hanqing Zhao, Dongmei Jiang, and Xiaodan Liang. Catvton: Concatenation is all you need for virtual try-on with diffusion models. In International Conference on Learning Representations (ICLR), 2025

  49. [49]

    Ladi-vton: Latent diffusion textual-inversion enhanced virtual try-on

    Davide Morelli, Alberto Baldrati, Giuseppe Cartella, Marcella Cornia, Marco Bertini, and Rita Cucchiara. Ladi-vton: Latent diffusion textual-inversion enhanced virtual try-on. In ACM Multimedia, 2023

  50. [50]

    Texture-preserving diffusion models for high-fidelity virtual try-on

    Xu Yang, Changxing Ding, Zhibin Hong, Junhao Huang, Jin Tao, and Xiangmin Xu. Texture-preserving diffusion models for high-fidelity virtual try-on. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  51. [51]

    Training diffusion models with reinforcement learning

    Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. In International Conference on Learning Representations (ICLR), 2024

  52. [52]

    Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models

    Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  53. [53]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  54. [54]

    Imagereward: Learning and evaluating human preferences for text-to-image generation

    Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems (NeurIPS), 2023

  55. [55]

    Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

    Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341, 2023

  56. [56]

    Editreward: A human-aligned reward model for instruction-guided image editing

    Keming Wu, Sicong Jiang, Max Ku, Ping Nie, Minghao Liu, and Wenhu Chen. Editreward: A human-aligned reward model for instruction-guided image editing. arXiv preprint arXiv:2509.26346, 2025

  57. [57]

    Editscore: Unlocking online RL for image editing via high-fidelity reward modeling

    Xin Luo, Jiahao Wang, Chenyuan Wu, Shitao Xiao, Xiyan Jiang, Defu Lian, Jiajun Zhang, Dong Liu, and Zheng Liu. Editscore: Unlocking online RL for image editing via high-fidelity reward modeling. arXiv preprint arXiv:2509.23909, 2025

  58. [58]

    Deepfm: A factorization-machine based neural network for ctr prediction

    Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. Deepfm: A factorization-machine based neural network for ctr prediction. In International Joint Conference on Artificial Intelligence (IJCAI), 2017

  59. [59]

    Autopp: Towards automated product poster generation and optimization

    Jiahao Fan, Yuxin Qin, Wei Feng, Yanyin Chen, Yaoyu Li, Ao Ma, Yixiu Li, Li Zhuang, Haoyi Bian, Zheng Zhang, Jingjing Lv, Junjie Shen, and Ching Law. Autopp: Towards automated product poster generation and optimization, 2025. URL https://arxiv.org/abs/2512.21921

  60. [60]

    Vlm-r3: Region recognition, reasoning, and refinement for enhanced multimodal chain-of-thought

    Chaoya Jiang, Yongrui Heng, Wei Ye, Han Yang, Haiyang Xu, Ming Yan, Ji Zhang, Fei Huang, and Shikun Zhang. Vlm-r3: Region recognition, reasoning, and refinement for enhanced multimodal chain-of-thought. Advances in Neural Information Processing Systems, 38:63841–63869, 2025

  61. [61]

    Arc: Robots adaptive risk-aware robust control via distributional reinforcement learning

    Junlong Wu, Yi Cheng, Hang Liu, and Houde Liu. Arc: Robots adaptive risk-aware robust control via distributional reinforcement learning. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 10656–10663. IEEE, 2025

  62. [62]

    CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

    Jiyuan Wang, Huan Ouyang, Jiuzhou Lin, Chunyu Lin, Dewen Fan, Boheng Zhang, Haonan Fan, Fei Zuo, Jia Sun, Huaiqing Wang, et al. Cac: Advancing video reward models via hierarchical spatiotemporal concentrating. arXiv preprint arXiv:2605.11723, 2026

  63. [63]

    High quality underwater image compression with adaptive color correction

    Yimin Zhou, Yichong Xia, Sicheng Pan, Bin Chen, Yaowei Li, Jiawei Li, Mingyao Hong, Zhi Wang, and Yaowei Wang. High quality underwater image compression with adaptive color correction. In ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8587–8591. IEEE, 2026

  64. [64]

    EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

    Yongrui Heng, Chaoya Jiang, Han Yang, Shikun Zhang, and Wei Ye. Eve: Verifiable self-evolution of mllms via executable visual transformations. arXiv preprint arXiv:2604.18320, 2026

  65. [65]

    FLUX

    Black Forest Labs. FLUX. https://github.com/black-forest-labs/flux, 2024

  66. [66]

    Text embedding is not all you need: Attention control for text-to-image semantic alignment with text self-attention maps

    Jeeyung Kim, Erfan Esmaeili, and Qiang Qiu. Text embedding is not all you need: Attention control for text-to-image semantic alignment with text self-attention maps. arXiv preprint arXiv:2411.15236, 2024

  67. [67]

    Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers

    Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18262–18272, 2025

  68. [68]

    Diffusion Transformers with Representation Autoencoders

    Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690, 2025

  69. [69]

    Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025

  70. [70]

    HPSv3: Towards wide-spectrum human preference score

    Yuhang Ma et al. HPSv3: Towards wide-spectrum human preference score. arXiv preprint arXiv:2508.03789, 2025

  71. [71]

    ImgEdit: A Unified Image Editing Dataset and Benchmark

    Yang Ye, Xianyi He, Zongjian Li, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, and Li Yuan. Imgedit: A unified image editing dataset and benchmark. arXiv preprint arXiv:2505.20275, 2025

  72. [72]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018

  73. [73]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025

  74. [74]

    Step1X-Edit: A Practical Framework for General Image Editing

    Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, et al. Step1x-edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761, 2025

  75. [75]

    In-context edit: Enabling instructional image editing with in-context generation in large scale diffusion transformer

    Zechuan Zhang, Ji Xie, Yu Lu, Zongxin Yang, and Yi Yang. In-context edit: Enabling instructional image editing with in-context generation in large scale diffusion transformer. In NeurIPS, 2025

  76. [76]

    FireRed-Image-Edit-1.0 technical report

    Super Intelligence Team. FireRed-Image-Edit-1.0 technical report. arXiv preprint arXiv:2602.13344, 2026

  77. [77]

    JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

    Lin Song, Wenbo Li, Guoqing Ma, Wei Tang, Bo Wang, Yuan Zhang, Yijun Yang, Yicheng Xiao, Jianhui Liu, et al. Awaking spatial intelligence in unified multimodal understanding and generation. arXiv preprint arXiv:2605.04128, 2026

  78. [78]

    GPT-Image-1: Image generation in the OpenAI API

    OpenAI. GPT-Image-1: Image generation in the OpenAI API. https://openai.com/index/image-generation-api/, 2025

  79. [79]

    Nano Banana 2 (Gemini 3.1 Flash Image Preview)

    Google DeepMind. Nano Banana 2 (Gemini 3.1 Flash Image Preview). https://ai.google.dev/gemini-api/docs/image-generation, 2026. Model ID: gemini-3.1-flash-image-preview; accessed 2026-05-07

  80. [80]

    Seedream 5.0 Lite

    ByteDance Seed. Seedream 5.0 Lite. https://seed.bytedance.com/en/seedream5_0_lite, 2026

Showing first 80 references.