
arxiv: 2605.28615 · v1 · pith:GTMUY7SOnew · submitted 2026-05-27 · 💻 cs.CV

Compositional Text-to-Image Generation Via Region-aware Bimodal Direct Preference Optimization

Pith reviewed 2026-06-29 13:13 UTC · model grok-4.3

classification 💻 cs.CV
keywords compositional text-to-image generation · direct preference optimization · BiDPO · BiComp dataset · region-aware guidance · diffusion models · preference-based fine-tuning · attribute binding

The pith

BiDPO jointly optimizes image and text preferences with region guidance to raise compositional fidelity in text-to-image models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the persistent difficulty text-to-image models face when prompts require precise attribute bindings, spatial relationships, or object counts. It constructs a large-scale preference dataset BiComp under strict quality controls, then extends Diffusion DPO into a bimodal optimizer that aligns both generated images and text descriptions while adding region-level guidance to focus on relevant image patches. If successful, this preference-tuning route produces models that follow complex prompts more reliably than earlier methods across standard benchmarks. A reader would care because the approach supplies a scalable fine-tuning path that avoids major architectural redesigns.

Core claim

By building the BiComp preference dataset and extending Diffusion DPO to jointly optimize image and text preferences under region-aware guidance, BiDPO substantially raises the compositional fidelity of text-to-image generation and consistently outperforms prior methods on multiple benchmarks.

What carries the argument

BiDPO, a bimodal extension of Diffusion DPO that adds joint image-text preference optimization and region-level guidance, trained on the BiComp dataset.
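
For orientation, the single-modality objective that BiDPO builds on can be written out. The first display is the Diffusion DPO loss as published by Wallace et al., with winner and loser images $x^w$, $x^l$ for a prompt $c$. The abstract does not state BiDPO's own objective, so the second display is only an assumed illustration of what a bimodal, region-weighted variant could look like: the mask $m$, the preferred and rejected captions $c^w$, $c^l$, and the weight $\gamma$ are hypothetical symbols, not the paper's notation.

```latex
% Diffusion DPO (Wallace et al.): image preference under a shared prompt c
\[
\Delta_\theta(x_t, \epsilon, c)
  = \lVert \epsilon - \epsilon_\theta(x_t, t, c) \rVert_2^2
  - \lVert \epsilon - \epsilon_{\mathrm{ref}}(x_t, t, c) \rVert_2^2
\]
\[
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}\,\log \sigma\!\Big(
      -\beta T \omega(\lambda_t)\,
      \big[ \Delta_\theta(x_t^w, \epsilon^w, c) - \Delta_\theta(x_t^l, \epsilon^l, c) \big]
    \Big)
\]

% ASSUMPTION (illustrative only): region mask m and a text-preference term
\[
\Delta_\theta^{m}(x_t, \epsilon, c)
  = \lVert m \odot (\epsilon - \epsilon_\theta(x_t, t, c)) \rVert_2^2
  - \lVert m \odot (\epsilon - \epsilon_{\mathrm{ref}}(x_t, t, c)) \rVert_2^2
\]
\[
\mathcal{L}_{\mathrm{img}} = -\,\mathbb{E}\,\log \sigma\!\big(
   -\beta \big[ \Delta_\theta^{m}(x_t^w, \epsilon^w, c) - \Delta_\theta^{m}(x_t^l, \epsilon^l, c) \big] \big),
\qquad
\mathcal{L}_{\mathrm{txt}} = -\,\mathbb{E}\,\log \sigma\!\big(
   -\beta \big[ \Delta_\theta^{m}(x_t, \epsilon, c^w) - \Delta_\theta^{m}(x_t, \epsilon, c^l) \big] \big)
\]
\[
\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{img}} + \gamma\, \mathcal{L}_{\mathrm{txt}}
\]
```

In the assumed variant, the image term compares two images under one prompt and the text term compares two captions for one image. Setting $m$ to all ones and $\gamma = 0$ recovers the shape of the published loss, up to the constant $T\omega(\lambda_t)$.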

If this is right

  • Models fine-tuned with BiDPO achieve higher accuracy on prompts involving attribute bindings, object relationships, and counting.
  • Region-level guidance produces finer alignment between text concepts and specific image areas than global preference signals alone.
  • Joint image-and-text preference optimization proves more effective for complex prompt following than single-modality DPO variants.
  • The overall pipeline offers a flexible, scalable alternative to architectural modifications for compositional text-to-image tasks.
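
The contrast between region-level and global preference signals can be made concrete with a toy calculation. The sketch below is not the paper's implementation: it assumes a Diffusion-DPO-style loss on noise-prediction errors and a hypothetical per-cell mask that weights the squared error, with all function names invented for illustration. It uses small 2D lists in place of latent tensors.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def weighted_sq_error(target, pred, mask):
    """Mask-weighted mean squared error over a 2D grid (list of rows)."""
    num = 0.0
    den = 0.0
    for t_row, p_row, m_row in zip(target, pred, mask):
        for t, p, m in zip(t_row, p_row, m_row):
            num += m * (t - p) ** 2
            den += m
    return num / den if den > 0 else 0.0


def region_dpo_loss(noise_w, pred_w, ref_w, mask_w,
                    noise_l, pred_l, ref_l, mask_l, beta=1.0):
    """Hypothetical region-weighted DPO loss for one (winner, loser) pair.

    delta = policy error minus reference error, restricted to the mask.
    The loss is -log(sigmoid(-beta * (delta_w - delta_l))).
    """
    delta_w = (weighted_sq_error(noise_w, pred_w, mask_w)
               - weighted_sq_error(noise_w, ref_w, mask_w))
    delta_l = (weighted_sq_error(noise_l, pred_l, mask_l)
               - weighted_sq_error(noise_l, ref_l, mask_l))
    logit = -beta * (delta_w - delta_l)
    return softplus(-logit)  # equals -log(sigmoid(logit))


# Toy example: the policy is exact inside the top-left cell, poor elsewhere.
noise = [[1.0, 0.0], [0.0, 0.0]]
policy = [[1.0, 5.0], [5.0, 5.0]]
reference = [[0.0, 0.0], [0.0, 0.0]]
region = [[1.0, 0.0], [0.0, 0.0]]
everywhere = [[1.0, 1.0], [1.0, 1.0]]

# The loser branch uses policy == reference, so its delta is zero.
local = region_dpo_loss(noise, policy, reference, region,
                        noise, reference, reference, region)
global_ = region_dpo_loss(noise, policy, reference, everywhere,
                          noise, reference, reference, everywhere)
print(local, global_)
```

With the region mask, only the cell tied to the concept counts, so the toy policy is rewarded for getting that cell right. With the all-ones mask, its errors elsewhere dominate and the loss is large. The mean normalization inside `weighted_sq_error` is a design choice made here so that masks of different sizes stay comparable.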

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The dataset-construction pipeline could be reused or adapted for other generation domains that need quality-controlled preference pairs.
  • If the gains hold, similar bimodal preference tuning might reduce reliance on post-generation correction or prompt engineering in production image systems.
  • The region-guidance component suggests a general route for injecting localized supervision into diffusion fine-tuning without full pixel-level labels.
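
One way to picture localized supervision without pixel-level labels is to start from coarse bounding boxes and rasterize them onto the grid a diffusion model works on. The sketch below assumes boxes in pixel coordinates from some detector; the function name, the box format, and the grid size are all hypothetical and not taken from the paper.

```python
def box_to_mask(box, image_size, grid_size):
    """Mark every grid cell that overlaps a pixel-space bounding box.

    box: (x0, y0, x1, y1) in pixels, with x1 > x0 and y1 > y0.
    image_size: (width, height) in pixels.
    grid_size: (cols, rows) of the coarse grid, e.g. a latent resolution.
    Returns a list of rows holding 1.0 for overlapping cells, else 0.0.
    """
    x0, y0, x1, y1 = box
    width, height = image_size
    cols, rows = grid_size
    cell_w = width / cols
    cell_h = height / rows
    mask = []
    for r in range(rows):
        row = []
        for c in range(cols):
            cx0, cy0 = c * cell_w, r * cell_h
            cx1, cy1 = cx0 + cell_w, cy0 + cell_h
            # Strict inequalities: touching edges do not count as overlap.
            overlaps = x0 < cx1 and x1 > cx0 and y0 < cy1 and y1 > cy0
            row.append(1.0 if overlaps else 0.0)
        mask.append(row)
    return mask


# A box covering the top-left quarter of a 512x512 image on a 4x4 grid.
mask = box_to_mask((0, 0, 256, 256), (512, 512), (4, 4))
for row in mask:
    print(row)
```

A mask like this is much cheaper to obtain than a segmentation map, at the cost of including some background cells around the object.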

Load-bearing premise

The carefully controlled pipeline used to build the BiComp preference dataset yields training data that effectively improves model behavior.

What would settle it

An evaluation in which BiDPO fails to exceed the compositional scores of prior methods on the same benchmarks would falsify the central performance claim.

Original abstract

Despite the rapid progress of text-to-image (T2I) models, generating images that accurately reflect complex compositional prompts (covering attribute bindings, object relationships, counting) still remains challenging. To address this, we propose BiDPO, a framework to enhance T2I model's capability of compositional text-to-image generation. We begin by introducing an carefully designed pipeline to construct a large-scale preference dataset, BiComp, with strictly quality control. Then, we extend Diffusion DPO to jointly optimize image and text preferences, which is shown to greatly effective in improving the models to follow complex text prompt in generation. To further enhance the models for fine-grained alignment, we employ a region-level guidance method to focus on regions relevant to compositional concepts. Experimental results demonstrate that our BiDPO substantially improves compositional fidelity, consistently outperforming prior methods across multiple benchmarks. Our approach highlights the potential of preference-based fine-tuning for complex text-to-image tasks, offering a flexible and scalable alternative to existing techniques.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes BiDPO for compositional text-to-image generation. It introduces a pipeline to build the large-scale BiComp preference dataset under strict quality control, extends Diffusion DPO to jointly optimize image and text preferences, and adds region-level guidance for fine-grained alignment. Experiments claim that BiDPO substantially improves compositional fidelity and outperforms prior methods on multiple benchmarks.

Significance. If the BiComp dataset and reported gains prove robust, the bimodal DPO extension combined with region guidance would supply a scalable, preference-based alternative to existing compositional T2I techniques, with clear potential to improve attribute binding, relations, and counting in diffusion models.

major comments (1)
  1. [BiComp dataset construction (Section 3)] The central performance claims rest on training with the BiComp dataset, yet the manuscript provides no concrete quality-control criteria (annotation rubrics, automated filters, human verification protocol, or inter-annotator agreement statistics) for the preference pairs. Without these details it is impossible to assess whether observed improvements arise from the bimodal DPO or region guidance rather than data curation artifacts.
minor comments (2)
  1. [Abstract] Abstract contains a grammatical error: 'which is shown to greatly effective' should read 'which is shown to be greatly effective'.
  2. [Method] Notation for the bimodal preference objective and the region guidance term should be introduced with explicit equations rather than prose descriptions only.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the single major comment below regarding the BiComp dataset.

Point-by-point responses
  1. Referee: [BiComp dataset construction (Section 3)] The central performance claims rest on training with the BiComp dataset, yet the manuscript provides no concrete quality-control criteria (annotation rubrics, automated filters, human verification protocol, or inter-annotator agreement statistics) for the preference pairs. Without these details it is impossible to assess whether observed improvements arise from the bimodal DPO or region guidance rather than data curation artifacts.

    Authors: We agree that the current manuscript lacks sufficient concrete details on the quality-control criteria for BiComp, which limits the ability to evaluate dataset quality independently. In the revised manuscript we will expand Section 3 with a dedicated subsection that specifies the annotation rubrics (including explicit criteria for compositional accuracy, preference ordering, and rejection rules), the automated filters (e.g., CLIP-score thresholds, object-detection consistency checks, and duplicate removal), the human verification protocol (number of annotators, qualification tests, and review workflow), and inter-annotator agreement statistics (e.g., Cohen’s kappa or percentage agreement). These additions will allow readers to assess whether gains derive from the proposed BiDPO and region guidance rather than curation artifacts.

    Revision: yes
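
As a point of reference for the agreement statistic mentioned in the response, Cohen's kappa for two annotators is $\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed agreement rate and $p_e$ is the agreement expected by chance from each annotator's label frequencies. The sketch below is a minimal implementation of that standard formula; the label values and the example data are made up and do not come from BiComp.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from the two marginal label distributions.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical preference votes: which image of a pair each annotator chose.
annotator_1 = ["A", "A", "B", "B"]
annotator_2 = ["A", "A", "B", "A"]
print(cohens_kappa(annotator_1, annotator_2))  # 0.5
```

Here the annotators agree on 3 of 4 pairs (0.75), chance agreement is 0.5, and kappa is therefore 0.5. Raw percentage agreement alone would overstate reliability whenever one choice dominates.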

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained empirical pipeline

full rationale

The paper constructs BiComp via an external pipeline, extends Diffusion DPO (presumably from prior independent work), adds region guidance, and reports benchmark gains. No equation or claim reduces by construction to a fitted input, self-citation chain, or renamed ansatz; results are presented as empirical outcomes on held-out benchmarks rather than tautological re-derivations of the training data or method definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on free parameters, axioms, or invented entities; the approach relies on extending existing techniques.


