Compositional Text-to-Image Generation Via Region-aware Bimodal Direct Preference Optimization
Pith reviewed 2026-06-29 13:13 UTC · model grok-4.3
The pith
BiDPO jointly optimizes image and text preferences with region guidance to raise compositional fidelity in text-to-image models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By building the BiComp preference dataset and extending Diffusion DPO to jointly optimize image and text preferences under region-aware guidance, BiDPO substantially raises the compositional fidelity of text-to-image generation and consistently outperforms prior methods on multiple benchmarks.
What carries the argument
BiDPO, the bimodal extension of Diffusion DPO that adds joint image-text preference optimization and region-level guidance applied to the BiComp dataset.
If this is right
- Models fine-tuned with BiDPO achieve higher accuracy on prompts involving attribute bindings, object relationships, and counting.
- Region-level guidance produces finer alignment between text concepts and specific image areas than global preference signals alone.
- Joint image-and-text preference optimization proves more effective for complex prompt following than single-modality DPO variants.
- The overall pipeline offers a flexible, scalable alternative to architectural modifications for compositional text-to-image tasks.
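The preference and region-guidance ideas in the list above can be made concrete with a small sketch. This page does not reproduce the paper's objective, so the code assumes the standard Diffusion-DPO form: a logistic loss on how much more the fine-tuned model improves its denoising error over a frozen reference on the preferred sample than on the rejected one. The region weighting is a hypothetical addition for illustration. The function names, the `beta` default, and the masking scheme are all assumptions, not the authors' implementation.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def region_weighted_error(pred_noise, true_noise, region_mask=None):
    """Weighted mean squared error between predicted and true noise.

    Inputs are flat lists of floats. `region_mask` (hypothetical) holds
    non-negative weights; a larger weight emphasizes that position, e.g. a
    pixel inside a region tied to a compositional concept. With no mask,
    this is the plain mean squared error.
    """
    if region_mask is None:
        region_mask = [1.0] * len(pred_noise)
    total_weight = sum(region_mask)
    if total_weight <= 0:
        raise ValueError("region mask must have positive total weight")
    weighted = sum(
        m * (p - t) ** 2 for p, t, m in zip(pred_noise, true_noise, region_mask)
    )
    return weighted / total_weight

def preference_loss(err_win, err_win_ref, err_lose, err_lose_ref, beta=1.0):
    """Diffusion-DPO-style loss (assumed form) for one preference pair.

    err_win / err_lose: fine-tuned model's denoising error on the preferred
    and rejected sample. err_*_ref: the frozen reference model's error on the
    same samples. `beta` folds together the usual strength and timestep
    weighting constants.
    """
    # A negative term means the model denoises that sample better than the reference.
    margin = (err_win - err_win_ref) - (err_lose - err_lose_ref)
    # -log(sigmoid(-beta * margin)) equals softplus(beta * margin).
    return softplus(beta * margin)
```

When the model and the reference have identical errors the loss is log 2; it falls below that value as the model improves on the preferred sample relative to the rejected one. A text-preference term could plausibly reuse `preference_loss` with errors computed for one image under a preferred and a rejected prompt, but whether BiDPO combines the two terms this way is not stated here.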
Where Pith is reading between the lines
- The dataset-construction pipeline could be reused or adapted for other generation domains that need quality-controlled preference pairs.
- If the gains hold, similar bimodal preference tuning might reduce reliance on post-generation correction or prompt engineering in production image systems.
- The region-guidance component suggests a general route for injecting localized supervision into diffusion fine-tuning without full pixel-level labels.
Load-bearing premise
The carefully controlled pipeline used to build the BiComp preference dataset yields training data that effectively improves model behavior.
What would settle it
An evaluation in which BiDPO fails to exceed the compositional scores of prior methods on the same benchmarks would falsify the central performance claim.
Original abstract
Despite the rapid progress of text-to-image (T2I) models, generating images that accurately reflect complex compositional prompts (covering attribute bindings, object relationships, counting) still remains challenging. To address this, we propose BiDPO, a framework to enhance T2I model's capability of compositional text-to-image generation. We begin by introducing an carefully designed pipeline to construct a large-scale preference dataset, BiComp, with strictly quality control. Then, we extend Diffusion DPO to jointly optimize image and text preferences, which is shown to greatly effective in improving the models to follow complex text prompt in generation. To further enhance the models for fine-grained alignment, we employ a region-level guidance method to focus on regions relevant to compositional concepts. Experimental results demonstrate that our BiDPO substantially improves compositional fidelity, consistently outperforming prior methods across multiple benchmarks. Our approach highlights the potential of preference-based fine-tuning for complex text-to-image tasks, offering a flexible and scalable alternative to existing techniques.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes BiDPO for compositional text-to-image generation. It introduces a pipeline to build the large-scale BiComp preference dataset under strict quality control, extends Diffusion DPO to jointly optimize image and text preferences, and adds region-level guidance for fine-grained alignment. Experiments claim that BiDPO substantially improves compositional fidelity and outperforms prior methods on multiple benchmarks.
Significance. If the BiComp dataset and reported gains prove robust, the bimodal DPO extension combined with region guidance would supply a scalable, preference-based alternative to existing compositional T2I techniques, with clear potential to improve attribute binding, relations, and counting in diffusion models.
major comments (1)
- [BiComp dataset construction (Section 3)] The central performance claims rest on training with the BiComp dataset, yet the manuscript provides no concrete quality-control criteria (annotation rubrics, automated filters, human verification protocol, or inter-annotator agreement statistics) for the preference pairs. Without these details it is impossible to assess whether observed improvements arise from the bimodal DPO or region guidance rather than data curation artifacts.
minor comments (2)
- [Abstract] Abstract contains a grammatical error: 'which is shown to greatly effective' should read 'which is shown to be greatly effective'.
- [Method] Notation for the bimodal preference objective and the region guidance term should be introduced with explicit equations rather than prose descriptions only.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the single major comment below regarding the BiComp dataset.
Point-by-point responses
- Referee: [BiComp dataset construction (Section 3)] The central performance claims rest on training with the BiComp dataset, yet the manuscript provides no concrete quality-control criteria (annotation rubrics, automated filters, human verification protocol, or inter-annotator agreement statistics) for the preference pairs. Without these details it is impossible to assess whether observed improvements arise from the bimodal DPO or region guidance rather than data curation artifacts.
  Authors: We agree that the current manuscript lacks sufficient concrete details on the quality-control criteria for BiComp, which limits the ability to evaluate dataset quality independently. In the revised manuscript we will expand Section 3 with a dedicated subsection that specifies the annotation rubrics (including explicit criteria for compositional accuracy, preference ordering, and rejection rules), the automated filters (e.g., CLIP-score thresholds, object-detection consistency checks, and duplicate removal), the human verification protocol (number of annotators, qualification tests, and review workflow), and inter-annotator agreement statistics (e.g., Cohen’s kappa or percentage agreement). These additions will allow readers to assess whether gains derive from the proposed BiDPO and region guidance rather than curation artifacts.
  revision: yes
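As a point of reference for the agreement statistic named in the response, the following is a minimal sketch of Cohen's kappa for two annotators labelling the same preference pairs. The label values and the function name are illustrative assumptions; nothing here reflects the BiComp annotation protocol, which the manuscript does not yet describe.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    rate and p_e is the agreement expected by chance from each annotator's
    label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Counter returns 0 for labels the second annotator never used.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout; kappa is
        # undefined, and this sketch returns 1.0 by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

For example, with annotator A choosing `["first", "first", "second", "second"]` and annotator B choosing `["first", "second", "second", "second"]`, the observed agreement is 0.75, the chance agreement is 0.5, and kappa is 0.5. Unlike raw percentage agreement, kappa discounts agreement that would occur by chance when one label dominates.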
Circularity Check
No significant circularity; the derivation is a self-contained empirical pipeline.
full rationale
The paper constructs BiComp via an external pipeline, extends Diffusion DPO (presumably from prior independent work), adds region guidance, and reports benchmark gains. No equation or claim reduces by construction to a fitted input, self-citation chain, or renamed ansatz; results are presented as empirical outcomes on held-out benchmarks rather than tautological re-derivations of the training data or method definition.