arxiv: 2604.24171 · v1 · submitted 2026-04-27 · 💻 cs.CV

Recognition: unknown

POCA: Pareto-Optimal Curriculum Alignment for Visual Text Generation

Yaohou Fan , Qingzhong Wang , Yongsong Huang , Junyi Liu , Tomo Miyazaki , Shinichiro Omachi

Authors on Pith no claims yet

Pith reviewed 2026-05-08 04:34 UTC · model grok-4.3

classification 💻 cs.CV

keywords visual text generationPareto optimizationcurriculum learningreinforcement learningmulti-reward alignmentimage coherencetext accuracy

0 comments

The pith

POCA improves visual text generation by identifying Pareto-optimal reward sets and sequencing prompts from easy to hard via automatic difficulty assessment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that multi-reward reinforcement learning for visual text generation becomes stable and effective when the problem is reframed as finding a Pareto-optimal set in unified reward space rather than using weighted sums, combined with an adaptive curriculum that orders training prompts by automatically assessed difficulty. A sympathetic reader would care because existing approaches force unstable trade-offs between accurate rendered text and coherent images, require manual reward weighting, and either demand large prompt sets or fail with small ones. POCA claims to solve the selection and balancing issues together so that limited-data training converges to better overall performance. If correct, this turns conflicting objectives into a structured easy-to-hard path without scalarization losses.

Core claim

POCA addresses the multi-objective alignment problem by identifying the Pareto-optimal set to avoid simple scalarization and by designing an adaptive curriculum alignment strategy to manage a learning sequence of a multi-reward dataset using automatic difficulty assessment. In synergy, POCA finds the Pareto-optimal set in a unified reward space, which eliminates inconsistent signals to find the best trade-off solution from different rewards under an easy-to-hard optimization landscape.

What carries the argument

The Pareto-optimal set identified in a unified reward space together with an adaptive curriculum strategy that orders prompts by automatic difficulty assessment.

Load-bearing premise

Automatically assessing prompt difficulty and constructing a Pareto-optimal set in a unified reward space will produce stable convergence and superior trade-offs in a limited-data reinforcement learning setting for text generation.

What would settle it

An experiment in which POCA fails to produce higher CLIP scores, HPS scores, and sentence accuracy than standard weighted-sum reinforcement learning baselines when both use the same limited prompt set would falsify the claim.

Figures

Figures reproduced from arXiv: 2604.24171 by Junyi Liu, Qingzhong Wang, Shinichiro Omachi, Tomo Miyazaki, Yaohou Fan, Yongsong Huang.

**Figure 1.** Figure 1: Illustration of the trade-off between image coher view at source ↗

**Figure 2.** Figure 2: Inspection of GRPO-based methods in multi-reward view at source ↗

**Figure 3.** Figure 3: Overview of POCA. Our framework operates in two stages. The Curriculum Planning stage assesses prompt difficulty and organizes the data into an ”easy-to-hard” sequence. This sequence is then fed to the Training stage, which uses bi-directional Pareto sorting to filter for the Pareto set for optimization. sive human-annotated preference data to supply dense preference signals that guide policy explorations… view at source ↗

**Figure 4.** Figure 4: Qualitative comparison of POCA and other methods on the AnyText-benchmark. view at source ↗

**Figure 8.** Figure 8: Human study results in terms of text accuracy, prompt following, and aesthetics, measured by human preference win rates. 5.3. Qualitative Analysis Comparison with Baselines: We further demonstrate the effectiveness of POCA on both Chinese and English prompts, using test cases from the AnyTextbenchmark [34] view at source ↗

**Figure 6.** Figure 6: Reward curve comparison. The GRPO baseline uses weighted sum approach to aggregate OCR, CLIP and HPS scores in a ratio of 1,1,1, which leads to suboptimal results in the OCR and CLIP rewards compared to our Pareto-guided approach. English benchmark (a) Sentence accuracy (b) CLIP score (c) HPS score Chinese benchmark GRPO baseline Pareto-guided POCA view at source ↗

**Figure 7.** Figure 7: Performance comparisons of POCA, the Pareto view at source ↗

**Figure 9.** Figure 9: Contribution to the global Pareto front. view at source ↗

**Figure 10.** Figure 10: Reward curves for different Pareto selection algorithms. view at source ↗

**Figure 11.** Figure 11: ECDFs for both the mean reward scores and the reward variances of OCR, CLIP, and HPS across the full training set. Vertical dashed lines indicate deciles (10%–100% in steps of 10%). 10. More Details About Dataset Preparation Our image dataset is randomly sampled from the following datasets: 2 view at source ↗

**Figure 12.** Figure 12: Examples of the diverse image domains in our image dataset. 11. POCA on Larger Model Focusing on visual text generation, we use the state-ofthe-art AnyText for our main experiments. To demonstrate that POCA is model-agnostic, we also evaluate it using the more recent Glyph-SDXL-v2 view at source ↗

**Figure 13.** Figure 13: Rule-based instructions utilized with the Gemini 2.5 model to generate high-fidelity prompts for the POCA-20k dataset. other baselines in view at source ↗

**Figure 14.** Figure 14: General qualitative comparison of POCA and other baselines. view at source ↗

**Figure 15.** Figure 15: Qualitative comparison with standard GRPO baseline. view at source ↗

**Figure 16.** Figure 16: Evaluation of complex instruction-following capability. view at source ↗

read the original abstract

Current visual text generation models struggle with the trade-off between text accuracy and overall image coherence. We find that achieving high text accuracy can reduce aesthetic quality and instruction-following capability. Although reinforcement learning approaches can alleviate the problem through aligning with multiple rewards, they are often unstable for text generation, as existing approaches normally optimize multiple rewards in a weighted-sum way. In addition, it is difficult to balance the weight of each reward. Moreover, reinforcement learning requires a set of training instructions. A large number of prompts require more training time and computing resources, while a small set leads to poor performance. Hence, how to select the prompts for efficient training is an unsolved problem. In this study, we propose Pareto-Optimal Curriculum Alignment (POCA), a framework that addresses this issue as a multi-objective problem by: 1) identifying the Pareto-optimal set to avoid simple scalarization and 2) designing an adaptive curriculum alignment strategy to manage a learning sequence of a multi-reward dataset using automatic difficulty assessment, which is crucial for optimal convergence as RL methods explore in a limited data environment. In synergy, POCA finds the Pareto-optimal set in a unified reward space, which eliminates inconsistent signals to find the best trade-off solution from different rewards under an easy-to-hard optimization landscape. The experimental results show that POCA significantly improves all metrics such as CLIP, HPS scores and sentence accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

POCA frames multi-reward RL for visual text generation with Pareto sets and prompt curriculum, but the abstract gives no experiments or numbers to check the claimed gains.

read the letter

The main thing here is that POCA tries to fix the usual problems in visual text generation by treating it as a multi-objective RL task: it builds a Pareto-optimal set in unified reward space instead of a weighted sum, then adds an adaptive curriculum that orders prompts by automatic difficulty assessment. This is meant to remove conflicting gradient signals and stabilize training when you only have a limited set of prompts. The framing makes sense on the surface because text accuracy often pulls against image quality and instruction following, and picking the right training examples matters for RL efficiency. The paper does a solid job spelling out those tensions and why scalarization falls short. It also gives credit to the practical constraints of limited data and compute in this domain. That part feels honest and grounded in the actual use case. The soft spots are straightforward. The abstract asserts clear wins on CLIP, HPS, and sentence accuracy, yet supplies zero experimental details, baselines, ablations, or even rough numbers. Without seeing how the Pareto front is actually computed, how difficulty is scored, or what the training curves look like, the central claim stays untestable. The RL instability issue is acknowledged but the proposed fixes are described at a high level only, so it is hard to judge whether they really deliver stable convergence or just shift the problem. This is the kind of work that would interest people building text-in-image models or anyone doing multi-objective RL in vision with small prompt sets. A reader who already knows the trade-off literature could pick up the specific combination of Pareto identification plus curriculum and see if it fits their setup. I would send it for peer review so the experiments can be checked properly; the idea is coherent enough that it deserves a look at the full evidence rather than a desk reject.

Referee Report

2 major / 1 minor

Summary. The paper proposes Pareto-Optimal Curriculum Alignment (POCA), a multi-objective reinforcement learning framework for visual text generation. It constructs an explicit Pareto-optimal set in a unified reward space to avoid instabilities from weighted-sum scalarization of rewards (text accuracy, aesthetic quality, instruction following), and pairs this with an adaptive curriculum that automatically assesses prompt difficulty to sequence training from easy to hard in a limited-data regime. The central claim is that this combination yields stable convergence and significant gains on CLIP score, HPS, and sentence accuracy.

Significance. If the empirical claims are substantiated with proper controls, POCA could meaningfully advance multi-reward alignment techniques for generative models by replacing ad-hoc scalarization with Pareto optimization and using curriculum learning to stabilize training under prompt scarcity. The approach directly targets documented issues of inconsistent gradients and poor prompt selection in RL for text-to-image tasks, offering a conceptually coherent alternative whose value hinges on rigorous validation.

major comments (2)

[Abstract and experimental results section] Abstract and experimental results section: The manuscript asserts that POCA 'significantly improves all metrics such as CLIP, HPS scores and sentence accuracy,' yet supplies no baselines, ablation studies, quantitative tables, statistical tests, or experimental protocol. This absence makes it impossible to evaluate whether the Pareto set or curriculum components drive the claimed gains, which is load-bearing for the paper's primary contribution.
[Method description (high-level strategy paragraphs)] Method description (high-level strategy paragraphs): The construction of the Pareto-optimal set in unified reward space and the automatic difficulty assessment for curriculum ordering are presented only conceptually, with no equations, pseudocode, or algorithmic details. Without these, it cannot be verified that the method eliminates inconsistent signals or produces the claimed easy-to-hard optimization landscape.

minor comments (1)

[Abstract] The abstract refers to 'a set of training instructions' and 'limited data environment' without indicating the prompt count, selection criteria, or RL algorithm used; adding these specifics would improve context even if full experiments are added later.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important areas where additional rigor and detail are needed to substantiate our claims. We have revised the manuscript to address both major comments by expanding the experimental section and providing formal method details.

read point-by-point responses

Referee: [Abstract and experimental results section] Abstract and experimental results section: The manuscript asserts that POCA 'significantly improves all metrics such as CLIP, HPS scores and sentence accuracy,' yet supplies no baselines, ablation studies, quantitative tables, statistical tests, or experimental protocol. This absence makes it impossible to evaluate whether the Pareto set or curriculum components drive the claimed gains, which is load-bearing for the paper's primary contribution.

Authors: We acknowledge that the original submission presented only high-level claims without supporting experimental details. In the revised manuscript, we have added a dedicated experimental results section that includes: (1) multiple baselines such as standard multi-reward RL using weighted-sum scalarization, single-reward PPO, and non-curriculum variants; (2) ablation studies that isolate the Pareto-optimal set construction from the adaptive curriculum; (3) quantitative tables reporting CLIP, HPS, and sentence accuracy scores with means and standard deviations across multiple runs; (4) statistical significance tests (paired t-tests with p-values); and (5) a full experimental protocol detailing hyperparameters, prompt sets, and training procedures. These additions allow direct assessment of each component's contribution. revision: yes
Referee: [Method description (high-level strategy paragraphs)] Method description (high-level strategy paragraphs): The construction of the Pareto-optimal set in unified reward space and the automatic difficulty assessment for curriculum ordering are presented only conceptually, with no equations, pseudocode, or algorithmic details. Without these, it cannot be verified that the method eliminates inconsistent signals or produces the claimed easy-to-hard optimization landscape.

Authors: We agree the original method description remained at a conceptual level. The revised manuscript now includes: (1) formal equations defining the unified reward space and the Pareto dominance relation used to construct the explicit Pareto-optimal set; (2) the multi-objective optimization objective that avoids weighted-sum scalarization; (3) pseudocode for the Pareto set extraction algorithm; and (4) pseudocode plus equations for the adaptive curriculum, including the automatic difficulty scoring function based on prompt complexity features and the sequencing logic that enforces easy-to-hard progression. These additions mathematically ground how inconsistent gradient signals are mitigated and how the curriculum shapes the optimization landscape. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper describes POCA at the level of a high-level algorithmic framework: constructing a Pareto-optimal set in unified reward space to avoid scalarization, combined with an adaptive curriculum that orders prompts by automatic difficulty assessment. No equations, derivations, or parameter-fitting steps are presented that reduce any claimed prediction or result to its own inputs by construction. The central claims rest on the empirical observation that this combination yields better CLIP/HPS/sentence-accuracy trade-offs; those outcomes are not shown to be tautological with the method definition itself. No self-citation chains, uniqueness theorems, or ansatzes imported from prior author work are invoked as load-bearing justifications. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The proposal rests on standard assumptions from multi-objective optimization and curriculum learning in RL; no new entities or fitted parameters are introduced in the abstract.

axioms (2)

domain assumption A Pareto-optimal set can be identified in a unified multi-reward space to eliminate inconsistent signals
Invoked when describing how POCA finds the best trade-off solution.
domain assumption Automatic difficulty assessment can produce an effective easy-to-hard learning sequence for RL convergence
Central to the adaptive curriculum alignment strategy.

pith-pipeline@v0.9.0 · 5567 in / 1211 out tokens · 39580 ms · 2026-05-08T04:34:17.058380+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

61 extracted references · 20 canonical work pages · 8 internal anchors

[1]

Multi-objective optimum design concepts and methods.Introduction to optimum design, pages 657– 679, 2012

Jasbir S Arora. Multi-objective optimum design concepts and methods.Introduction to optimum design, pages 657– 679, 2012. 2

2012
[2]

Textdiffuser: Diffusion models as text painters.Advances in Neural Information Processing Sys- tems, 36:9353–9387, 2023

Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, and Furu Wei. Textdiffuser: Diffusion models as text painters.Advances in Neural Information Processing Sys- tems, 36:9353–9387, 2023. 6, 7

2023
[3]

Textdiffuser-2: Unleashing the power of language models for text rendering

Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, and Furu Wei. Textdiffuser-2: Unleashing the power of language models for text rendering. InEuropean Confer- ence on Computer Vision, pages 386–402. Springer, 2024. 2

2024
[4]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blis- tein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025. 6, 3

work page internal anchor Pith review arXiv 2025
[5]

Curriculum direct prefer- ence optimization for diffusion and consistency models

Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, Nicu Sebe, and Mubarak Shah. Curriculum direct prefer- ence optimization for diffusion and consistency models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2824–2834, 2025. 3

2025
[6]

What is the objective of reasoning with reinforcement learning?arXiv preprint arXiv:2510.13651,

Damek Davis and Benjamin Recht. What is the objective of reasoning with reinforcement learning?arXiv preprint arXiv:2510.13651, 2025. 5

work page arXiv 2025
[7]

Postermaker: Towards high-quality product poster generation with accurate text rendering

Yifan Gao, Zihang Lin, Chuanbin Liu, Min Zhou, Tiezheng Ge, Bo Zheng, and Hongtao Xie. Postermaker: Towards high-quality product poster generation with accurate text rendering. InProceedings of the Computer Vision and Pat- tern Recognition Conference, pages 8083–8093, 2025. 1

2025
[8]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 3, 4

work page internal anchor Pith review arXiv 2025
[9]

Synthetic data for text localisation in natural images

Ankush Gupta, Andrea Vedaldi, and Andrew Zisserman. Synthetic data for text localisation in natural images. InPro- ceedings of the IEEE conference on computer vision and pat- tern recognition, pages 2315–2324, 2016. 6, 3

2016
[10]

On the power of cur- riculum learning in training deep networks

Guy Hacohen and Daphna Weinshall. On the power of cur- riculum learning in training deep networks. InInternational conference on machine learning, pages 2535–2544. PMLR,
[11]

Improving diffusion models for scene text editing with dual encoders

Jiabao Ji, Guanhua Zhang, Zhaowen Wang, Bairu Hou, Zhifei Zhang, Brian Price, and Shiyu Chang. Improving diffusion models for scene text editing with dual encoders. arXiv preprint arXiv:2304.05568, 2023. 2

work page arXiv 2023
[12]

Any non-welfarist method of policy assessment violates the pareto principle

Louis Kaplow and Steven Shavell. Any non-welfarist method of policy assessment violates the pareto principle. Journal of Political Economy, 109(2):281–286, 2001. 2

2001
[13]

Pick-a-pic: An open dataset of user preferences for text-to-image generation

Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Ma- tiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36: 36652–36663, 2023. 3

2023
[14]

Adap- tive curriculum learning

Yajing Kong, Liu Liu, Jun Wang, and Dacheng Tao. Adap- tive curriculum learning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 5067– 5076, 2021. 3

2021
[15]

Flux-text: A simple and advanced diffusion transformer baseline for scene text editing.arXiv preprint arXiv:2505.03329, 2025

Rui Lan, Yancheng Bai, Xu Duan, Mingxing Li, Dongyang Jin, Ryan Xu, Lei Sun, and Xiangxiang Chu. Flux-text: A simple and advanced diffusion transformer baseline for scene text editing.arXiv preprint arXiv:2505.03329, 2025. 2

work page arXiv 2025
[16]

Parrot: Pareto-optimal multi-reward reinforce- ment learning framework for text-to-image generation

Seung Hyun Lee, Yinxiao Li, Junjie Ke, Innfarn Yoo, Han Zhang, Jiahui Yu, Qifei Wang, Fei Deng, Glenn Entis, Jun- feng He, et al. Parrot: Pareto-optimal multi-reward reinforce- ment learning framework for text-to-image generation. In European Conference on Computer Vision, pages 462–478. Springer, 2024. 4, 1

2024
[17]

MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

Junzhe Li, Yutao Cui, Tao Huang, Yinping Ma, Chun Fan, Miles Yang, and Zhao Zhong. Mixgrpo: Unlocking flow- based grpo efficiency with mixed ode-sde.arXiv preprint arXiv:2507.21802, 2025. 3

work page internal anchor Pith review arXiv 2025
[18]

Limr: Less is more for rl scaling.arXiv preprint arXiv:2502.11886, 2025

Xuefeng Li, Haoyang Zou, and Pengfei Liu. Limr: Less is more for rl scaling.arXiv preprint arXiv:2502.11886, 2025. 5

work page arXiv 2025
[19]

Pareto set learning for expensive multi-objective optimiza- tion.Advances in Neural Information Processing Systems, 35:19231–19247, 2022

Xi Lin, Zhiyuan Yang, Xiaoyuan Zhang, and Qingfu Zhang. Pareto set learning for expensive multi-objective optimiza- tion.Advances in Neural Information Processing Systems, 35:19231–19247, 2022. 4

2022
[20]

Visual instruction tuning.Advances in Neural Information Processing Systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in Neural Information Processing Systems, 36:34892–34916, 2023. 3

2023
[21]

Flow-GRPO: Training Flow Matching Models via Online RL

Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via on- line rl.arXiv preprint arXiv:2505.05470, 2025. 1, 3

work page internal anchor Pith review arXiv 2025
[22]

Character-aware models improve visual text rendering.arXiv preprint arXiv:2212.10562, 2022

Rosanne Liu, Dan Garrette, Chitwan Saharia, William Chan, Adam Roberts, Sharan Narang, Irina Blok, RJ Mical, Mo- hammad Norouzi, and Noah Constant. Character-aware models improve visual text rendering.arXiv preprint arXiv:2212.10562, 2022. 2

work page arXiv 2022
[23]

Glyph-byt5: A customized text encoder for accurate visual text rendering

Zeyu Liu, Weicong Liang, Zhanhao Liang, Chong Luo, Ji Li, Gao Huang, and Yuhui Yuan. Glyph-byt5: A customized text encoder for accurate visual text rendering. InEuropean Conference on Computer Vision, pages 361–377. Springer,
[24]

arXiv preprint arXiv:2406.10208 , year=

Zeyu Liu et al. Glyph-byt5-v2: A strong aesthetic base- line for accurate multilingual visual text rendering.arXiv preprint arXiv:2406.10208, 2024. 7

work page arXiv 2024
[25]

Glyphdraw2: Automatic generation of complex glyph posters with diffusion models and large language models

Jian Ma, Yonglin Deng, Chen Chen, Nanyang Du, Haonan Lu, and Zhenyu Yang. Glyphdraw2: Automatic generation of complex glyph posters with diffusion models and large language models. InProceedings of the AAAI Conference on Artificial Intelligence, pages 5955–5963, 2025. 1, 6, 7

2025
[26]

Subject-driven text-to-image generation via preference-based reinforcement learning.Advances in Neural Information Processing Sys- tems, 37:123563–123591, 2024

Yanting Miao, William Loh, Suraj Kothawade, Pascal Poupart, Abdullah Rashwan, and Yeqing Li. Subject-driven text-to-image generation via preference-based reinforcement learning.Advances in Neural Information Processing Sys- tems, 37:123563–123591, 2024. 2, 8, 3

2024
[27]

Boost your human image generation model via direct preference 9 optimization

Sanghyeon Na, Yonggyu Kim, and Hyunjoon Lee. Boost your human image generation model via direct preference 9 optimization. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 23551–23562, 2025. 3

2025
[28]

Reinforcement learning by reward-weighted regression for operational space control

Jan Peters and Stefan Schaal. Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th international conference on Machine learning, pages 745–750, 2007. 3

2007
[29]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 2, 6

2021
[30]

Direct preference optimization: Your language model is secretly a reward model.Advances in Neural Information Processing Systems, 36:53728–53741, 2023

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christo- pher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model.Advances in Neural Information Processing Systems, 36:53728–53741, 2023. 3

2023
[31]

Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020. 2

2020
[32]

Seedream 4.0: Toward Next-generation Multimodal Image Generation

Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, et al. Seedream 4.0: Toward next- generation multimodal image generation.arXiv preprint arXiv:2509.20427, 2025. 1

work page internal anchor Pith review arXiv 2025
[33]

Baoguang Shi, Xiang Bai, and Cong Yao. An end-to-end trainable neural network for image-based sequence recog- nition and its application to scene text recognition.IEEE transactions on pattern analysis and machine intelligence, 39(11):2298–2304, 2016. 6

2016
[34]

Anytext: Multilingual visual text generation and editing.arXiv preprint arXiv:2311.03054, 2023

Yuxiang Tuo, Wangmeng Xiang, Jun-Yan He, Yifeng Geng, and Xuansong Xie. Anytext: Multilingual visual text gen- eration and editing.arXiv preprint arXiv:2311.03054, 2023. 1, 2, 6, 7, 3

work page arXiv 2023
[35]

Anytext2: Vi- sual text generation and editing with customizable attributes

Yuxiang Tuo, Yifeng Geng, and Liefeng Bo. Anytext2: Vi- sual text generation and editing with customizable attributes. arXiv preprint arXiv:2411.15245, 2024. 6, 7

work page arXiv 2024
[36]

Diffusion model align- ment using direct preference optimization

Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model align- ment using direct preference optimization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024. 3

2024
[37]

Describing like hu- mans: on diversity in image captioning

Qingzhong Wang and Antoni B Chan. Describing like hu- mans: on diversity in image captioning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4195–4203, 2019. 3

2019
[38]

On diversity in image captioning: Metrics and methods.IEEE Transac- tions on Pattern Analysis and Machine Intelligence, 44(2): 1035–1049, 2020

Qingzhong Wang, Jia Wan, and Antoni B Chan. On diversity in image captioning: Metrics and methods.IEEE Transac- tions on Pattern Analysis and Machine Intelligence, 44(2): 1035–1049, 2020. 3

2020
[39]

A survey on curriculum learning.IEEE transactions on pattern analysis and machine intelligence, 44(9):4555–4576, 2021

Xin Wang, Yudong Chen, and Wenwu Zhu. A survey on curriculum learning.IEEE transactions on pattern analysis and machine intelligence, 44(9):4555–4576, 2021. 3

2021
[40]

Qwen-Image Technical Report

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025. 1

work page internal anchor Pith review arXiv 2025
[41]

When do curricula work?arXiv preprint arXiv:2012.03107,

Xiaoxia Wu, Ethan Dyer, and Behnam Neyshabur. When do curricula work?arXiv preprint arXiv:2012.03107, 2020. 3

work page arXiv 2012
[42]

Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis.arXiv preprint arXiv:2306.09341,

work page internal anchor Pith review arXiv
[43]

Imagere- ward: Learning and evaluating human preferences for text- to-image generation.Advances in Neural Information Pro- cessing Systems, 36:15903–15935, 2023

Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagere- ward: Learning and evaluating human preferences for text- to-image generation.Advances in Neural Information Pro- cessing Systems, 36:15903–15935, 2023. 3

2023
[44]

DanceGRPO: Unleashing GRPO on Visual Generation

Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. Dancegrpo: Unleashing grpo on visual generation.arXiv preprint arXiv:2505.07818, 2025. 1, 2, 3, 7

work page internal anchor Pith review arXiv 2025
[45]

Glyphcontrol: Glyph conditional control for visual text generation.Advances in Neural Information Processing Systems, 36:44050–44066,

Yukang Yang, Dongnan Gui, Yuhui Yuan, Weicong Liang, Haisong Ding, Han Hu, and Kai Chen. Glyphcontrol: Glyph conditional control for visual text generation.Advances in Neural Information Processing Systems, 36:44050–44066,
[46]

Pareto optimization for active learning under out-of-distribution data scenarios

Xueying Zhan, Zeyu Dai, Qingzhong Wang, Qing Li, Haoyi Xiong, Dejing Dou, and Antoni B Chan. Pareto optimization for active learning under out-of-distribution data scenarios. Transactions on Machine Learning Research, 2022. 3

2022
[47]

A compar- ative survey of deep active learning.arXiv preprint arXiv:2203.13450, 2022

Xueying Zhan, Qingzhong Wang, Kuan-hao Huang, Haoyi Xiong, Dejing Dou, and Antoni B Chan. A compar- ative survey of deep active learning.arXiv preprint arXiv:2203.13450, 2022. 3

work page arXiv 2022
[48]

How control information influences multilingual text image generation and editing?Advances in Neural Information Processing Systems, 37:6884–6904, 2024

Boqiang Zhang, Zuan Gao, Yadong Qu, and Hongtao Xie. How control information influences multilingual text image generation and editing?Advances in Neural Information Processing Systems, 37:6884–6904, 2024. 6, 7

2024
[49]

Adding conditional control to text-to-image diffusion models

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3836–3847, 2023. 6, 7

2023
[50]

Brush your text: Synthesize any scene text on im- ages via diffusion model

Lingjun Zhang, Xinyuan Chen, Yaohui Wang, Yue Lu, and Yu Qiao. Brush your text: Synthesize any scene text on im- ages via diffusion model. InProceedings of the AAAI Con- ference on Artificial Intelligence, pages 7215–7223, 2024. 2

2024
[51]

arXiv preprint arXiv:2503.21749 , year=

Shitian Zhao, Qilong Wu, Xinyue Li, Bo Zhang, Ming Li, Qi Qin, Dongyang Liu, Kaipeng Zhang, Hongsheng Li, Yu Qiao, et al. Lex-art: Rethinking text generation via scalable high-quality data synthesis.arXiv preprint arXiv:2503.21749, 2025. 6, 3

work page arXiv 2025
[52]

Udifftext: A unified frame- work for high-quality text synthesis in arbitrary images via character-aware diffusion models

Yiming Zhao and Zhouhui Lian. Udifftext: A unified frame- work for high-quality text synthesis in arbitrary images via character-aware diffusion models. InEuropean conference on computer vision, pages 217–233. Springer, 2024. 1, 2 10 POCA: Pareto-Optimal Curriculum Alignment for Visual Text Generation Supplementary Material

2024
[53]

The allowed operations include character insertions, deletions, and substitutions, each contributing a unit cost to the total edit distance

Normalized Edit Distance Edit Distance (ED), also known as Levenshtein distance, measures the minimum number of operations required to transform one string into another. The allowed operations include character insertions, deletions, and substitutions, each contributing a unit cost to the total edit distance. This metric is widely used for text similarity...
[54]

Additionally, we investigate the impact of negative samples by evaluating a fully domi- nated sorting baseline

Pareto Set Comparison To validate the effectiveness of our bi-directional strategy, we compare it against Parrot [16], which employs one- directional non-dominated sorting to update the policy us- ing only the best samples. Additionally, we investigate the impact of negative samples by evaluating a fully domi- nated sorting baseline. Following the configu...
[55]

For each reward model, we compute scores for every sample and plot the ECDFs together with selected quantiles, as shown in Fig

Variance Analysis of Reward Models To justify our choice of the OCR reward as the difficulty measure in the curriculum, we compare the distributions of all three reward signals (OCR, CLIP, and HPS) over the full training set. For each reward model, we compute scores for every sample and plot the ECDFs together with selected quantiles, as shown in Fig. 11....
[56]

More Details About Dataset Preparation Our image dataset is randomly sampled from the following datasets: 2 Table 4.Comparisons between POCA and the counterparts. Methods English Chinese Sen.ACC↑NED↑CLIP score↑HPS score↑ Sen.ACC↑NED↑CLIP score↑HPS score↑ RPO-Harmonic 0.7400 0.88750.90290.2678 0.6908 0.8684 0.81550.2672 Curriculum-DPO 0.7268 0.8866 0.8962 ...

work page arXiv
[57]

To demon- strate that POCA is model-agnostic, we also evaluate it using the more recent Glyph-SDXL-v2

POCA on Larger Model Focusing on visual text generation, we use the state-of- the-art AnyText for our main experiments. To demon- strate that POCA is model-agnostic, we also evaluate it using the more recent Glyph-SDXL-v2. Table 4 shows the results. Obviously, using Bi-directional Pareto sorting (Pareto-guided-SDXL) can significantly improve the per- form...
[58]

More Comparisons with Related Works In this section, we further compare POCA with addi- tional related methods, including 1) the weighted-sum ap- proach RPO [26] and 2) the DPO-based curriculum design, Curriculum-DPO [5]. RPO proposes using the harmonic mean instead of a naive weighted-sum approach to aggregate different re- wards in a two-reward setting,...
[59]

Un- like RPO and other weighted-sum methods, POCA avoids the difficulty of balancing aggregation hyperparameters

As shown in Table 4, POCA outper- forms RPO on multiple metrics, especially Sen.ACC. Un- like RPO and other weighted-sum methods, POCA avoids the difficulty of balancing aggregation hyperparameters. While Curriculum-DPO builds an easy-to-hard learning path by ranking candidate samples with a single reward model and progressively training on preference pai...
[60]

Performing inference on the entire set of prompts for difficulty measurement requires∼15 hours with 8 GPUs

Assessment of computational overhead: Generating the 20k training prompts using Gemini 2.5 took∼40 hours. Performing inference on the entire set of prompts for difficulty measurement requires∼15 hours with 8 GPUs
[61]

We first provide more examples for comparison with 3 Instructions for Prompt Generation Input: <Image>, <Text>You are an expert prompt engineer for Stable Diffusion

More Visual Examples We show additional visual examples of POCA in this sec- tion. We first provide more examples for comparison with 3 Instructions for Prompt Generation Input: <Image>, <Text>You are an expert prompt engineer for Stable Diffusion. Analyze the provided image and generate a high-quality that works for SD1.5 based model. Critical Rule:1. Th...