pith. machine review for the scientific record.

arxiv: 2605.13155 · v1 · submitted 2026-05-13 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Pareto-Guided Optimal Transport for Multi-Reward Alignment

Bing Su, Guiwei Zhang, Ji-Rong Wen, Mohan Zhou, Tianyu Zhang, Wenyi Mo, Yalong Bai, Ying Ba

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:19 UTC · model grok-4.3

classification 💻 cs.CV
keywords reward models · hacking · multi-reward · optimal · optimization · rate · transport

The pith

PG-OT builds prompt-specific Pareto frontiers and applies distribution-aware optimal transport to improve multi-reward alignment while introducing JDR and JCR metrics to measure synergy and hacking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Text-to-image AI models improve by optimizing against reward models that score how well images match human preferences. When several rewards are used together they often conflict, and simply adding them with weights requires extensive trial-and-error tuning. Worse, the model can increase the reward scores while actually producing worse-looking images, a problem called reward hacking. The new method first finds, for each text prompt, the set of best possible trade-off points among the rewards; this set is called the Pareto frontier. It then uses optimal transport, a technique that finds the cheapest way to move one group of points to another while respecting the overall shape of the distribution, to push generated images toward those good frontier points. Separate online and offline versions handle cases where reward signals are available during generation or only afterward. To judge success, the authors define Joint Domination Rate, which counts how often one method beats others across all rewards at once, and Joint Collapse Rate, which detects when rewards are being gamed. Experiments report an 11 percent lift in JDR and a near-80 percent human-preference win rate over baselines.
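
To make the new quantities concrete, here is a minimal numerical sketch of a per-prompt Pareto frontier and a JDR-style count, assuming reward vectors have already been collected for each sample. The function names and the exact JDR formula are illustrative stand-ins, not the paper's implementation.

```python
import numpy as np

def pareto_frontier(rewards: np.ndarray) -> np.ndarray:
    """Indices of non-dominated rows in a (samples, num_rewards) matrix.
    A sample is dominated if some other sample is >= on every reward and > on at least one."""
    n = rewards.shape[0]
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        dominated_by = np.all(rewards >= rewards[i], axis=1) & np.any(rewards > rewards[i], axis=1)
        if dominated_by.any():
            keep[i] = False
    return np.flatnonzero(keep)

def joint_domination_rate(ours: np.ndarray, baseline: np.ndarray) -> float:
    """Fraction of prompts where one method's reward vector beats the other's on every reward
    simultaneously; a toy stand-in for the paper's JDR, not its exact formula."""
    return float(np.mean(np.all(ours > baseline, axis=1)))

# Toy example: 50 samples of one prompt scored by 2 reward models.
rng = np.random.default_rng(0)
per_prompt_rewards = rng.normal(size=(50, 2))
frontier_idx = pareto_frontier(per_prompt_rewards)
print(f"{frontier_idx.size} of 50 samples sit on this prompt's Pareto frontier")

# Toy JDR over 20 prompts and 3 rewards, comparing two hypothetical methods.
ours, base = rng.normal(0.1, 1.0, size=(20, 3)), rng.normal(size=(20, 3))
print("toy JDR:", joint_domination_rate(ours, base))
```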

Core claim

Experimental results show that our approach outperforms strong baselines with an 11% gain in JDR and achieves a near 80% win rate in human evaluations.

Load-bearing premise

That a prompt-specific Pareto frontier can be constructed reliably from the available reward models and that mapping samples to it via optimal transport will consistently reduce reward hacking without introducing new instabilities or excessive compute cost.
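
The sketch below illustrates how an entropic (Sinkhorn) transport plan could softly assign generated samples to frontier points in reward space, which is the kind of distribution-aware mapping this premise relies on. The squared-distance cost, uniform marginals, regularization strength, and loss form are assumptions for illustration, not the paper's actual objective.

```python
import numpy as np

def sinkhorn_plan(cost: np.ndarray, eps: float = 0.1, iters: int = 200) -> np.ndarray:
    """Entropic-OT plan between uniform marginals for an (n, q) cost matrix (Sinkhorn iterations)."""
    n, q = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(q, 1.0 / q)
    cost = cost / cost.mean()           # rescale so exp(-cost/eps) stays numerically well behaved
    K = np.exp(-cost / eps)
    u = np.ones(n)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]  # transport matrix gamma, shape (n, q)

def ot_alignment_loss(sample_rewards: np.ndarray, frontier_rewards: np.ndarray) -> float:
    """Toy OT objective: transport-weighted squared distance between each generated sample's
    reward vector and the prompt's frontier points. Smaller means samples sit nearer the frontier."""
    diff = sample_rewards[:, None, :] - frontier_rewards[None, :, :]
    cost = np.sum(diff ** 2, axis=-1)   # (n samples, q frontier points)
    gamma = sinkhorn_plan(cost)
    return float(np.sum(gamma * cost))

rng = np.random.default_rng(1)
samples = rng.normal(size=(16, 2))            # 16 generated samples scored by 2 rewards
frontier = rng.normal(loc=1.0, size=(5, 2))   # 5 non-dominated points for this prompt
print("toy OT alignment loss:", ot_alignment_loss(samples, frontier))
```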

Figures

Figures reproduced from arXiv: 2605.13155 by Bing Su, Guiwei Zhang, Ji-Rong Wen, Mohan Zhou, Tianyu Zhang, Wenyi Mo, Yalong Bai, Ying Ba.

Figure 1
Empirical validation of heterogeneous prompt-wise reward upper bounds under the ICT reward (Ba et al., 2025). Reward distributions are estimated from 50 samples per prompt across 20 distinct prompts. view at source ↗
Figure 2
Qualitative Comparison of Optimization Results Across Different Methods. view at source ↗
Figure 3
Pareto Frontier Visualization based on strong rewards (ICT and HP) on Three Prompts. view at source ↗
Figure 4
Comparative Training Curves of Joint Domination Rate (JDR4) for Ours versus Baseline Methods. view at source ↗
Figure 5
Adaptive Decision Pipeline of the VLM-based Agent for Multi-Reward Optimization. view at source ↗
Figure 6
Broad Comparative Examples of Pareto Frontier Visualizations for Various Methods. view at source ↗
Figure 8
Visualization of Box Plots Showing Reward Variations Across Prompts on HP Score. view at source ↗
Figure 9
Training Curves of Joint Domination Rate (JDR2). view at source ↗
Figure 11
Qualitative Comparison of Optimization Results with Single-Reward Baselines. view at source ↗
Figure 12
Qualitative Comparison of Optimization Results with Multi-Reward Baselines. view at source ↗
read the original abstract

Text-to-image generation models have achieved remarkable progress in preference optimization, yet achieving robust alignment across diverse reward models remains a significant challenge. Existing multi-reward fusion approaches rely on weighted summation, which is costly to tune and insufficient for balancing conflicting objectives. More critically, optimization with reward models is highly susceptible to reward hacking, where reward scores increase while the perceived quality of generated images deteriorates. We demonstrate that optimizing against a unified global target under heterogeneous reward upper bounds can induce reward hacking, a risk further exacerbated by the inherent instability of weak reward models. To mitigate this, we propose a Pareto Frontier-Guided Optimal Transport (PG-OT) framework. Our method constructs a prompt-specific Pareto frontier and maps dominated samples toward it via distribution-aware optimal transport. Furthermore, we develop both online and offline optimization strategies tailored to diverse reward signal characteristics. To provide a more rigorous assessment, we introduce the Joint Domination Rate (JDR) and Joint Collapse Rate (JCR) as principled metrics to quantify multi-reward synergy and reward hacking. Experimental results show that our approach outperforms strong baselines with an 11% gain in JDR and achieves a near 80% win rate in human evaluations.
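
As a reading aid for the JCR metric named in the abstract, the toy sketch below flags a training run as collapsed when one reward ends above its starting value while another ends below it. This is only a hedged stand-in for the idea behind the Joint Collapse Rate; the paper's actual definition, thresholds, and aggregation are not reproduced here.

```python
import numpy as np

def joint_collapse_rate(reward_traces: np.ndarray) -> float:
    """Toy collapse detector over traces of shape (runs, steps, num_rewards): flag a run when at
    least one reward ends above its starting value while another ends below it. A hedged stand-in
    for the paper's Joint Collapse Rate, not its actual definition."""
    start, end = reward_traces[:, 0, :], reward_traces[:, -1, :]
    some_reward_up = np.any(end > start, axis=1)     # something is still being pushed up
    some_reward_down = np.any(end < start, axis=1)   # while another reward degrades
    return float(np.mean(some_reward_up & some_reward_down))

# Toy traces: 8 training runs, 100 logged steps, 3 reward models.
rng = np.random.default_rng(2)
traces = np.cumsum(rng.normal(scale=0.05, size=(8, 100, 3)), axis=1)
print("toy JCR:", joint_collapse_rate(traces))
```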

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Pareto Frontier-Guided Optimal Transport (PG-OT) for multi-reward alignment in text-to-image models. It constructs a prompt-specific Pareto frontier from available reward models and maps generated samples to this frontier via distribution-aware optimal transport, with online and offline optimization variants. New metrics Joint Domination Rate (JDR) and Joint Collapse Rate (JCR) are introduced to quantify multi-reward synergy and reward hacking. Experiments report an 11% JDR gain over baselines and a near-80% win rate in human evaluations.

Significance. If the central claims hold, the framework offers a principled alternative to weighted-sum reward fusion that directly targets non-dominated trade-offs, which could improve robustness to reward hacking in heterogeneous multi-objective settings. The JDR/JCR metrics provide a more structured evaluation lens than single-reward scores.

major comments (3)
  1. [Abstract / Methods] Abstract and Methods: The 11% JDR gain and ~80% human win rate rest on reliable construction of prompt-specific Pareto frontiers from the given reward models. No quantitative diagnostics (frontier coverage, sensitivity to reward scaling or correlations, or ablation on frontier quality) are reported, leaving open whether the gains are artifacts of the particular reward set rather than a general property of PG-OT.
  2. [Experiments] Experiments: The human-study win rate lacks reported details on number of raters, inter-rater agreement, prompt sampling procedure, and statistical controls. Without these, it is difficult to assess whether the 80% figure generalizes beyond the tested prompts or is inflated by evaluation confounds.
  3. [Methods] Methods (online/offline strategies): The mapping via optimal transport is claimed to reduce reward hacking without introducing new instabilities or excessive cost, yet no analysis of computational overhead, convergence behavior under weak reward models, or comparison of online vs. offline variants on JCR is provided to support this.
minor comments (2)
  1. [Methods] Notation for the Pareto frontier and transport plan should be introduced with an explicit equation early in the Methods section for clarity.
  2. [Related Work] The paper should cite prior work on Pareto optimization in multi-objective RL and optimal transport applications in generative modeling.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract / Methods] Abstract and Methods: The 11% JDR gain and ~80% human win rate rest on reliable construction of prompt-specific Pareto frontiers from the given reward models. No quantitative diagnostics (frontier coverage, sensitivity to reward scaling or correlations, or ablation on frontier quality) are reported, leaving open whether the gains are artifacts of the particular reward set rather than a general property of PG-OT.

    Authors: We acknowledge that additional diagnostics on Pareto frontier construction would strengthen the presentation. In the revised manuscript we will add quantitative metrics for frontier coverage, sensitivity analysis to reward scaling and correlations, and an ablation on frontier quality obtained by varying the reward-model subset. These additions will help confirm that the reported gains are not artifacts of the specific reward set. revision: yes

  2. Referee: [Experiments] Experiments: The human-study win rate lacks reported details on number of raters, inter-rater agreement, prompt sampling procedure, and statistical controls. Without these, it is difficult to assess whether the 80% figure generalizes beyond the tested prompts or is inflated by evaluation confounds.

    Authors: We agree that these experimental details are necessary for proper evaluation of the human-study results. In the revised manuscript we will expand the Experiments section to report the number of raters, inter-rater agreement, prompt sampling procedure, and statistical controls employed. revision: yes

  3. Referee: [Methods] Methods (online/offline strategies): The mapping via optimal transport is claimed to reduce reward hacking without introducing new instabilities or excessive cost, yet no analysis of computational overhead, convergence behavior under weak reward models, or comparison of online vs. offline variants on JCR is provided to support this.

    Authors: We note that the current manuscript already shows JCR reductions for the proposed method, yet we did not include explicit overhead or convergence analysis. In the revision we will add runtime benchmarks, convergence plots under varying reward-model quality, and a side-by-side JCR comparison of the online and offline variants. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper introduces the PG-OT framework by constructing prompt-specific Pareto frontiers from reward models and applying distribution-aware optimal transport to map samples, along with online/offline optimization variants. It defines the new metrics JDR and JCR independently to quantify multi-reward synergy and reward hacking. The reported 11% JDR gain and human win rates are presented as empirical results from applying the method to baselines, not as quantities that define or are fitted into the method itself. No equations reduce by construction to inputs, no predictions are statistically forced from fits, and no load-bearing self-citations or uniqueness theorems are invoked in the provided text. The derivation from method description to metrics to experimental outcomes remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are identifiable from the provided text.

pith-pipeline@v0.9.0 · 5526 in / 1139 out tokens · 45104 ms · 2026-05-14T19:19:39.897493+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 2 internal anchors

  1. [1] Bengio, Yoshua; LeCun, Yann. Scaling Learning Algorithms Towards AI.
  2. [2] Hinton, Geoffrey E.; Osindero, Simon; Teh, Yee Whye. A Fast Learning Algorithm for Deep Belief Nets.
  3. [3] Deep Learning. 2016.
  4. [4] Eyring, Luca; Karthik, Shyamgopal; Roth, Karsten; Dosovitskiy, Alexey; Akata, Zeynep. ReNO: Enhancing One-step Text-to-Image Models through Reward-based Noise Optimization. 2024.
  5. [5] Li, Yanyu; Liu, Xian; Kag, Anil; Hu, Ju; Idelbayev, Yerlan; Sagar, Dhritiman; Wang, Yanzhi; Tulyakov, Sergey; Ren, Jian. 2024. doi:10.1109/CVPR52733.2024.00763.
  6. [6] Lee, Seung Hyun; Li, Yinxiao; Ke, Junjie; Yoo, Innfarn; Zhang, Han; Yu, Jiahui; Wang, Qifei; Deng, Fei; Entis, Glenn; He, Junfeng; Li, Gang; Kim, Sangpil; Essa, Irfan; Yang, Feng. Parrot: Pareto-Optimal Multi-reward Reinforcement Learning Framework for Text-to-Image Generation. 2024.
  7. [7] Lee, Kyungmin; Li, Xiahong; Wang, Qifei; He, Junfeng; Ke, Junjie; et al. Calibrated Multi-Preference Optimization for Aligning Diffusion Models. 2025. doi:10.1109/CVPR52734.2025.01721.
  8. [8] Tamboli, Dipesh; Chakraborty, Souradip; Malusare, Aditya; Banerjee, Biplab; Bedi, Amrit Singh; Aggarwal, Vaneet. BalancedDPO: Adaptive Multi-Metric Alignment. CoRR, 2025. arXiv:2503.12575.
  9. [9] Radford, Alec; Kim, Jong Wook; Hallacy, Chris; Ramesh, Aditya; Goh, Gabriel; Agarwal, Sandhini; Sastry, Girish; Askell, Amanda; Mishkin, Pamela; Clark, Jack; Krueger, Gretchen; Sutskever, Ilya. Learning Transferable Visual Models From Natural Language Supervision. 2021.
  10. [10] Li, Junnan; Li, Dongxu; Xiong, Caiming; Hoi, Steven C. H. International Conference on Machine Learning. 2022.
  11. [11] Enhancing reward models for high-quality image generation: Beyond text-image alignment. Proceedings of the IEEE/CVF International Conference on Computer Vision.
  12. [12] GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning. 2026.
  13. [13] Qwen3-VL Technical Report. 2025.
  14. [14] Xu, Jiazheng; Liu, Xiao; Wu, Yuchen; Tong, Yuxuan; Li, Qinkai; Ding, Ming; Tang, Jie; Dong, Yuxiao. ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation. 2023.
  15. [15] Wu, Xiaoshi; Hao, Yiming; Sun, Keqiang; Chen, Yixiong; Zhu, Feng; Zhao, Rui; Li, Hongsheng. Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis. CoRR, 2023. arXiv:2306.09341.
  16. [16] Kirstain, Yuval; Polyak, Adam; Singer, Uriel; Matiana, Shahbuland; Penna, Joe; Levy, Omer. Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation. 2023.
  17. [17] Wallace, Bram; Dang, Meihua; Rafailov, Rafael; Zhou, Linqi; Lou, Aaron; Purushwalkam, Senthil; Ermon, Stefano; Xiong, Caiming; Joty, Shafiq; Naik, Nikhil. 2024. doi:10.1109/CVPR52733.2024.00786.
  18. [18] Cuturi, Marco. Sinkhorn Distances: Lightspeed Computation of Optimal Transport. 2013.
  19. [19] Monge, Gaspard. Histoire de l'Académie Royale des Sciences de Paris.
  20. [20] Directly Fine-Tuning Diffusion Models on Differentiable Rewards. 2024.
  21. [21] LoRA: Low-Rank Adaptation of Large Language Models. 2021.
  22. [22] Scaling Autoregressive Models for Content-Rich Text-to-Image Generation. 2022.
  23. [23] DiffusionDB: A Large-scale Prompt Gallery Dataset for Text-to-Image Generative Models. 2023.
  24. [24] Training Diffusion Models with Reinforcement Learning. 2024.
  25. [25] SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis. 2023.
  26. [26] High-Resolution Image Synthesis with Latent Diffusion Models. 2022.
  27. [27] Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. 2024.
  28. [28] Deep Unsupervised Learning using Nonequilibrium Thermodynamics. 2015.
  29. [29] Generative Modeling by Estimating Gradients of the Data Distribution. 2020.
  30. [30] Fine-Tuning Language Models from Human Preferences. 2020.
  31. [31] Learning to Summarize from Human Feedback. 2022.
  32. [32] Diffusion Models Beat GANs on Image Synthesis. 2021.
  33. [33] Large-scale Reinforcement Learning for Diffusion Models. 2024.
  34. [34] TextCraftor: Your Text Encoder Can Be Image Quality Controller. 2024.
  35. [35] Reinforcement Learning with Human Feedback: Learning Dynamic Choices via Pessimism. 2024.
  36. [36] Reinforcement Learning for Joint Optimization of Multiple Rewards. 2023.
  37. [37] Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning. 2024.
  38. [38] PRDP: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models. 2024.
  39. [39] Direct Preference Optimization: Your Language Model is Secretly a Reward Model. 2024.
  40. [40] DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models. 2023.
  41. [41] Human Preference Score: Better Aligning Text-to-Image Models with Human Preference. 2023.
  42. [42] Dynamic Multi-Reward Weighting for Multi-Style Controllable Generation. 2024.
  43. [43] On the Plasticity and Stability for Post-Training Large Language Models. 2026.
  44. [44] Group Causal Policy Optimization for Post-Training Large Language Models. 2025.
  45. [45] Learning to Think: Information-Theoretic Reinforcement Fine-Tuning for LLMs. 2025.
  46. [46] Causal Reward Adjustment: Mitigating Reward Hacking in External Reasoning via Backdoor Correction. 2025.
  47. [47] PrefGen: Multimodal Preference Learning for Preference-Conditioned Image Generation. 2025.
  48. [48] Learning User Preferences for Image Generation Model. 2025.
  49. [49] Say Cheese! Detail-Preserving Portrait Collection Generation via Natural Language Edits. 2026.