Threshold-Guided Optimization for Visual Generative Models
Pith reviewed 2026-05-08 17:10 UTC · model grok-4.3
The pith
Replacing an intractable per-sample baseline with a single global threshold allows visual generative models to be aligned using unpaired scalar feedback.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The optimal policy for KL-regularized alignment implicitly compares each sample's reward to an instance-specific baseline. Since this baseline is intractable, a global threshold estimated from empirical score statistics can be used instead. This reformulation converts the alignment problem into a binary decision on unpaired data, with a confidence weighting term to focus on informative samples. The resulting threshold-guided framework achieves improved preference alignment in visual generative models without needing paired comparisons.
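The mechanics of the claim can be sketched in a few lines. This is an assumed reading, not the paper's code: the median is used as an illustrative threshold estimator, and `threshold_guided_targets` is a hypothetical helper name.

```python
import numpy as np

def threshold_guided_targets(rewards, quantile=0.5):
    """Convert unpaired scalar rewards into binary targets plus confidence
    weights using one global threshold (illustrative sketch; the paper does
    not specify its exact estimator, so the median is assumed here)."""
    rewards = np.asarray(rewards, dtype=float)
    tau = np.quantile(rewards, quantile)        # global threshold from score statistics
    labels = (rewards > tau).astype(float)      # binary decision per sample
    scale = rewards.std() + 1e-8
    weights = np.abs(rewards - tau) / scale     # confidence: distance from threshold
    return tau, labels, weights
```

Samples whose scores sit near the threshold receive near-zero weight, which is the mechanism behind the confidence-weighting term described in the abstract.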
What carries the argument
The threshold-guided alignment framework, which estimates a data-driven global threshold from reward score statistics to replace the instance-specific baseline in the KL-regularized objective.
If this is right
- Alignment can be performed directly on scalar ratings without any need for annotated preference pairs.
- A confidence weighting term that up-weights samples far from the threshold increases sample efficiency.
- The same framework applies equally to diffusion models and masked generative models.
- Consistent gains appear over previous pair-based methods across three test sets and five reward models.
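If the binary reformulation and confidence weighting hold up, the training objective would plausibly reduce to a weighted binary cross-entropy over unpaired samples. The form below is an assumption, not the paper's published loss: the per-sample logit is taken to be a policy/reference log-probability ratio.

```python
import numpy as np

def confidence_weighted_loss(log_ratio, labels, weights):
    """Confidence-weighted binary objective on unpaired samples (a sketch).

    log_ratio plays the role of a per-sample logit, e.g. the policy-vs-
    reference log-probability ratio; this exact form is assumed."""
    log_ratio = np.asarray(log_ratio, dtype=float)
    labels = np.asarray(labels, dtype=float)
    weights = np.asarray(weights, dtype=float)
    p = 1.0 / (1.0 + np.exp(-log_ratio))  # sigmoid: predicted P(above threshold)
    eps = 1e-12
    bce = -(labels * np.log(p + eps) + (1.0 - labels) * np.log(1.0 - p + eps))
    return float(np.average(bce, weights=weights))  # weighted mean over samples
```

A model that pushes probability mass toward above-threshold samples drives this loss down, which is the binary-decision framing the core claim describes.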
Where Pith is reading between the lines
- Collecting human feedback becomes cheaper because single scalar ratings are simpler to obtain than paired comparisons.
- The same global-threshold substitution could be tested in language-model alignment where scalar rewards are already collected at scale.
- Adaptive or model-specific ways to set the threshold might further reduce the gap to the true instance baselines.
Load-bearing premise
A single global threshold derived from the overall score statistics serves as an adequate stand-in for the sample-specific baseline required by the optimal alignment policy.
What would settle it
On a dataset where true instance-specific baselines can be computed exactly, showing that optimization with the global threshold produces clearly worse alignment than optimization with the true per-sample baselines would falsify the sufficiency of the approximation.
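A toy version of that settling experiment fits in a few lines: simulate prompts with known baselines b(x), then compare how far a single global threshold sits from them versus a per-prompt empirical estimate. All numbers below are synthetic and purely illustrative.

```python
import numpy as np

def baseline_gap_experiment(n_prompts=200, n_samples=16, seed=0):
    """Synthetic check of the load-bearing premise: measure the average
    distance from one global threshold to varying per-prompt baselines,
    versus a per-prompt Monte-Carlo estimate. Illustrative only."""
    rng = np.random.default_rng(seed)
    b = rng.normal(0.0, 1.0, size=n_prompts)             # true baselines b(x)
    noise = rng.normal(0.0, 0.5, size=(n_prompts, n_samples))
    rewards = b[:, None] + noise                         # scores per prompt
    tau = np.median(rewards)                             # one global threshold
    per_prompt = rewards.mean(axis=1)                    # instance-level estimate
    global_gap = float(np.abs(tau - b).mean())
    oracle_gap = float(np.abs(per_prompt - b).mean())
    return global_gap, oracle_gap
```

When baselines vary widely across prompts, the global threshold is much coarser than the per-prompt estimate; whether that coarseness costs alignment quality is exactly what the falsification test above would measure.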
Original abstract
Aligning large visual generative models with human feedback is often performed through pairwise preference optimization. While such approaches are conceptually simple, they fundamentally rely on annotated pairs, limiting scalability in settings where feedback is collected as independent scalar ratings. In this work, we revisit the KL-regularized alignment objective and show that the optimal policy implicitly compares each sample's reward to an instance-specific baseline that is generally intractable. We propose a threshold-guided alignment framework that replaces this oracle baseline with a data-driven global threshold estimated from empirical score statistics. This formulation turns alignment into a binary decision task on unpaired data, enabling effective optimization directly from scalar feedback. We also incorporate a confidence weighting term to emphasize samples whose scores deviate strongly from the threshold, improving sample efficiency. Experiments across both diffusion and masked generative paradigms, spanning three test sets and five reward models, show that our method consistently improves preference alignment over previous methods. These results position our threshold-guided framework as a simple yet principled alternative for aligning visual generative models without paired comparisons.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that the KL-regularized alignment objective for visual generative models has an optimal policy that compares each reward r(x,y) to an intractable instance-specific baseline b(x). It proposes replacing b(x) with a single global threshold τ estimated from empirical reward score statistics on unpaired data, reformulating alignment as a binary classification task with an added confidence-weighting term that emphasizes samples far from τ. Experiments on diffusion and masked generative models across three test sets and five reward models report consistent gains in preference alignment over prior methods.
Significance. If the global-threshold substitution can be shown to be a controlled approximation rather than an ad-hoc replacement, the framework would meaningfully expand scalable alignment to scalar feedback settings that avoid the cost of collecting paired preferences. The breadth of the experimental evaluation (multiple generative paradigms and reward models) would then constitute a practical contribution, provided the source of the observed gains is isolated.
major comments (3)
- [Method (optimal policy derivation)] Method section deriving the optimal policy: the manuscript states that the global threshold τ approximates the instance-specific baseline b(x) but supplies neither an error bound nor conditions on reward variance or baseline variation across prompts x under which the empirical statistic is guaranteed to be a valid proxy; without this analysis the central substitution remains formally unsupported.
- [Experiments] Experiments section: the reported improvements over baselines are not accompanied by ablations that hold the binary framing and weighting fixed while varying the threshold choice (or vice versa), nor by comparisons against even approximate instance-level baselines; consequently it is impossible to determine whether gains arise from the proposed approximation or from other modeling choices.
- [Method (confidence weighting)] Section introducing the confidence weighting: the weighting term is motivated as emphasizing samples whose scores deviate strongly from τ, yet no analysis is given of how the weighting interacts with the approximation error of τ itself or of its effect on the effective objective relative to the original KL-regularized loss.
minor comments (2)
- Notation for the global threshold τ and the instance-specific baseline b(x) is introduced without an explicit comparison table or equation that juxtaposes the two quantities side-by-side, making it harder for readers to track the substitution.
- The abstract and introduction refer to “five reward models” and “three test sets,” but the experimental tables do not include a clear legend or appendix entry listing the exact identities and sources of these models and sets.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which identifies key areas where additional theoretical and empirical support would strengthen the manuscript. We address each major comment below and describe the revisions we will make.
Point-by-point responses
Referee: Method section deriving the optimal policy: the manuscript states that the global threshold τ approximates the instance-specific baseline b(x) but supplies neither an error bound nor conditions on reward variance or baseline variation across prompts x under which the empirical statistic is guaranteed to be a valid proxy; without this analysis the central substitution remains formally unsupported.
Authors: We acknowledge that the current manuscript does not supply a formal error bound or explicit conditions guaranteeing the validity of τ as a proxy for b(x). The substitution is motivated by the intractability of instance-specific baselines and the practical utility of a global threshold estimated from unpaired reward statistics. In the revised manuscript we will add a dedicated paragraph in Section 3 that (i) states the approximation explicitly, (ii) provides sufficient conditions based on bounded reward variance and limited variation of b(x) across prompts (supported by measurements reported in the appendix), and (iii) derives a simple probabilistic bound on the deviation using Chebyshev’s inequality. This addition will clarify the regimes in which the method is expected to be reliable. revision: partial
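The Chebyshev-style bound the authors promise would plausibly take the following form (a hedged reconstruction; the symbols μ_b and σ_b² for the mean and variance of b(x) over prompts are our notation, not the paper's):

```latex
% Let b(x) have mean \mu_b = \mathbb{E}_x[b(x)] and variance \sigma_b^2
% over prompts, and let \tau be the estimated global threshold. Applying
% Chebyshev's inequality to the squared deviation gives
\Pr_x\bigl(\,|b(x) - \tau| \ge \epsilon\,\bigr)
  \;\le\; \frac{\sigma_b^2 + (\mu_b - \tau)^2}{\epsilon^2}.
```

The bound is vacuous unless baseline variation across prompts is small relative to ε, which is precisely the condition the referee asks the authors to verify empirically.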
Referee: Experiments section: the reported improvements over baselines are not accompanied by ablations that hold the binary framing and weighting fixed while varying the threshold choice (or vice versa), nor by comparisons against even approximate instance-level baselines; consequently it is impossible to determine whether gains arise from the proposed approximation or from other modeling choices.
Authors: We agree that isolating the contribution of the threshold approximation requires targeted ablations. In the revised version we will add two new experiments: (1) an ablation that fixes the binary classification framing and confidence weighting while varying only the threshold choice (mean, median, and selected quantiles), and (2) a comparison against an approximate instance-level baseline obtained by Monte-Carlo sampling of multiple outputs per prompt on a held-out subset. These results will be presented in an expanded table and discussed in the experiments section to attribute performance gains more precisely. revision: yes
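The proposed ablation grid over threshold choices can be sketched as a small helper; the exact grid (which quantiles, whether a trimmed mean is included) is not specified in the rebuttal, so the choices below are assumptions.

```python
import numpy as np

def threshold_candidates(rewards):
    """Ablation grid over global-threshold choices named in the rebuttal:
    mean, median, and selected quantiles of the empirical reward scores.
    Illustrative helper; the paper's exact grid is not specified."""
    r = np.asarray(rewards, dtype=float)
    grid = {"mean": r.mean(), "median": float(np.median(r))}
    for q in (0.25, 0.75):                      # assumed quantile choices
        grid[f"q{int(q * 100)}"] = float(np.quantile(r, q))
    return grid
```

Running the fixed binary objective once per candidate in this grid, with everything else held constant, is the ablation that would isolate the threshold's contribution.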
Referee: Section introducing the confidence weighting: the weighting term is motivated as emphasizing samples whose scores deviate strongly from τ, yet no analysis is given of how the weighting interacts with the approximation error of τ itself or of its effect on the effective objective relative to the original KL-regularized loss.
Authors: We will add a short theoretical remark in the method section that analyzes the interaction. The weighted objective can be expressed as a reweighted version of the original KL-regularized loss; the difference introduced by replacing b(x) with τ is bounded by the expectation of |τ − b(x)| multiplied by the confidence weight. We will also report empirical measurements of this error term across the evaluated reward models, showing that it remains small under the operating conditions of our experiments. This analysis will be included as a new proposition with supporting discussion. revision: yes
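Spelled out, the claimed bound would read roughly as follows (a hedged reconstruction from the rebuttal's wording; the Lipschitz assumption on the per-sample loss is ours, not the authors'):

```latex
% \mathcal{L}_\tau: confidence-weighted objective using the global threshold
% \mathcal{L}_b: the same objective using the oracle baseline b(x)
% w(x,y): confidence weight; assume the per-sample loss is 1-Lipschitz in
% its baseline argument. Then the two objectives differ by at most
\bigl|\,\mathcal{L}_\tau - \mathcal{L}_b\,\bigr|
  \;\le\; \mathbb{E}_{x,y}\bigl[\, w(x,y)\,\lvert \tau - b(x) \rvert \,\bigr].
```

This makes the referee's concern concrete: the weighting amplifies exactly the samples where the threshold's approximation error enters the bound, so the empirical measurements the authors promise are load-bearing.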
Circularity Check
No significant circularity in derivation chain
full rationale
The paper starts from the standard KL-regularized alignment objective, derives the implicit comparison to an instance-specific baseline (a known result in the RLHF literature), and then introduces a practical heuristic replacement by a global threshold computed from reward score statistics. This substitution is presented as an engineering approximation rather than a derived equality. The experimental results compare the resulting method against prior approaches on separate test sets and reward models, providing external validation. No equation or claim reduces the final performance improvement to the input data by construction, nor does any load-bearing step rely on self-citation for uniqueness or ansatz smuggling. The framework is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- global threshold
axioms (1)
- domain assumption KL-regularized alignment objective defines the optimal policy via comparison to an instance-specific baseline