{"total":17,"items":[{"citing_arxiv_id":"2606.27771","ref_index":29,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"NormGuard: Reward-Preserving Norm Constraints in Flow-Matching Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-06-26T06:56:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"NormGuard adds a training-time hinge penalty on velocity norm inflation in flow-matching RL to improve MLLM-judged image quality and forensic realism while preserving reward across multiple setups.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23897","ref_index":32,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ETCHR: Editing To Clarify and Harness Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-22T17:58:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A decoupled question-conditioned image editor trained via supervised imitation then VLM-reward enhancement improves MLLM visual reasoning Pass@1 by 4.6-5.5 points across models and tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21573","ref_index":55,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models","primary_cat":"cs.CV","submitted_at":"2026-05-20T17:59:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Lens is a 3.8B-parameter text-to-image model that reaches competitive or superior performance to >6B-parameter systems using 19.3% of the training compute of Z-Image through a densely captioned 800M dataset, multi-resolution batching, semantic VAE, strong language encoder, RL 
fine-tuning, and 4-step","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15803","ref_index":80,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Embedding-perturbed Exploration Preference Optimization for Flow Models","primary_cat":"cs.CV","submitted_at":"2026-05-15T09:56:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"E²PO uses embedding-level perturbations to maintain intra-group variance and discriminative signal in RL-based preference optimization for generative flow models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15190","ref_index":58,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO","primary_cat":"cs.CV","submitted_at":"2026-05-14T17:59:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RAVEN aligns training and inference for causal autoregressive video diffusion via interleaved rollout repacking and introduces CM-GRPO for direct RL on consistency-model kernels, claiming better quality than recent baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11480","ref_index":34,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Efficient Adjoint Matching for Fine-tuning Diffusion Models","primary_cat":"cs.LG","submitted_at":"2026-05-12T03:55:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EAM reformulates adjoint matching for diffusion fine-tuning with linear base drift to allow efficient deterministic sampling and closed-form adjoints while matching or exceeding 
prior performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations. In ICLR, 2021. [33] J. Wang, J. Liang, J. Liu, H. Liu, G. Liu, J. Zheng, W. Pang, A. Ma, Z. Xie, X. Wang, M. Wang, P. Wan, and X. Liang. GRPO-Guard: Mitigating implicit over-optimization in flow matching via regulated clipping. arXiv:2510.22319, 2025. [34] Y. Wang, Z. Li, Y. Zang, Y. Zhou, J. Bu, C. Wang, Q. Lu, C. Jin, and J. Wang. Pref-GRPO: Pairwise preference reward-based grpo for stable text-to-image reinforcement learning. arXiv:2508.20751, 2025. [35] X. Wu, Y. Hao, M. Zhang, K. Sun, Z. Huang, G. Song, Y. Liu, and H. Li. Deep reward supervisions for tuning text-to-image diffusion models. In ECCV, 2024."},{"citing_arxiv_id":"2605.10937","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping","primary_cat":"cs.CV","submitted_at":"2026-05-11T17:59:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to reduce reward hacking and improve performance over GRPO baselines.","context_count":1,"top_context_role":"background","top_context_polarity":"support","context_text":"quality or semantic alignment, reward hacking encourage the model to capitalize on biases in the reward model [8]. Consequently, the model could adopt shortcut solutions that inflate the reward signal, including overemphasis on reward-favored visual patterns or reliance on spurious correlations, leading to perceptually implausible or misaligned generations. 
As demonstrated in [9], reward hacking can be attributed to minimal reward differences among generated images (illustrative examples in Appendix A), leading to illusory advantages. Based on this, let's consider the following example: Preprint. arXiv:2605.10937v1 [cs.CV] 11 May 2026 Example: We assume that CLIPScore is employed as the reward model. First, consider a group of"},{"citing_arxiv_id":"2605.10983","ref_index":20,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment","primary_cat":"cs.LG","submitted_at":"2026-05-09T04:41:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TMPO uses Softmax Trajectory Balance to match policy probabilities over multiple trajectories to a Boltzmann reward distribution, improving diversity by 9.1% in diffusion alignment tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27505","ref_index":58,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Leveraging Verifier-Based Reinforcement Learning in Image Editing","primary_cat":"cs.CV","submitted_at":"2026-04-30T06:54:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Edit-R1 builds a CoT-based reasoning reward model (RRM) via SFT and GCPO, then applies it with GRPO to improve image editing models such as FLUX.1-kontext.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[56] Peng Wang, Yichun Shi, Xiaochen Lian, Zhonghua Zhai, Xin Xia, Xuefeng Xiao, Weilin Huang, and Jianchao Yang. Seededit 3.0: Fast and high-quality generative image editing. arXiv preprint arXiv:2506.05083, 2025. [57] Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. 
Unified multimodal chain-of-thought reward model through reinforcement fine-tuning. arXiv preprint arXiv:2505.03318, 2025. [58] Yibin Wang, Zhimin Li, Yuhang Zang, Yujie Zhou, Jiazi Bu, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. Pref-grpo: Pairwise preference reward-based grpo for stable text-to-image reinforcement learning. arXiv preprint arXiv:2508.20751, 2025. [59] Hongyang Wei, Baixin Xu, Hongbo Liu, Cyrus Wu, Jie Liu, Yi Peng, Peiyu Wang, Zexiang Liu, Jingwen He,"},{"citing_arxiv_id":"2604.19406","ref_index":46,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HP-Edit: A Human-Preference Post-Training Framework for Image Editing","primary_cat":"cs.CV","submitted_at":"2026-04-21T12:29:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"HP-Edit introduces a post-training framework and RealPref-50K dataset that uses a VLM-based HP-Scorer to align diffusion image editing models with human preferences, improving outputs on Qwen-Image-Edit-2509.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"preprint arXiv:2503.12575, 2025. 3 [45] Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228-8238, 2024. 2, 3 [46] Yibin Wang, Zhimin Li, Yuhang Zang, Yujie Zhou, Jiazi Bu, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. Pref-grpo: Pairwise preference reward-based grpo for stable text-to-image reinforcement learning. arXiv preprint arXiv:2508.20751, 2025. 3 [47] Zihao Wang, Yuxiang Wei, Fan Li, Renjing Pei, Hang Xu, and Wangmeng Zuo. 
Ace: Anti-editing concept erasure in"},{"citing_arxiv_id":"2604.18966","ref_index":219,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Self-Improving Tabular Language Models via Iterative Reward-Guided Post-Training","primary_cat":"cs.LG","submitted_at":"2026-04-21T01:29:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TabGRAA applies group-relative advantage alignment in an iterative reward-guided post-training loop to improve tabular language model generators on fidelity, utility, and privacy trade-offs across five benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15311","ref_index":47,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories","primary_cat":"cs.CV","submitted_at":"2026-04-16T17:59:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"arXiv preprint arXiv:2503.12575, 2025. [46] Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228-8238, 2024. [47] Yibin Wang, Zhimin Li, Yuhang Zang, Yujie Zhou, Jiazi Bu, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. 
Pref-grpo: Pairwise preference reward-based grpo for stable text-to-image reinforcement learning. arXiv preprint arXiv:2508.20751, 2025. [48] Yibin Wang, Yuhang Zang, Hao Li, Cheng Jin, and Jiaqi Wang. Unified reward model for multimodal understanding"},{"citing_arxiv_id":"2604.14910","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Reward-Aware Trajectory Shaping for Few-step Visual Generation","primary_cat":"cs.CV","submitted_at":"2026-04-16T11:52:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"RATS lets few-step visual generators surpass multi-step teachers by shaping trajectories with reward-based adaptive guidance instead of strict imitation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Murphy, Tim Salimans, Ben Poole, and Ruiqi Gao. 2024. EM Distillation for One-step Diffusion Models. In NeurIPS. [40] Feng Xu, Guangyao Zhai, Xin Kong, Tingzhong Fu, Daniel FN Gordon, Xueli An, and Benjamin Busam. 2025. STARE-VLA: Progressive Stage-Aware Reinforcement for Fine-Tuning Vision-Language-Action Models. arXiv preprint arXiv:2512.05107 (2025). [41] Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang, Jiajun Xu, Yuan Wang, Wenbo Duan, Shen Yang, Qunlin Jin, Shurun Li, et al. 2024. Visionreward: Fine-grained multi-dimensional human preference learning for image and video generation. arXiv preprint arXiv:2412.21059 (2024). 
[42] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang,"},{"citing_arxiv_id":"2604.13602","ref_index":187,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges","primary_cat":"cs.LG","submitted_at":"2026-04-15T08:11:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Granular Process Verification aims to reduce objective compression by supervising intermediate steps [178, 186]. For example, ContextRL [186] provides the reward model with full reference solutions, while RuCL [178] uses stratified rubrics to evaluate grounding. Integrating external verifiers like detection models or SAM-2 allows for step-by-step validation [187]. In process reward modeling, PS-GRPO [208] identifies \"drop-moments\" to penalize false-positive rollouts. Perception-Reasoning Synergy prevents models from using language shortcuts by requiring visual anchoring [166, 167, 179]. 
PEARL [167] uses a fidelity gate to halt updates on samples with failed perception, while DoGe [179] forces the model to analyze visual context before seeing the question."},{"citing_arxiv_id":"2512.01236","ref_index":36,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards","primary_cat":"cs.CV","submitted_at":"2025-12-01T03:25:49+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A data-generation pipeline plus pairwise subject-consistency rewards in RL improve consistency and prompt adherence for multi-subject personalized image generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.21583","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Principled RL for Flow Matching Emerges from the Chunk-level Policy Optimization","primary_cat":"cs.CV","submitted_at":"2025-10-24T15:50:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GCPO shifts RL policy optimization for flow matching from step-level to chunk-level grouping of consecutive denoising steps, reporting up to 43% relative gains over GRPO on T2I benchmarks and preference tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.16888","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback","primary_cat":"cs.CV","submitted_at":"2025-10-19T15:38:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UniWorld-V2 applies policy optimization via DiffusionNFT and 
MLLM logit feedback with group filtering to reach state-of-the-art scores of 4.49 on ImgEdit and 7.83 on GEdit-Bench while remaining model-agnostic.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}