{"total":15,"items":[{"citing_arxiv_id":"2605.21661","ref_index":51,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Hierarchical Variational Policies for Reward-Guided Diffusion","primary_cat":"cs.LG","submitted_at":"2026-05-20T19:13:28+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A hierarchical variational formulation amortizes test-time guidance in diffusion models to achieve strong quality-speed tradeoffs with significantly reduced inference compute.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19750","ref_index":41,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"CPC-VAR:Continual Personalized and Compositional Generation in Visual Autoregressive Models","primary_cat":"cs.CV","submitted_at":"2026-05-19T12:18:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CPC-VAR adds Gradient-based Concept Neuron Selection for continual single-concept learning and a context-aware multi-branch composition strategy to reduce forgetting and entanglement in VAR-based personalized image generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17602","ref_index":32,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment","primary_cat":"cs.AI","submitted_at":"2026-05-17T19:00:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AutoRubric-T2I learns and selects explicit rubrics from preference pairs to guide VLM judges, producing high-quality interpretable rewards for T2I alignment with far less data than traditional Bradley-Terry 
models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12112","ref_index":72,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy","primary_cat":"cs.CV","submitted_at":"2026-05-12T13:29:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduced to preserve diversity and improve quality.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Pic [32] with 37,523 prompts, and evaluated on HPD [69], while SD3.5-M is evaluated on GenEval following [40]. For PEC and PCV AE, we tune λ ∈ {0.03, 0.05, 0.10}, with larger values to encourage diversity and smaller ones for quality. All other hyperparameters kept as in Flow-GRPO [40]. Metrics. Following MixGRPO [34], we evaluate generation quality using ImageReward [72], PickScore [32], Aesthetic Predictor v2.5 [15], CLIP [48], and Unified Reward [65]. For diversity, we sample 30 outputs per prompt for 400 HPD test prompts (12,000 images per method). We report DINO and CLIP feature variances, as well as Vendi Scores [20], which measure the effective rank of the normalized similarity matrix. 
Vendi Scores are computed in four feature spaces: V."},{"citing_arxiv_id":"2605.11723","ref_index":59,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating","primary_cat":"cs.CV","submitted_at":"2026-05-12T08:08:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CaC presents a new spatiotemporal concentrating reward model for video anomalies, built on a novel large-scale dataset and three-stage training with RL and IoU rewards, claiming 25.7% accuracy gains and 11.7% anomaly reduction.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"introduce temporal, spatial, and attribution IoU rewards to guide the VLM toward grounded, interpretable anomaly CoT reasoning and stable judgments. 2 Related Work Video Reward Models. With the development of generative models [46, 45, 47, 48], reward modeling has become a key technique for aligning them with human preferences. Early single-scalar approaches [59, 21, 30] are insufficient for capturing multi-faceted visual quality. Subsequent video-domain methods incorporate human-annotated ratings [23, 15] and Bradley-Terry loss [27, 32], with some frameworks unifying cross-task evaluation [53] or extending to multi-dimensional scoring [16, 51, 27]. 
More recent VLM-based reward models further integrate CoT reasoning [52], dynamic"},{"citing_arxiv_id":"2605.08354","ref_index":47,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria","primary_cat":"cs.AI","submitted_at":"2026-05-08T18:05:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text-to-image and editing benchmarks.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"For generative quality assessment, we adopt GenEval [11], DPG-Bench [15], TIIF (test-mini-short) [40], and UniGenBench++ [37] for text-to-image synthesis, complemented by GEdit-Bench [24] and ImgEdit [49] for editing tasks. Baselines and Implementation. For human preference evaluation, we compare against a suite of state-of-the-art trained reward models, including HPSv3 [28], PickScore [19], ImageReward [47], UnifiedReward [39] and UnifiedReward-Thinking [38], and EditReward [43], alongside representative VLM judges such as Qwen3-VL [2], GPT-5 [33], and Gemini 3.1 Pro [12]. 
Following the common practice in recent multimodal alignment and generation research [16, 22, 34], we adopt FLUX.1-dev [20] and Qwen-Image-Edit-2509 [41] as base models for image generation"},{"citing_arxiv_id":"2605.07800","ref_index":25,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models","primary_cat":"cs.CV","submitted_at":"2026-05-08T14:36:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SARA introduces semantic saliency to guide relational alignment in video diffusion models, improving text following and motion quality over prior alignment methods.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Tora [19]). (iii) Post-training preference optimization. Following the RLHF recipe [20], the VDM is fine-tuned against a reward model via GRPO-style on-policy exploration that turns the flow-matching ODE [21] into an SDE [22], DPO-style paired classification over preferred / rejected samples [23, 24], or ReFL-style differentiable-reward back-propagation [25]. Post-training is largely orthogonal to SARA's SFT-stage gains, and we leave such combinations to future work. 
Representation alignment for diffusion models. The REPA family is the closest prior art to SARA and shares a single template: regularise a generative DiT by matching a chosen statistic of its hidden states to a frozen visual or video foundation encoder."},{"citing_arxiv_id":"2605.07253","ref_index":44,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"LENS: Low-Frequency Eigen Noise Shaping for Efficient Diffusion Sampling","primary_cat":"cs.CV","submitted_at":"2026-05-08T05:22:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LENS shapes low-frequency eigen noise with a lightweight network to enable efficient, high-quality sampling in distilled diffusion models.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"The training procedure is summarized in Algorithm 1. We use Eq. (8) as the training loss L(ϕ), which consists of a regularization term and a reward term. The former acts as an L2 penalty on the coefficient residual, encouraging proximity to the original Gaussian prior. The latter is a weighted combination of multiple reward models, including CLIP [10], HPSv2.1 [41], ImageReward [44], and PickScore [18]. Additional details on the reward formulation are provided in Appendix B.2. 3.3 Complexity Analysis We analyze the computational complexity of our transformer-based noise modulation framework, LENS. 
The computational complexity of a standard transformer is O(n^2 r + n r^2), where n is the number of tokens and r is the representation dimension [38]."},{"citing_arxiv_id":"2605.06070","ref_index":37,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Arena as Offline Reward: Efficient Fine-Grained Preference Optimization for Diffusion Models","primary_cat":"cs.CV","submitted_at":"2026-05-07T11:56:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ArenaPO infers Gaussian capability distributions from pairwise preferences and applies truncated-normal latent inference to derive fine-grained offline rewards for preference optimization of text-to-image diffusion models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04494","ref_index":42,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Towards General Preference Alignment: Diffusion Models at Nash Equilibrium","primary_cat":"cs.LG","submitted_at":"2026-05-06T04:50:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Diff.-NPO frames diffusion alignment as a self-play game reaching Nash equilibrium and reports better text-to-image results than prior DPO-style methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04461","ref_index":42,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Stream-T1: Test-Time Scaling for Streaming Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-06T03:40:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Stream-T1 is a test-time scaling framework for streaming video generation using scaled noise propagation from history, reward pruning 
across short and long windows, and feedback-guided memory sinking to improve temporal consistency and visual quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27505","ref_index":64,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Leveraging Verifier-Based Reinforcement Learning in Image Editing","primary_cat":"cs.CV","submitted_at":"2026-04-30T06:54:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Edit-R1 builds a CoT-based reasoning reward model (RRM) via SFT and GCPO, then applies it with GRPO to improve image editing models such as FLUX.1-kontext.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.25427","ref_index":41,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"A Systematic Post-Train Framework for Video Generation","primary_cat":"cs.CV","submitted_at":"2026-04-28T09:34:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15311","ref_index":53,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories","primary_cat":"cs.CV","submitted_at":"2026-04-16T17:59:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip 
multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.07818","ref_index":9,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"DanceGRPO: Unleashing GRPO on Visual Generation","primary_cat":"cs.CV","submitted_at":"2025-05-12T17:59:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DanceGRPO applies GRPO to visual generation tasks to achieve stable policy optimization across diffusion models, rectified flows, multiple tasks, and diverse reward models, outperforming prior RL methods.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Aligning Diffusion Models and Rectified Flows. Diffusion models and rectified flows can also benefit significantly from alignment with human feedback, but the exploration remains primitive compared with LLMs. Key approaches in this area include: (1) Direct Policy Optimization (DPO)-style [12, 14, 42, 53, 54] methods, (2) direct backpropagation with reward signals [55], such as ReFL [9], and (3) policy gradient-based methods, including DPOK [19] and DDPO [18]. However, production-level models predominantly rely on DPO and ReFL, as previous policy gradient methods have demonstrated instability when applied to large-scale settings. Our work addresses this limitation, providing a robust solution to enhance stability and scalability."}],"limit":50,"offset":0}