{"total":12,"items":[{"citing_arxiv_id":"2605.19294","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DEFLECT: Delay-Robust Execution via Flow-matching Likelihood-Estimated Counterfactual Tuning for VLA Policies","primary_cat":"cs.RO","submitted_at":"2026-05-19T03:14:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DEFLECT is an offline post-training method that improves async VLA policy success rates under high inference delays by using flow-matching likelihood ratios on counterfactual fresh/stale action pairs from a frozen reference policy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17295","ref_index":50,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DISA: Offline Importance Sampling for Distribution-Matching LLM-RL","primary_cat":"cs.LG","submitted_at":"2026-05-17T07:14:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DISA decouples partition function estimation using offline importance sampling for distribution-matching LLM-RL, matching or exceeding online baselines like FlowRL on math and code benchmarks while retaining more strategy diversity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15458","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Video Models Can Reason with Verifiable Rewards","primary_cat":"cs.CV","submitted_at":"2026-05-14T22:40:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VideoRLVR uses SDE-GRPO optimization, dense decomposed rewards, and Early-Step Focus to train video diffusion models on verifiable reasoning tasks, outperforming supervised fine-tuning 
and other video generators on Maze, FlowFree, and Sokoban.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12625","ref_index":14,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Driving Intents Amplify Planning-Oriented Reinforcement Learning","primary_cat":"cs.RO","submitted_at":"2026-05-12T18:10:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DIAL expands continuous-action driving policies via intent-conditioned flow matching and multi-intent GRPO, lifting best-of-N preference scores above human demonstrations for the first time on WOD-E2E.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12379","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Discrete Flow Matching for Offline-to-Online Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-12T16:44:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DRIFT enables stable offline-to-online fine-tuning of CTMC policies in discrete RL via advantage-weighted discrete flow matching, path-space regularization, and candidate-set approximation.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"the CTMC, including both the jump destinations and the holding times (Line 13). For CTMCs, the KL divergence between path measures P^{u_θ} and P^{u_ref} admits a tractable decomposition via the Radon-Nikodym derivative [Kipnis and Landim, 2013, Zhang et al., 2025b]: KL( P^{u_θ} ∥ P^{u_ref} ) = E_{P^{u_θ}}[ Σ_{k: jumps} log( u_θ(X_{t_k^-} → X_{t_k}, t_k | s) / u_ref(X_{t_k^-} → X_{t_k}, t_k | s) ) + ∫_0^1 ( λ_ref(X_t, t | s) − λ_θ(X_t, t | s) ) dt ], (8) where λ_θ(i, t | s) = Σ_{j≠i} u_θ(i → j, t | s) is the total exit rate. 
This is a Monte Carlo plug-in surrogate where gradients flow only through the evaluated u_θ and λ_θ along the simulated path, so the term acts as a practical regularizer toward u_ref rather than as an unbiased gradient estimator of the path-space KL. For each sampled state s_b, we estimate this quantity with a Monte Carlo"},{"citing_arxiv_id":"2605.10821","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Unified Noise Steering for Efficient Human-Guided VLA Adaptation","primary_cat":"cs.RO","submitted_at":"2026-05-11T16:37:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UniSteer unifies human corrective actions and noise-space RL for VLA adaptation by inverting actions to noise targets, raising success rates from 20% to 90% in 66 minutes across four real-world manipulation tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Xiaoyu Liu, Yaxin Peng, Chaomin Shen, et al. Diffusion-vla: Generalizable and interpretable robot foundation model via self-generated reasoning. arXiv preprint arXiv:2412.03293, 2024. [47] David McAllister, Songwei Ge, Brent Yi, Chung Min Kim, Ethan Weber, Hongsuk Choi, Haiwen Feng, and Angjoo Kanazawa. Flow matching policy gradients. arXiv preprint arXiv:2507.21053, 2025. [48] Zhian Su, Weijie Kong, Haonan Dong, and Huixu Dong. Ig-rft: An interaction-guided rl framework for vla models in long-horizon robotic manipulation. arXiv preprint arXiv:2602.20715, 2026. [49] Rushuai Yang, Hecheng Wang, Chiming Liu, Xiaohan Yan, Yunlong Wang, Xuan Du, Shuoyu Yue, Yongcheng Liu, Chuheng Zhang, Lizhe Qi, et al. 
Aloe: Action-level off-policy evaluation"},{"citing_arxiv_id":"2605.08879","ref_index":29,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Preserving Foundational Capabilities in Flow-Matching VLAs through Conservative SFT","primary_cat":"cs.RO","submitted_at":"2026-05-09T10:59:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ConSFT is a gradient-scaling fine-tuning objective for flow-matching VLAs that bounds parameter disruption via model-confidence weighting, yielding over 20% better capability retention than vanilla SFT on LIBERO and RoboTwin.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Unlike standard autoregressive policies, continuous-time flow-matching architectures lack closed-form action likelihoods (log π). Evaluating these probability ratios strictly requires solving probability-flow Ordinary Differential Equations (ODEs), rendering direct computation prohibitively expensive during large-scale fine-tuning. To bypass this computational overhead, recent flow policy gradient methods [29] approximate the flow-matching policy ratio via the exponential loss difference: π_θ(a | s) / π_behavior(a | s) ≈ exp( L_flow(θ_behavior) − L_flow(θ) ) (1) where L_flow denotes the standard continuous-time regression objective. 
While existing studies explore applying online RL directly to flow-matching models [24, 30], we build upon this exponential approximation to formulate ConSFT, achieving bounded, sparse parameter updates entirely within"},{"citing_arxiv_id":"2605.08733","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Generative Actor-Critic with Soft Bridge Policies","primary_cat":"cs.LG","submitted_at":"2026-05-09T06:36:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SoftGAC defines a stochastic bridge from base to action latent that converts the MaxEnt objective into a tractable relative-entropy term reducible to control energy, achieving competitive returns with one-pass sampling.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[10] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. International Conference on Learning Representations (ICLR), 2021. [11] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022. [12] David McAllister, Songwei Ge, Brent Yi, Chung Min Kim, Ethan Weber, Hongsuk Choi, Haiwen Feng, and Angjoo Kanazawa. Flow matching policy gradients. arXiv preprint arXiv:2507.21053, 2025. [13] Allen Z Ren, Justin Lidard, Lars L Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Benjamin Burchfiel, Hongkai Dai, and Max Simchowitz. 
Diffusion policy policy optimization."},{"citing_arxiv_id":"2605.03065","ref_index":161,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OGPO: Sample Efficient Full-Finetuning of Generative Control Policies","primary_cat":"cs.LG","submitted_at":"2026-05-04T18:36:40+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23380","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think","primary_cat":"cs.LG","submitted_at":"2026-04-25T17:03:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"V-GRPO makes ELBO surrogates stable and efficient for online RL alignment of denoising models, delivering SOTA text-to-image performance with 2-3x speedups over MixGRPO and DiffusionNFT.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"transition kernels over the induced state-action space via policy gradient methods [1, 4, 5]. Recent work has extended this paradigm through GRPO-based variants and flow matching models [23, 45], alongside more sophisticated algorithmic designs addressing both theoretical and practical limitations [8, 19, 21, 39, 40]. While some online methods employ ELBO-based surrogates, DDPO [1] and FPO [27] have shown these to underperform on visual generation tasks. In this work, we revisit this simple approach and demonstrate that this limitation is not fundamental: a set of simple yet effective techniques unlocks its full potential, achieving state-of-the-art performance with significantly improved training efficiency. 
Concurrent with our work, Advantage Weighted"},{"citing_arxiv_id":"2604.16519","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Positive-Only Drifting Policy Optimization","primary_cat":"cs.LG","submitted_at":"2026-04-15T17:01:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PODPO is a likelihood-free generative policy optimization method for online RL that steers actions to high-return regions using only positive-advantage samples and local contrastive drifting.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06916","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling","primary_cat":"cs.LG","submitted_at":"2026-04-08T10:14:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[41] fine-tune text-to-image models by maximizing an offline reward-weighted denoising loss, while Fan et al. [42] extend this objective to an online setting with Wasserstein-2 regularization. Relatedly, Diffusion-DPO [43] offers a preference-optimization counterpart to this line, adapting DPO-style learning to diffusion model post-training without explicit rollouts. FMPG [44] and especially AWM [9] place this line on firmer policy-optimization footing by using the ELBO as a proxy for policy likelihood. This connection makes forward-process optimization a particularly compelling direction. 
DiffusionNFT [8] can be interpreted as an NFT-style [45] forward-process version of GRPO. Other works also explore forward-process variants"}],"limit":50,"offset":0}