{"total":16,"items":[{"citing_arxiv_id":"2606.01382","ref_index":105,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Efficient Exploration for Iterative Nash Preference Optimization","primary_cat":"cs.LG","submitted_at":"2026-05-31T18:11:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"An explicitly exploratory iterative NLHF method achieves O(sqrt(T)) regret for Nash equilibria under general preference models, removing the exponential KL dependence that plagues standard iterative approaches.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.24939","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Global linear convergence of entropy-regularized softmax policy gradient beyond tabular MDPs","primary_cat":"cs.LG","submitted_at":"2026-05-24T08:38:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Establishes global linear convergence of entropy-regularized policy gradient in continuous MDPs with log-linear softmax policies under Q-realizability by bounding non-uniform PL constants in two feature regimes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.24357","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Refined Analysis of Entropy-Regularized Actor-Critic","primary_cat":"cs.LG","submitted_at":"2026-05-23T02:41:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Exact critic in entropy-regularized actor-critic yields strong variance reduction, enabling Õ(log(1/ε)) sample complexity for ε-optimal regularized value.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22622","ref_index":100,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A note on convergence of Wasserstein policy optimization","primary_cat":"cs.LG","submitted_at":"2026-05-21T15:32:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The note claims linear convergence of WPO in entropy-regularized MDPs by combining mean-field gradient flow analysis with a local log-Sobolev inequality under a regularity assumption.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22507","ref_index":38,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Generative Modeling by Value-Driven Transport","primary_cat":"cs.LG","submitted_at":"2026-05-21T13:57:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A control-theoretic linear program yields value-driven transport policies for generative modeling with straight paths and simulation-free training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18591","ref_index":131,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Randomized Advantage Transformation (RAT): Computing Natural Policy Gradients via Direct Backpropagation","primary_cat":"cs.LG","submitted_at":"2026-05-18T16:05:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RAT reformulates regularized natural policy gradients as vanilla gradients with a transformed advantage, computed efficiently via randomized block Kaczmarz iterations on on-policy data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15651","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Sharp Spectral Thresholds for Logit Fixed Points","primary_cat":"cs.LG","submitted_at":"2026-05-15T06:11:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"For finite-dimensional affine logit systems the sharp dimension-free stability threshold is β‖ΠWΠ‖_{T→T}<2, extending the certified regime beyond classical conservative bounds.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11020","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Trust Region Inverse Reinforcement Learning: Explicit Dual Ascent using Local Policy Updates","primary_cat":"cs.LG","submitted_at":"2026-05-10T15:32:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TRIRL enables explicit dual-ascent IRL via trust-region local policy updates that guarantee monotonic improvement without full RL solves per iteration, outperforming prior imitation methods by 2.4x aggregate IQM and recovering generalizable rewards.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09214","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability","primary_cat":"cs.LG","submitted_at":"2026-05-09T23:17:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The paper establishes the first tilde O(epsilon^{-1}) upper bounds and matching lower bounds for forward-KL-regularized offline contextual bandits under single-policy concentrability in both tabular and general function approximation settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08946","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Single Deep Preference-Conditioned Policy for Learning Pareto Coverage Sets","primary_cat":"cs.LG","submitted_at":"2026-05-09T13:35:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A single preference-conditioned policy achieves unique and Lipschitz-continuous Pareto coverage in multi-objective MDPs via a new mirror-descent policy iteration algorithm with O(1/k) convergence.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"For future work, we highlight two directions. (i) Extend the analysis to function approximation by revisiting approximate-DP error propagation through our \"soft fixed point + exponential improvement\" structure [43]. (ii) Characterize the existence/selection and ω-continuity of limiting policies π∞(ω), potentially via homotopy/temperature-scheduling frameworks [44]. 9 References [1] D. J. White. Multi-objective infinite-horizon discounted markov decision processes.Journal of Mathematical Analysis and Applications, 89(2):639-647, 1982. doi: 10.1016/0022-247X(82) 90122-6. [2] Kristof Van Moffaert and A Nowé. Multi-objective reinforcement learning using sets of pareto dominating policies.J. Mach. Learn. Res., 15(107):3483-3512, 2014."},{"citing_arxiv_id":"2605.07775","ref_index":110,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"POETS: Uncertainty-Aware LLM Optimization via Compute-Efficient Policy Ensembles","primary_cat":"cs.LG","submitted_at":"2026-05-08T14:16:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"POETS uses compute-efficient LLM policy ensembles to implicitly perform KL-regularized Thompson sampling, delivering O(sqrt(T gamma_T)) regret bounds and state-of-the-art sample efficiency in scientific discovery tasks such as protein search and quantum circuit design.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19695","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Planning in entropy-regularized Markov decision processes and games","primary_cat":"cs.LG","submitted_at":"2026-04-21T17:17:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SmoothCruiser achieves O~(1/epsilon^4) problem-independent sample complexity for value estimation in entropy-regularized MDPs and games via a generative model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17415","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Reward Score Matching: Unifying Reward-based Fine-tuning for Flow and Diffusion Models","primary_cat":"cs.LG","submitted_at":"2026-04-19T12:47:52+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.23927","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Stationary Reweighting Yields Local Convergence of Soft Fitted Q-Iteration","primary_cat":"stat.ML","submitted_at":"2025-12-30T00:58:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Stationary reweighting of soft fitted Q-iteration yields finite-sample local linear convergence to the projected fixed point under approximate realizability and controlled weighting error, even without Bellman completeness.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.00286","ref_index":38,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Recursive Entropic Risk Optimization in Discounted MDPs: Sample Complexity Bounds with a Generative Model","primary_cat":"cs.LG","submitted_at":"2025-05-30T22:27:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Derives PAC-type upper bounds and matching lower bounds on sample complexity for value and policy learning under recursive entropic risk measures, with exponential dependence on |β|/(1-γ).","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1907.04214","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Entropic Regularization of Markov Decision Processes","primary_cat":"cs.LG","submitted_at":"2019-07-06T15:02:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Using alpha-divergences for entropic regularization in MDPs unifies actor-critic architectures via closed-form policy improvement and provides asymptotic analysis on standard RL problems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}