{"total":14,"items":[{"citing_arxiv_id":"2605.21214","ref_index":99,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Behavior-Consistent Deep Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-20T14:08:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"QED bounds cross-run KL divergence in Boltzmann policies by setting temperature proportional to Q-disagreement and reduces return variance by two orders of magnitude on 18 continuous-control tasks without performance loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12004","ref_index":58,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning Agentic Policy from Action Guidance","primary_cat":"cs.CL","submitted_at":"2026-05-12T11:54:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting.Advances in Neural Information Processing Systems, 36:74952-74965, 2023. [57] Hado Van Hasselt, Yotam Doron, Florian Strub, Matteo Hessel, Nicolas Sonnerat, and Joseph Modayil. Deep reinforcement learning and the deadly triad.arXiv preprint arXiv:1812.02648, 2018. [58] Mel Vecerik, Todd Hester, Jonathan Scholz, Fumin Wang, Olivier Pietquin, Bilal Piot, Nicolas Heess, Thomas Rothörl, Thomas Lampe, and Martin Riedmiller. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards.arXiv preprint arXiv:1707.08817, 2017. [59] Weixun Wang, XiaoXiao Xu, Wanhe An, Fangwen Dai, Wei Gao, Yancheng He, Ju Huang,"},{"citing_arxiv_id":"2605.06373","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Beyond the Independence Assumption: Finite-Sample Guarantees for Deep Q-Learning under $\\tau$-Mixing","primary_cat":"stat.ML","submitted_at":"2026-05-07T14:52:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Finite-sample risk bounds for DQN with ReLU networks are extended to τ-mixing data, showing an extra dimensionality penalty in the convergence rate due to dependence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05812","ref_index":45,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities","primary_cat":"cs.AI","submitted_at":"2026-05-07T07:47:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[88,92] 97.3 [96,98] 97.0 [95,98] 9939 humanoid-md 72.8 [63,87] 71.1 [69,73] 96.2 [94,98] 40.4 [39,42] 28.6 [25,32] 65.2 [54,76] 20.5 [11,31] 7.9 [6,10] 48.8 [36,62] 15.4 [8,25] 22.2 [14,30] 3.5 [2,5] 23.1 [20,26] antmaze-giant0.1 [0,0] 37.5 [34,41] 44.8 [33,56] 4.8 [0,14] 28.8 [26,32] 53.3 [51,56] 23.7 [12,36] 19.0 [18,20] 65.4 [62,69] 22.2 [12,34] 40.7 [37,45] 57.1 [53,62] 3.2 [2,4] Total 48.6 [46,51] 54.5 [54,55] 71.2 [69,74] 33.0 [31,35] 36.4 [34,39] 66.0 [63,69] 22.8 [20,26] 23.5 [22,24] 44.5 [42,47] 52.9 [50,56] 61.4 [60,63] 38 13 Table 5: humanoidmaze-giant per-task success rate (%)at the end of online training (mean across seeds), for the runs in Figure 1.Boldmarks methods within 95% of the per-row maximum."},{"citing_arxiv_id":"2605.01968","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AdamO: A Collapse-Suppressed Optimizer for Offline RL","primary_cat":"cs.LG","submitted_at":"2026-05-03T16:53:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AdamO modifies Adam with an orthogonality correction to ensure the spectral radius of the TD update operator stays below one, providing a theoretical stability guarantee for offline RL.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01862","ref_index":199,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL","primary_cat":"cs.LG","submitted_at":"2026-05-03T13:11:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markovian datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23056","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"K-Score: Kalman Filter as a Principled Alternative to Reward Normalization in Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-04-24T22:54:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A 1D Kalman filter for online reward mean estimation accelerates convergence and lowers variance in policy gradient RL compared to standard normalization on LunarLander and CartPole.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04539","ref_index":87,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control","primary_cat":"cs.LG","submitted_at":"2026-04-06T09:03:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FlashSAC improves training speed and final performance of off-policy RL on high-dimensional robot tasks by reducing update frequency, increasing model scale, and bounding norms to limit critic error accumulation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Gymnasium: A standard interface for reinforcement learning environments.arXiv preprint arXiv:2407.17032, 2024. [86] Saran Tunyasuvunakool, Alistair Muldal, Yotam Doron, Siqi Liu, Steven Bohez, Josh Merel, Tom Erez, Timothy Lillicrap, Nicolas Heess, and Yuval Tassa. dm_control: Software and tasks for continuous control.Software Impacts, 6:100022, 2020. [87] Hado Van Hasselt, Yotam Doron, Florian Strub, Matteo Hessel, Nicolas Sonnerat, and Joseph Modayil. Deep reinforcement learning and the deadly triad.arXiv preprint arXiv:1812.02648, 2018. [88] Twan Van Laarhoven. L2 regularization versus batch and weight normalization.arXiv preprint arXiv:1706.05350, 2017. [89] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and"},{"citing_arxiv_id":"2510.02590","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Use the Online Network If You Can: Towards Fast and Stable Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2025-10-02T21:48:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MINTO sets bootstrapped targets to the minimum of online and target network estimates, yielding faster stable value learning across online/offline RL and discrete/continuous actions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.08660","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Replicable Reinforcement Learning with Linear Function Approximation","primary_cat":"cs.LG","submitted_at":"2025-09-10T14:56:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces replicable random design regression and covariance estimation tools to enable the first provably efficient replicable RL algorithms for linear MDPs in generative and episodic settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.00275","ref_index":38,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Deep Double Q-learning","primary_cat":"cs.LG","submitted_at":"2025-06-30T21:32:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Deep Double Q-learning explicitly trains two Q-functions in deep RL, outperforming Double DQN on 47 of 57 Atari games while further reducing overestimation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2411.04832","ref_index":105,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Plasticity Loss in Deep Reinforcement Learning: A Survey","primary_cat":"cs.AI","submitted_at":"2024-11-07T16:13:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Survey unifies the definition of plasticity loss in DRL, taxonomizes over 50 mitigations, identifies evaluation gaps, and finds general regularization often outperforms domain-specific methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2005.01643","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems","primary_cat":"cs.LG","submitted_at":"2020-05-04T17:00:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"Offline RL promises to extract high-utility policies from static datasets but faces fundamental challenges that current methods only partially address.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1911.11361","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Behavior Regularized Offline Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2019-11-26T06:11:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Behavior-regularized actor-critic methods achieve strong offline RL results with simple regularization, rendering many recent technical additions unnecessary.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}