{"total":12,"items":[{"citing_arxiv_id":"2605.18675","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"COOPO: Cyclic Offline-Online Policy Optimization Algorithm","primary_cat":"cs.LG","submitted_at":"2026-05-18T17:15:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"COOPO is a cyclic offline-online RL algorithm that repeatedly anchors the policy to a dataset via KL-regularized updates then fine-tunes online, claiming better sample efficiency and monotonic improvement under coverage assumptions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14779","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Peng's Q($\\lambda$) for Conservative Value Estimation in Offline Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-14T12:48:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CPQL adapts the multi-step Peng's Q(λ) operator for conservative offline value estimation, achieving performance guarantees and empirical gains over single-step baselines on D4RL while supporting offline-to-online fine-tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14497","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ROAD: Adaptive Data Mixing for Offline-to-Online Reinforcement Learning via Bi-Level Optimization","primary_cat":"cs.LG","submitted_at":"2026-05-14T07:35:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ROAD formulates data mixing as a bi-level optimization problem solved via multi-armed bandit to adaptively balance offline priors and online updates in 
RL.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10289","ref_index":13,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Sample-Mean Anchored Thompson Sampling for Offline-to-Online Learning with Distribution Shift","primary_cat":"cs.LG","submitted_at":"2026-05-11T09:50:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Anchor-TS defines arm indices as the median of an online posterior sample, a hybrid posterior sample, and the online sample mean to correct distribution-shift bias and safely accelerate online learning with offline data.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"i (t) +N iµ(off) i Ti(t) +N i + NiVi Ti(t) +N i .(12) Combining the definition of¬Eµ(hyb) i (t) ˆµ(hyb) i (t)−µ (on) i > s Lt Ti(t) +N i + NiVi Ti(t) +N i , with (12), we can derive the following relation: ¬Eµ(hyb) i (t)⊆ ( Ti(t)ˆµ(on) i (t) +N iˆµ(off) i Ti(t) +N i − Ti(t)µ(on) i (t) +N iµ(off) i Ti(t) +N i > s Lt Ti(t) +N i ) .(13) Noting that in (13), E \" Ti(t)ˆµ(on) i (t) +N iˆµ(off) i Ti(t) +N i # = Ti(t)µ(on) i (t) +N iµ(off) i Ti(t) +N i . 18 GivenT i(t) =s, by Chernoff-Hoeffding bounds (Lemma 9), we have Pr sˆµ(on) i (t) +N iˆµ(off) i s+N i −E \" sˆµ(on) i (t) +N iˆµ(off) i s+N i # > r Lt s+N i ! ≤2e −(s+Ni) Lt s+Ni ≤ δt t . 
Thus Pr \u0010 ¬Eµ(hyb) i (t) \u0011 = Pr"},{"citing_arxiv_id":"2605.05863","ref_index":13,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SOPE: Stabilizing Off-Policy Evaluation for Online RL with Prior Data","primary_cat":"cs.LG","submitted_at":"2026-05-07T08:32:09+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SOPE dynamically controls offline training length in online RL using actor-aligned OPE on validation data to stop when benefits saturate, achieving up to 45.6% better performance and 22x less computation on Minari tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00416","ref_index":44,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning While Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies","primary_cat":"cs.RO","submitted_at":"2026-05-01T05:20:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LWD is a fleet-scale offline-to-online RL framework that continually improves pretrained VLA policies using autonomous rollouts and human interventions, reaching 95% average success on real-world manipulation tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[42] P. J. Ball, L. Smith, I. Kostrikov, and S. Levine, \"Efficient online reinforcement learning with offline data,\" in International Conference on Machine Learning. PMLR, 2023, pp. 1577-1594. [43] A. Nair, A. Gupta, M. Dalal, and S. Levine, \"Awac: Accelerating online reinforcement learning with offline datasets,\" arXiv preprint arXiv:2006.09359, 2020. [44] Y. Song, Y. Zhou, A. Sekhari, J. A. Bagnell, A. Krishnamurthy, and W. 
Sun, \"Hybrid rl: Using both offline and online data can make rl efficient,\" arXiv preprint arXiv:2210.06718, 2022. [45] A. Wagenmaker, M. Nakamoto, Y. Zhang, S. Park, W. Yagoub, A. Nagabandi, A. Gupta, and S. Levine, \"Steering your diffusion policy with latent space reinforcement learning,\" arXiv preprint arXiv:2506."},{"citing_arxiv_id":"2604.17919","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Fisher Decorator: Refining Flow Policy via a Local Transport Map","primary_cat":"cs.LG","submitted_at":"2026-04-20T07:54:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Fisher Decorator refines flow policies in offline RL via a local transport map and Fisher-matrix quadratic approximation of the KL constraint, yielding controllable error near the optimum and SOTA benchmark results.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Decision transformer: Reinforcement learning via sequence modeling. Advances in neural information processing systems, 34:15084-15097, 2021. [54] Seunghyun Lee, Younggyo Seo, Kimin Lee, Pieter Abbeel, and Jinwoo Shin. Offline-to-online reinforcement learning via balanced replay and pessimistic q-ensemble. In Conference on Robot Learning, pages 1702-1712. PMLR, 2022. [55] Yuda Song, Yifei Zhou, Ayush Sekhari, J Andrew Bagnell, Akshay Krishnamurthy, and Wen Sun. Hybrid rl: Using both offline and online data can make rl efficient. arXiv preprint arXiv:2210.06718, 2022. [56] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. 
In International conference on machine"},{"citing_arxiv_id":"2604.13966","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Provably Efficient Offline-to-Online Value Adaptation with General Function Approximation","primary_cat":"cs.LG","submitted_at":"2026-04-15T15:17:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Offline-to-online value adaptation in RL has a minimax lower bound matching pure online learning in hard cases, yet O2O-LSVI improves sample complexity under a novel structural condition on pretrained Q-functions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08958","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"WOMBET: World Model-Based Experience Transfer for Robust and Sample-efficient Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-04-10T04:57:54+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04142","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models","primary_cat":"cs.CV","submitted_at":"2026-04-05T15:00:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"stepwise decay to the rewards of trajectories in the buffer, so that older samples 
naturally become easier to replace, allowing newer, higher-quality trajectories to be incorporated more readily. Rollout with buffer. For the rollout stage of OP-GRPO, we adopt a hybrid RL paradigm, which follows the general philosophy of hybrid offline-to-online training [26,34]. In each batch, prompts are composed of a majority of prompts drawn directly from the original dataset and a minority sampled from the buffer Boff. This design enhances sample diversity while enabling effective reuse of high-value off-policy trajectories. Specifically, for prompts sampled directly from the dataset, G trajectories are generated per prompt."},{"citing_arxiv_id":"2510.21060","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"On the Sample Complexity of Differentially Private Policy Optimization","primary_cat":"cs.LG","submitted_at":"2025-10-24T00:21:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Differential privacy in policy optimization adds sample complexity costs that often appear as lower-order terms rather than dominating the bounds.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.07986","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EXPO: Stable Reinforcement Learning with Expressive Policies","primary_cat":"cs.LG","submitted_at":"2025-07-10T17:57:46+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EXPO stabilizes online RL for expressive policies by training a base policy with imitation and using a lightweight Gaussian edit policy to select higher-value actions on the fly for sampling and TD backups.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}