{"total":14,"items":[{"citing_arxiv_id":"2605.16054","ref_index":144,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Ada-Diffuser: Latent-Aware Adaptive Diffusion for Decision-Making","primary_cat":"cs.LG","submitted_at":"2026-05-15T15:21:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Ada-Diffuser is a causal diffusion model that jointly learns observed interaction structure and underlying latent dynamics from minimal observations for adaptive planning and policy learning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14350","ref_index":106,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling","primary_cat":"cs.LG","submitted_at":"2026-05-14T04:22:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13401","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Trajectory-Level Data Augmentation for Offline Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-13T11:57:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Trajectory-based data augmentation exploits geometric relationships between rewards, values, and logging policies to enable effective offline RL from few suboptimal 
trajectories.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11975","ref_index":7,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Stochastic Minimum-Cost Reach-Avoid Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-12T11:31:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces RAPCs and a contraction Bellman operator for cost-optimal policies that satisfy probabilistic reach-avoid specifications in stochastic MDPs, with almost-sure convergence to local optima.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09157","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Revisiting Mixture Policies in Entropy-Regularized Actor-Critic","primary_cat":"cs.LG","submitted_at":"2026-05-09T20:37:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new marginalized reparameterization estimator allows low-variance training of mixture policies in entropy-regularized actor-critic algorithms, matching or exceeding Gaussian policy performance in several continuous control benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Nir Baram, Guy Tennenholtz, and Shie Mannor. Maximum entropy reinforcement learning with mixture policies. arXiv preprint arXiv:2103.10176, 2021. Amrit Singh Bedi, Anjaly Parayil, Junyu Zhang, Mengdi Wang, and Alec Koppel. On the sample complexity and metastability of heavy-tailed policy search in continuous control. Journal of Machine Learning Research, 25(39):1-58, 2024. Yoshua Bengio, Nicholas Léonard, and Aaron Courville. 
Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013. James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang."},{"citing_arxiv_id":"2605.02461","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Middle-mile logistics through the lens of goal-conditioned reinforcement learning","primary_cat":"stat.ML","submitted_at":"2026-05-04T11:04:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Middle-mile logistics is cast as a multi-object goal-conditioned MDP and solved by combining graph neural networks with model-free RL via extraction of small feature graphs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01862","ref_index":152,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL","primary_cat":"cs.LG","submitted_at":"2026-05-03T13:11:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markovian datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.09096","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"When a Robot is More Capable than a Human: Learning from Constrained 
Demonstrators","primary_cat":"cs.RO","submitted_at":"2025-10-10T07:48:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Robots outperform constrained human demonstrations by inferring state-only rewards from demos and using temporal interpolation to label and explore better trajectories, achieving 10x faster task completion on a real robotic arm than behavioral cloning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.03508","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"D2 Actor Critic: Diffusion Actor Meets Distributional Critic","primary_cat":"cs.LG","submitted_at":"2025-10-03T20:47:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"D2AC combines a diffusion actor with a distributional critic via fused distributional RL and clipped double Q-learning to reach state-of-the-art results on 18 hard control benchmarks including Humanoid, Dog, and Shadow Hand.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.21039","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Strict Subgoal Execution: Reliable Long-Horizon Planning in Hierarchical Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2025-06-26T06:35:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SSE improves long-horizon goal-conditioned RL by using failure and partial-success transitions to identify unreliable subgoals, streamline high-level planning, and outperform prior hierarchical methods on 
benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.06347","ref_index":40,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Goal-Conditioned Decision Transformer for Multi-Goal Offline Reinforcement Learning","primary_cat":"cs.RO","submitted_at":"2024-10-08T20:35:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A Goal-Conditioned Decision Transformer is adapted for offline multi-goal RL and shown to outperform online baselines on a new Franka Emika Panda dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2102.04307","ref_index":40,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Learning Optimal Strategies for Temporal Tasks in Stochastic Games","primary_cat":"cs.AI","submitted_at":"2021-02-08T16:10:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Model-free RL learns optimal strategies in stochastic games for LTL specs by constructing a product with DPA and assigning rewards/discounts from acceptance conditions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1910.07113","ref_index":84,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Solving Rubik's Cube with a Robot Hand","primary_cat":"cs.LG","submitted_at":"2019-10-16T00:59:05+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Reinforcement learning models trained only in simulation using automatic domain randomization solve Rubik's cube with a real robot hand.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In this work, we consider two different 
tasks that both use the Shadow Dexterous Hand [99]: the block reorientation task from our previous work [77, 84] and the task of solving a Rubik's cube. Both tasks are visualized in Figure 3. We briefly describe the details of each task in this section. 2.1 Block Reorientation The block reorientation task was previously proposed in [84] and solved on a physical robot hand in [77]. We briefly review it here; please refer to the aforementioned citations for additional details. The goal of the block reorientation task is to rotate a block into a desired goal orientation. For example, in Figure 3a, the desired orientation is shown next to the hand with the red face facing up, the blue face facing to the left and the green"},{"citing_arxiv_id":"1906.09223","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Disentangled Skill Embeddings for Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2019-06-21T16:12:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Disentangled Skill Embeddings (DSE) is a variational inference framework for multi-task RL using shared parameters and task-specific latent embeddings for generalization to unseen conditions and as skills in hierarchical RL.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}