{"total":12,"items":[{"citing_arxiv_id":"2605.23551","ref_index":97,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Goal-Conditioned Agents that Learn Everything All at Once","primary_cat":"cs.LG","submitted_at":"2026-05-22T12:17:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LEO enables efficient all-goals learning in goal-conditioned RL by jointly predicting for all goals in one network pass, yielding >250x speedup over relabelling and better performance on Craftax.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22693","ref_index":7,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Scout-Assisted Planning for Heterogeneous Robot Teams under Partially Known Environments","primary_cat":"cs.RO","submitted_at":"2026-05-21T16:36:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Scout-Assisted Planning uses UAV scouts and a GNN to predict information gain for pruning actions, cutting UGV travel costs by 31.9-37.7% versus the Canadian Traveler Problem baseline in partially known environments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17431","ref_index":27,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"MATE: Solving Contextual Markov Decision Processes with Memory of Accumulated Transition Embeddings","primary_cat":"cs.LG","submitted_at":"2026-05-17T12:52:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MATE uses permutation-invariant sum-aggregated memory of transition embeddings to solve CMDPs with online adaptation and computational advantages over Transformers and 
RNNs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16054","ref_index":56,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Ada-Diffuser: Latent-Aware Adaptive Diffusion for Decision-Making","primary_cat":"cs.LG","submitted_at":"2026-05-15T15:21:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Ada-Diffuser is a causal diffusion model that jointly learns observed interaction structure and underlying latent dynamics from minimal observations for adaptive planning and policy learning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"initial state or other guidance signals y (e.g., goals, rewards): ˆε↓ pω(ε|s0,y). II. Diffusion Policy: In contrast to diffusion planners, Diffusion Policy methods directly parameterize the policy ωω(a|s) using diffusion models. For example, Diffusion Policy [55] uses a diffusion model to generate actions with expressive multimodal distributions. DPPO [56] extends this idea by modeling a two-layer MDP structure, which enables fine-tuning of diffusion-based policies in RL settings. 
Another line of work integrates diffusion models with value-based methods (e.g., Q-learning), to generate multimodal action distributions guided by learned value functions, such as Diffusion-QL [57], IDQL [58], CPQL [59], CEP [60], and DWM [61]."},{"citing_arxiv_id":"2605.10816","ref_index":40,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Policy Gradient Methods for Non-Markovian Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-11T16:34:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces the Agent State-Markov Policy Gradient (ASMPG) algorithm and a policy gradient theorem for non-Markovian decision processes by jointly optimizing agent state dynamics and control policy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08754","ref_index":28,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Value-Decomposed Reinforcement Learning Framework for Taxiway Routing with Hierarchical Conflict-Aware Observations","primary_cat":"cs.AI","submitted_at":"2026-05-09T07:32:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CaTR applies value-decomposed RL with hierarchical conflict-aware observations to achieve better safety-efficiency trade-offs than planning, optimization, and standard RL baselines in a realistic airport taxiway simulation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08406","ref_index":10,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Effective Explanations Support Planning Under 
Uncertainty","primary_cat":"cs.CL","submitted_at":"2026-05-08T19:12:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Explanations scored higher by an LLM-plus-planner model are judged more helpful by people and produce measurably better navigation performance in uncertain environments than lower-scored or no explanations.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ing helpfulness from the full model's utility score, a length-only score, and a direct-action score entered simultaneously, with random intercepts for participants and maps. The utility score uniquely predicted higher helpfulness, β UTIL = 38.47, 95% CI = [36.24, 40.17]. In contrast, the length-only and direct-action predictors were weaker (β LEN = .85, 95% CI = [−1.23, 2.94], β DIRECT = .08, 95% CI = [−2.10, 2.28]). Fig. 4a shows helpfulness as a function of utility, with the partial regression line from the joint model holding the other predictors at their means. Overall, helpfulness reflects more than brevity: participants preferred explanations that the utility model predicts will support reliable and efficient guidance. 
Leave-one-component-out ablations further showed"},{"citing_arxiv_id":"2605.05373","ref_index":47,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Neural Co-state Policies: Structuring Hidden States in Recurrent Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-06T18:53:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Recurrent RL policies can have their hidden states aligned with PMP co-states through a derived loss, yielding robust performance on partially observable control tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03413","ref_index":216,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Learning to Theorize the World from Observation","primary_cat":"cs.LG","submitted_at":"2026-05-05T06:39:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"NEO is a probabilistic neural model that induces compositional programs as a learned Language of Thought from non-textual observations and executes them via a shared transition model to enable explanation-driven generalization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18847","ref_index":10,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Human-Guided Harm Recovery for Computer Use Agents","primary_cat":"cs.AI","submitted_at":"2026-04-20T21:12:40+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.00432","ref_index":69,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Does Math Reasoning Improve General LLM Capabilities? 
Understanding Transferability of LLM Reasoning","primary_cat":"cs.AI","submitted_at":"2025-07-01T05:23:05+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2310.06114","ref_index":119,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Learning Interactive Real-World Simulators","primary_cat":"cs.AI","submitted_at":"2023-10-09T19:42:22+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}