{"total":10,"items":[{"citing_arxiv_id":"2605.23551","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Goal-Conditioned Agents that Learn Everything All at Once","primary_cat":"cs.LG","submitted_at":"2026-05-22T12:17:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LEO enables efficient all-goals learning in goal-conditioned RL by jointly predicting for all goals in one network pass, yielding >250x speedup over relabelling and better performance on Craftax.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21800","ref_index":69,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"stable-worldmodel: A Platform for Reproducible World Modeling Research and Evaluation","primary_cat":"cs.LG","submitted_at":"2026-05-20T22:58:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper presents stable-worldmodel (swm), a platform with high-performance data layer, modern world model baselines, planning solvers, and extended environments for reproducible research and generalization evaluation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09364","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Multi-scale Predictive Representations for Goal-conditioned Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-10T06:27:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Ms.PR applies multi-scale predictive supervision to enforce goal-directed alignment in latent spaces for offline GCRL, yielding improved representation quality and performance on vision and state-based 
tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"condition is strictly necessary for a specific performance regime, collectively mitigating value overestimation and yielding a temporally-grounded, high-rank latent space. 2 RELATED WORKS Dynamics-based Representation Learning. Leveraging system dynamics is a foundational approach for shaping representations in complex, partially observable environments [22, 32]. Both auxiliary model-free tasks [2, 11, 26, 34] and latent world models [13, 14, 15, 16, 17, 33, 35] rely on forward prediction to force the agent to understand state transitions and action effects. Recently, MRQ [10] explicitly leveraged this predictive signal to construct highly effective latent spaces for standard Q-learning in dense-reward settings. However, because these formulations are inherently task-agnostic, they satisfy only dynamical alignment, and transitional reward modelling."},{"citing_arxiv_id":"2605.07278","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Predictive but Not Plannable: RC-aux for Latent World Models","primary_cat":"cs.LG","submitted_at":"2026-05-08T05:43:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RC-aux corrects spatiotemporal mismatch in reconstruction-free latent world models by adding multi-horizon prediction and reachability supervision, improving planning performance on goal-conditioned pixel-control tasks.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"conditioned control tasks. Local LeWM-family results are reported as mean±std over five fixed evaluation groups. The matched ∆ row compares RC-aux against LeWM-cont when available and against LeWM for Wall. 
Method TwoRoom Reacher Push-T Wall Cube DINO-WM [47] 100.0 79.0 74.0 96.0 86.0 PLDM [38] 97.0 78.0 78.0 - 65.0 DINO-WM+prop [47] - - 92.0 - - GCBC [13] 100.0 - 75.0 - 84.0 IQL [23] 100.0 - 20.0 - 64.0 IVL [33] 100.0 - 33.0 - 56.0 LeWM [29] 88.8±3.0 81.2±7.9 90.4±3.0 50.4±6.5 72.4±5.9 LeWM-cont [29] 88.8±3.0 82.8±7.2 91.2±3.9 - 72.8±5.2 RC-aux 98.0±1.4 87.2±6.4 90.8±3.3 83.6±3.6 76.0±7.5 Matched ∆ +9.2 +4.4 -0.4 +33.2 +3.2 Table 1 reports success rates on the five pixel-based goal-conditioned control tasks."},{"citing_arxiv_id":"2605.03075","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Refining Compositional Diffusion for Reliable Long-Horizon Planning","primary_cat":"cs.RO","submitted_at":"2026-05-04T18:44:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RCD steers compositional diffusion sampling toward high-density coherent plans by combining reconstruction-error guidance with overlap consistency, outperforming prior methods on locomotion, manipulation, and pixel-based long-horizon tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22724","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GCImOpt: Learning efficient goal-conditioned policies by imitating optimal trajectories","primary_cat":"cs.RO","submitted_at":"2026-04-24T17:23:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GCImOpt trains compact goal-conditioned neural policies by imitating efficiently generated optimal trajectories, achieving high success rates and near-optimal performance on cart-pole, quadcopter, and robot arm tasks while running thousands of times faster than optimization 
solvers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11137","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Answers to Arguments: Toward Trustworthy Clinical Diagnostic Reasoning with Toulmin-Guided Curriculum Goal-Conditioned Learning","primary_cat":"cs.AI","submitted_at":"2026-04-13T07:49:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CGCL progressively trains LLMs to generate Toulmin-structured clinical diagnostic arguments across three curriculum stages, achieving accuracy and reasoning quality comparable to RL methods with improved stability and efficiency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08960","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Efficient Hierarchical Implicit Flow Q-learning for Offline Goal-conditioned Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-04-10T05:04:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Proposes mean flow policies and LeJEPA loss to overcome Gaussian policy limits and weak subgoal generation in hierarchical offline GCRL, reporting strong results on OGBench state and pixel tasks.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"tasks spanning both navigation and stitching scenarios, as well as 4 pixel-based sparse-reward navigation tasks. Baselines. We compare our method against six representative offline GCRL baselines included in OGBench, covering imitation-based, value-based, metric-based, contrastive, and hierarchical approaches. 
• Goal-conditioned behavioral cloning (GCBC) [18] is a simple imitation learning baseline that directly learns a goal-conditioned policy by mimicking actions from the offline dataset. • Goal-conditioned implicit V-learning (GCIVL) and goal-conditioned implicit Q-learning (GCIQL) [21, 6] estimate the goal-conditioned optimal value function using IQL-style expectile regression. Policies are then extracted using advantage-"},{"citing_arxiv_id":"2603.19312","ref_index":50,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels","primary_cat":"cs.LG","submitted_at":"2026-03-13T19:48:14+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2106.01345","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Decision Transformer: Reinforcement Learning via Sequence Modeling","primary_cat":"cs.LG","submitted_at":"2021-06-02T17:53:39+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Decision Transformer casts RL as autoregressive sequence modeling conditioned on desired returns, past states and actions, matching or exceeding offline RL baselines on Atari, Gym and Key-to-Door tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}