{"total":13,"items":[{"citing_arxiv_id":"2606.06556","ref_index":164,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Robots Need More than VLA and World Models","primary_cat":"cs.RO","submitted_at":"2026-06-04T10:43:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper identifies four missing interfaces (data autolabelling, embodiment retargeting, physics-grounded world models, and video-based reward inference) as the central bottleneck beyond VLA scaling for robot intelligence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.31268","ref_index":67,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Mellum2 Technical Report","primary_cat":"cs.CL","submitted_at":"2026-05-29T13:01:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Mellum 2 is a 12B MoE model with 2.5B active parameters, trained on 10.6T tokens with MoE, GQA, SWA, and MTP, then post-trained into Instruct and Thinking variants, claimed competitive with 4B-14B models at 2.5B compute.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30788","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"XLGoBench: Detecting cross-lingual skill gaps with algorithmic tasks","primary_cat":"cs.CL","submitted_at":"2026-05-29T03:25:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"XLGoBench applies algorithmic tasks to expose persistent cross-lingual performance gaps in state-of-the-art 
LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18261","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Knowledge-to-Verification: Exploring RLVR for LLMs in Knowledge-Intensive Domains","primary_cat":"cs.CL","submitted_at":"2026-05-18T11:59:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"K2V extends RLVR to knowledge-intensive domains by synthesizing verifiable data and verifying reasoning processes, yielding improved domain reasoning with preserved general capabilities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12484","ref_index":57,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning, Fast and Slow: Towards LLMs That Adapt Continually","primary_cat":"cs.LG","submitted_at":"2026-05-12T17:58:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard RL in continual LLM learning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[55] Hongyao Tang, Johan Obando-Ceron, Pablo Samuel Castro, Aaron Courville, and Glen Berseth. Mitigating plasticity loss in continual reinforcement learning by reducing churn, 2025. URL https://arxiv.org/abs/2506.00592. 1, 5, 6, 9 [56] Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent workflow memory, 2024. URL https://arxiv.org/abs/2409.07429. 9 [57] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 
Chain-of-thought prompting elicits reasoning in large language models, 2023. URL https://arxiv.org/abs/2201.11903. 1 [58] Yuxin Wen, Neel Jain, John Kirchenbauer, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Hard prompts made easy: Gradient-based discrete optimization for prompt tuning and discovery, 2023."},{"citing_arxiv_id":"2605.08401","ref_index":60,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AIPO: Learning to Reason from Active Interaction","primary_cat":"cs.CL","submitted_at":"2026-05-08T19:06:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AIPO adds active multi-agent consultation (Verify, Knowledge, Reasoning agents) plus custom importance sampling to RLVR training so LLMs expand their reasoning boundary and then operate without the agents.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"After training, the policy model reasons independently without relying on external collaborators, having internalized the knowledge and reasoning skills acquired through interaction. We conduct extensive experiments on diverse reasoning benchmarks, including AIME24, AIME25, MATH500 [22], LiveMathBench [40], GPQA-Diamond [53], MBPP [2], LiveCodeBench [30], and Reasoning-Gym [60]. The results demonstrate that AIPO consistently outperforms competitive baselines and achieves robust gains on both in-domain and out-of-domain evaluations. Further experiments demonstrate that AIPO generalizes across different policy models and collaborator backbones, including the Qwen [ 71] and Llama [16] families. We also show that AIPO remains"},{"citing_arxiv_id":"2605.06638","ref_index":93,"ref_count":3,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Can RL Teach Long-Horizon Reasoning to LLMs? 
Expressiveness Is Key","primary_cat":"cs.AI","submitted_at":"2026-05-07T17:48:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RL training compute for logical reasoning follows a power law with horizon depth whose exponent rises with logical expressiveness, yielding better downstream transfer when models train on richer logics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17842","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"QuickScope: Certifying Hard Questions in Dynamic LLM Benchmarks","primary_cat":"cs.CL","submitted_at":"2026-04-20T05:51:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"QuickScope uses modified COUP Bayesian optimization to find truly difficult questions in dynamic LLM benchmarks more sample-efficiently than baselines while cutting false positives.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02909","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Delay, Plateau, or Collapse: Evaluating the Impact of Systematic Verification Error on RLVR","primary_cat":"cs.LG","submitted_at":"2026-04-06T15:02:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Systematic false positives in verifiers can cause RLVR training to reach suboptimal plateaus or collapse, with outcomes driven by error patterns rather than overall error rate.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"[Figure panels: Oracle Reward and Global FPR versus Step for (a) OLMo and (b) Qwen] Figure 5: Asymmetric 
relative error results. The training outcomes for asymmetric intervals are worse than the symmetric interval, although the intervals are strictly tighter. [Figure panels: Oracle Reward and Global FNR versus Step for (a) OLMo and (b) Qwen; legend: Clean, FN if no \"\\[\", FN if English] Figure 6: Language-based FN results. For comparison, we also include a format FN, where the verifier gives a false negative if the answer does not contain \"\\[\". Results. In Figure 4, we plot the relationship between a pattern's initial frequency, its conditional advantage, and the final reward. The results support our hypothesis. When a trigger pattern is frequent at initialization, its effect depends on its conditional advantage. Frequent patterns with negative conditional advantage collapse because rewarding them reinforces behavior that is actively misaligned with the task. By contrast, frequent patterns with positive conditional advantage tend to produce a plateau rather than a collapse. Although rewarding these triggers mis-specifies the objective, the induced behavior remains partially aligned with the task. When a pattern is rare at initialization, its influence is usually weaker, but conditional advantage still matters. Rare patterns with positive conditional advantage can still be amplified because they align with the oracle reward and are learned alongside correct behavior. As such, they become more frequent during training and the oracle reward plateaus. 
By contrast, rare patterns with negative conditional advantage become less frequent during training because they are not reinforced by the oracle reward, so they tend to move further left in the plot, and therefore do not"},{"citing_arxiv_id":"2603.15432","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Gym-V: A Unified Vision Environment System for Agentic Vision Research","primary_cat":"cs.CV","submitted_at":"2026-03-16T15:37:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Gym-V supplies 179 visual environments showing that observation scaffolding like captions and rules matters more for training success than the choice of RL algorithm.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.04809","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SCALER:Synthetic Scalable Adaptive Learning Environment for Reasoning","primary_cat":"cs.AI","submitted_at":"2026-01-08T10:42:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SCALER creates adaptive synthetic environments for RL-based LLM reasoning training that outperforms fixed-dataset baselines with more stable long-term progress.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.20814","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SPHINX: A Synthetic Environment for Visual Perception and Reasoning","primary_cat":"cs.CV","submitted_at":"2025-11-25T20:00:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SPHINX generates synthetic visual puzzles for benchmarking LVLMs, where GPT-5 scores 51.1% and RLVR training improves both in-domain and 
external visual reasoning performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Synthetic and procedural benchmarks. Procedural datasets address these limitations by enabling controlled variation: CVR [70], A-I-RAVEN, I-RAVEN-Mesh [33], and NTSEBench [38] expand RPM-style designs; IconQA [29], VisuLogic [62], and Visual Riddles [4] generate diagrammatic and abstraction-focused puzzles. Broader synthetic environments such as Reasoning Gym [47], Enigmata [9], and UniBench [1] provide scalable generator-verifier frameworks. SPHINX extends this line of work by offering a diverse suite of procedurally generated tasks, each paired with deterministic verifiers for reliable and reproducible evaluation. Reinforcement learning for visual reasoning. RL with verifiable rewards (RLVR) has shown promise for im-"},{"citing_arxiv_id":"2505.11737","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TokUR: Token-Level Uncertainty Estimation for Large Language Model Reasoning","primary_cat":"cs.LG","submitted_at":"2025-05-16T22:47:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TokUR estimates token-level uncertainty via low-rank weight perturbations in LLMs, aggregates signals to correlate with correctness, and uses them to improve reasoning performance on math tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}