{"total":15,"items":[{"citing_arxiv_id":"2606.20881","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"When Do Intrinsic Rewards Work for Code Reasoning? A Comprehensive Study","primary_cat":"cs.AI","submitted_at":"2026-06-18T19:15:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Empirical evaluation on LiveCodeBench shows certainty-based RLIF yields early gains followed by output shortening and reasoning collapse, providing no advantage for RLVR initialization on code tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.17682","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning","primary_cat":"cs.CL","submitted_at":"2026-06-16T08:48:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The LLM-as-Environment-Engineer framework lets the policy model redesign its own RL environments on the new MAPF-FrozenLake testbed, outperforming larger models and fixed baselines with Qwen3-4B.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.11867","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Harnessing Routing Foresight for Micro-step-level MoE load balancing in RL Post-training","primary_cat":"cs.DC","submitted_at":"2026-06-10T09:42:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ForeMoE uses routing foresight from the rollout stage to enable micro-step load balancing in MoE RL post-training via a hierarchical planner and transfer engine, claiming up to 1.45x speedup on 64 
GPUs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.11052","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It","primary_cat":"cs.CL","submitted_at":"2026-06-09T16:17:19+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CoT SFT disrupts long-range routing in hybrid models via changes to W_Q and W_K; QK-Restore restores pre-SFT projections to recover NIAH performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.08088","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ConSteer-RL: Steering Reasoning Capabilities in Large Language Models via Confidence-Aware Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-06-06T10:23:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"ConSteer-RL adds a confidence-aware reward derived from per-token probabilities to GRPO-based RLVR and reports 2.3-4% average gains over baselines across model scales.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.05784","ref_index":29,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TAPO: Tool-Aware Policy Optimization via Credit Transfer for Multimodal Search Agents","primary_cat":"cs.AI","submitted_at":"2026-06-04T07:15:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TAPO corrects credit misassignment in RL for multimodal search agents by using tool parameter similarity to share advantages across equivalent 
actions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30478","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Improving Small Language Models for Code Generation with Reinforcement Learning from Verification Feedback","primary_cat":"cs.SE","submitted_at":"2026-05-28T18:50:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"RLVR with combined unit-test and static-analysis rewards improves pass@1 by up to 13pp on MBPP for 0.6B-1B models, while single-reward variants can induce shorter but less correct outputs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.24375","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Distilling Game Code World Model Generation into Lightweight Large Language Models","primary_cat":"cs.AI","submitted_at":"2026-05-23T03:30:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"SFT followed by RLVR on Qwen2.5-3B-Instruct raises syntactic and execution correctness when generating Game Code World Models across 30 games.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18747","ref_index":101,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Code as Agent Harness","primary_cat":"cs.CL","submitted_at":"2026-05-18T17:59:03+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed open 
challenges.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"runtime feedback rather than pure next-token prediction. Along the same direction, systems such as CYCLE [98] and Self-Edit [99] iteratively revise generated solutions using execution-aware correction signals. Reinforcement learning further strengthens this paradigm by treating execution feedback as an optimization signal over reasoning trajectories. Methods such as CodeRL [100], CodeRL+ [101], and RLTF [102] optimize functional correctness through unit-test-based rewards, while approaches such as StepCoder [103] incorporate fine-grained compiler and runtime feedback during optimization. RLEF [104] formalizes this interaction as policy optimization grounded in multi-step execution feedback, allowing reasoning policies to adapt through iterative runtime interaction."},{"citing_arxiv_id":"2605.15012","ref_index":40,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance","primary_cat":"cs.LG","submitted_at":"2026-05-14T16:12:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"FEST improves RLVR sample efficiency on math and coding benchmarks by combining supervised signals, on-policy signals, and decaying weights on just 128 randomly chosen demonstrations, matching full-dataset baselines.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Large Language Models (LLMs) to the forefront of AI research [41], a new RL paradigm has emerged. Following OpenAI o1 [37] and DeepSeek-R1 [28], Reinforcement Learning with Verifiable Rewards (RLVR) [28] has become the second dominant RL paradigm in the community. 
Unlike RLHF, which assigns rewards based on subjective and often vague human preferences [78], RLVR leverages objective, verifiable rewards, such as unit tests for coding [40] or ground-truth comparisons for mathematics [18]. Consequently, RLVR is exceptionally well-suited for reasoning-heavy tasks. Driven by this approach, state-of-the-art LLMs have attained gold-medal performance in international competitions [34] and are beginning to tackle open problems at the frontiers of human knowledge [61]. Despite its impressive performance, RLVR is beset by a long-standing challenge in reinforcement"},{"citing_arxiv_id":"2605.06111","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Schedule-and-Calibrate: Utility-Guided Multi-Task Reinforcement Learning for Code LLMs","primary_cat":"cs.SE","submitted_at":"2026-05-07T12:24:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ASTOR improves a single code LLM across four tasks by 9.0-9.5% over the best specialist and 7.5-12.8% over prior multi-task RL baselines via utility-driven data scheduling and adaptive KL regularization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16995","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SPS: Steering Probability Squeezing for Better Exploration in Reinforcement Learning for Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-04-18T13:49:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SPS interleaves RL and IRL to counteract probability squeezing in LLM reasoning trajectories, improving Pass@k on five benchmarks while identifying an empirical upper bound on multi-sample 
performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"sequence $y^-$ receives a negative gradient update under the training objective, effectively reducing its log-probability. This corresponds to a logit-space update of the form $\\log p(y^-) \\leftarrow \\log p(y^-) + \\eta$, $\\eta < 0$ (14). At the sequence level, the normalized model distribution over all candidate sequences $Y$ may be represented as $P(y) = \\frac{\\exp(\\log p(y))}{\\sum_{y' \\in Y} \\exp(\\log p(y'))}$ (15). After the update to $y^-$, the new distribution becomes $P'(y) = \\frac{\\exp(\\log p(y))}{\\exp(\\log p(y^-) + \\eta) + \\sum_{y' \\neq y^-} \\exp(\\log p(y'))}$ (16). For any $y \\neq y^-$, we obtain $P'(y) = \\frac{P(y)}{1 + P(y^-)(e^{\\eta} - 1)}$ (17). If the penalized sequence is already extremely unlikely, i.e. $P(y^-) \\ll 1$ (18), then a first-order expansion yields $P'(y) \\approx P(y)\\left[1 - P(y^-)(e^{\\eta} - 1)\\right]$ (19). Since $\\eta < 0$ implies $e^{\\eta} - 1 < 0$, it follows that"},{"citing_arxiv_id":"2604.02709","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy","primary_cat":"cs.CL","submitted_at":"2026-04-03T04:06:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLMs display clear performance stratification on formal language tasks aligned with Chomsky hierarchy complexity levels, limited by severe efficiency barriers rather than absolute capability.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Wenpin Jiao, Fei Huang, Yongbin Li, and Ge Li. 2025. CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment. CoRR abs/2510.18471 (2025). [30] Xue Jiang, Yihong Dong, Yongding Tao, Huanyu Liu, Zhi Jin, and Ge Li. 2025. ROCODE: Integrating Backtracking Mechanism and Program Analysis in Large Language Models for Code Generation. In ICSE. IEEE, 334-346. 
[31] Xue Jiang, Yihong Dong, Lecheng Wang, Zheng Fang, Qiwei Shang, Ge Li, Zhi Jin, and Wenpin Jiao. 2024. Self-Planning Code Generation with Large Language Models. ACM Trans. Softw. Eng. Methodol. 33, 7 (2024), 182:1-182:30. [32] Xue Jiang, Jiaru Qian, Xianjie Shi, Chenjie Li, Hao Zhu, Ziyu Wang, Jielun Zhang, Zheyu Zhao, Kechi Zhang, Jia Li, Wenpin Jiao, Zhi Jin, Ge Li, and Yihong Dong."},{"citing_arxiv_id":"2604.01799","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TestDecision: Sequential Test Suite Generation via Greedy Optimization and Reinforcement Learning","primary_cat":"cs.SE","submitted_at":"2026-04-02T09:13:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"By proving test suite coverage is monotone submodular and training LLMs with RL to maximize marginal gains, TestDecision improves branch coverage 38-52% and bug detection up to 95% over base models on ULT and LiveCodeBench.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"PRLCoder [65] further advances this by introducing process-supervised RL to guide the step-by-step reasoning of code synthesis. In the broader scope, RL has also been applied to program fuzzing, where CovRL [13] learns to mutate inputs to maximize feedback signals. Monotone submodularity has been successfully applied to a wide range of AI and ML problems, such as influence maximization in social networks [26, 27], sensor placement and information-gathering in graphical models [29], data subset selection and core-set construction [38], and document summarization [33] with diversity and coverage constraints. 
9 Conclusion and Future Work In this paper, we tackle the coverage plateau in automated test generation by fundamentally"},{"citing_arxiv_id":"2603.29957","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Think Anywhere in Code Generation","primary_cat":"cs.SE","submitted_at":"2026-03-31T16:24:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Think-Anywhere lets LLMs invoke on-demand reasoning at any token during code generation via cold-start imitation followed by outcome-based RL, reaching state-of-the-art results on LeetCode, LiveCodeBench, HumanEval, and MBPP.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}