{"total":15,"items":[{"citing_arxiv_id":"2606.17682","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning","primary_cat":"cs.CL","submitted_at":"2026-06-16T08:48:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The LLM-as-Environment-Engineer framework lets the policy model redesign its own RL environments on the new MAPF-FrozenLake testbed, outperforming larger models and fixed baselines with Qwen3-4B.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.11052","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It","primary_cat":"cs.CL","submitted_at":"2026-06-09T16:17:19+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CoT SFT disrupts long-range routing in hybrid models via changes to W_Q and W_K; QK-Restore restores pre-SFT projections to recover NIAH performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18721","ref_index":21,"ref_count":3,"confidence":0.9,"is_internal_anchor":false,"paper_title":"General Preference Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-18T17:50:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GPRL carries a k-dimensional skew-symmetric preference structure into policy updates with per-dimension advantages and a drift monitor, yielding 56.51% length-controlled win rate on AlpacaEval 2.0 from Llama-3-8B-Instruct while outperforming SimPO and SPPO on other benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14220","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Diagnosing Training Inference Mismatch in LLM Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-14T00:27:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Training-inference mismatch in separated rollout and optimization stages of LLM RL can independently cause training collapse.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12474","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Reward Hacking in Rubric-Based Reinforcement Learning","primary_cat":"cs.AI","submitted_at":"2026-05-12T17:54:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Rubric-based RL verifiers can be gamed via partial criterion satisfaction and implicit-to-explicit tricks, yielding proxy gains that do not improve quality under rubric-free judges; stronger verifiers reduce but do not eliminate the mismatch.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"958/. [7] Run-Ze Fan, Zengzhi Wang, and Pengfei Liu. Megascience: Pushing the frontiers of post-training datasets for science reasoning, 2025. URLhttps://arxiv.org/abs/2507.16812. [8] Jiayi Fu, Xuandong Zhao, Chengyuan Yao, Heng Wang, Qi Han, and Yanghua Xiao. Reward shaping to mitigate reward hacking in rlhf.arXiv preprint arXiv:2502.18770, 2025. [9] Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 ofProceedings of Machine Learning Research, pages 10835-10866."},{"citing_arxiv_id":"2605.11865","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Variance-aware Reward Modeling with Anchor Guidance","primary_cat":"stat.ML","submitted_at":"2026-05-12T09:46:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Anchor-guided variance-aware reward modeling uses two response-level anchors to resolve non-identifiability in Gaussian models of pluralistic preferences, yielding provable identification, a joint training objective, and improved RLHF performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08496","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms","primary_cat":"cs.AI","submitted_at":"2026-05-08T21:21:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LPA uses fewer than 100 personality trait statements to train LLMs for harmlessness, matching the robustness of methods using 150k+ harmful examples while generalizing better to new attacks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06036","ref_index":260,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Optimal Transport for LLM Reward Modeling from Noisy Preference","primary_cat":"cs.LG","submitted_at":"2026-05-07T11:26:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SelectiveRM applies optimal transport with a joint consistency discrepancy and partial mass relaxation to produce reward models that optimize a tighter upper bound on clean risk while autonomously dropping noisy preference samples.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04431","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards Robust LLM Post-Training: Automatic Failure Management for Reinforcement Fine-Tuning","primary_cat":"cs.SE","submitted_at":"2026-05-06T02:50:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces the first benchmark for fine-grained failures in reinforcement fine-tuning of LLMs and an automatic management framework that detects, diagnoses, and remediates them.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"further improve failure recovery efficiency through automated monitoring, restart orchestration, and low-cost restoration mechanisms. Another line of work focuses on algorithmic stabilization and reward-aware mitigation within RLHF itself. For instance, prior studies have investigated reward shaping and reward-model redesign to reduce reward hacking [10]- [12], while other work revisits KL regularization and stable optimization in RLHF to better balance alignment quality and training stability [13], [14]. Reinforcement Fine-Tuning LLM Expert πθ Policy (LLM) rφ Reward Model Policy Update riyi xi θt+1=θt + α▽θJ(πθ) Reward KL Entropy Return Stability Response Length …… Observable Signals Inspect Training"},{"citing_arxiv_id":"2604.21268","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding","primary_cat":"cs.LG","submitted_at":"2026-04-23T04:23:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17328","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rethinking the Comparison Unit in Sequence-Level Reinforcement Learning: An Equal-Length Paired Training Framework from Loss Correction to Sample Construction","primary_cat":"cs.LG","submitted_at":"2026-04-19T08:48:46+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.21350","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Factored Causal Representation Learning for Robust Reward Modeling in RLHF","primary_cat":"cs.LG","submitted_at":"2026-01-29T07:18:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A factored causal representation learning method improves robustness of reward models in RLHF by isolating causal factors from biases like length and sycophancy using adversarial gradient reversal.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.19652","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Self-Rewarding Vision-Language Model via Reasoning Decomposition","primary_cat":"cs.CV","submitted_at":"2025-08-27T08:01:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Vision SR1 decomposes VLM reasoning into visual and language components and uses internal self-rewards to improve visual reasoning and reduce hallucinations more efficiently than external-supervision methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.17419","ref_index":170,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From System 1 to System 2: A Survey of Reasoning Large Language Models","primary_cat":"cs.AI","submitted_at":"2025-02-24T18:50:52+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"DeepSeek-GRM [166] Sampling Feedback-guidedSFT & RL Multiple Reasoning TasksInference-Time Scalability RewardAgent [167] Existing Data Feedback-guided SFT & RL NLP Tasks Human & Verifiable Signals PAR [168] Sampling Feedback-guidedSFT & RL NLP Tasks Centered Reward Shaping SCIR [169] Sampling Feedback-guided SFT & RL NLP Tasks Self-Consistency Enforcement MCTS ReST-MCTS∗[170] Sampling Self-training SFT & RL Multiple Reasoning TasksMCTS and Self-training OmegaPRM [171] MCTS with Binary Search Feedback-guided SFT Math Reasoning Divide-and-Conquer MCTS Consensus Filtering [172]MCTS Data ConstructionFeedback-guidedSFT Math Reasoning Consensus Filtering Mechanism ReARTeR [173] Sampling Feedback-guided SFT & RL QA Retrieval-Augmented Generation"},{"citing_arxiv_id":"2502.13957","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Supervising the search process produces reliable and generalizable information-seeking agents","primary_cat":"cs.CL","submitted_at":"2025-02-19T18:56:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Process supervision via RAG-Gym produces more reliable and generalizable search agents, with gains driven by higher-quality queries on out-of-domain multi-hop tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}