{"total":15,"items":[{"citing_arxiv_id":"2605.20745","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Hidden Signal of Verifier Strictness: Controlling and Improving Step-Wise Verification via Selective Latent Steering","primary_cat":"cs.LG","submitted_at":"2026-05-20T05:48:16+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VerifySteer selectively steers hidden states at paragraph boundaries using latent correctness signals to control verifier strictness and outperform baselines on ProcessBench and Hard2Verify with lower compute.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01831","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RMGAP: Benchmarking the Generalization of Reward Models across Diverse Preferences","primary_cat":"cs.CL","submitted_at":"2026-05-03T11:45:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RMGAP benchmark shows state-of-the-art reward models reach at most 49.27% Best-of-N accuracy when forced to select responses matching diverse preferences.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13602","ref_index":133,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges","primary_cat":"cs.LG","submitted_at":"2026-04-15T08:11:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals 
under optimization pressure.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"By formulating reward prediction as a generate-then-judge process [89] or leveraging test-time compute to improve reliability [131], these models externalize their evaluation logic. However, outcome-only supervision for generative reward models is insufficient, as they may produce correct numerical scores based on unsound rationales. Therefore, methods like RM-NLHF [132] and Rationale-RM [133] strictly enforce rationale-level alignment to ensure the reward model judges for the right reasons. A related failure mode appears in tool-augmented and RLVR settings, where outcome-only rewards may treat a correct final answer as sufficient evidence of successful reasoning, inadvertently rewarding guessing or reasoning-answer"},{"citing_arxiv_id":"2604.11626","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time","primary_cat":"cs.AI","submitted_at":"2026-04-13T15:38:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RationalRewards recovers rationales from preference data via PARROT to create a critique-first reward model that improves visual generators at both training time through RL and test time through prompt refinement, matching RL fine-tuning performance while using far less data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11611","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Utilizing and Calibrating Hindsight Process Rewards via Reinforcement with Mutual Information 
Self-Evaluation","primary_cat":"cs.CL","submitted_at":"2026-04-13T15:18:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MISE proves that hindsight self-evaluation rewards equal minimizing mutual information plus KL divergence to a proxy policy, and experiments show 7B LLMs reaching GPT-4o-level results on validation tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10701","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-04-12T15:54:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GenAC introduces generative critics with chain-of-thought reasoning and in-context conditioning to improve value approximation and downstream RL performance in LLMs compared to value-based and value-free baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07506","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ReflectRM: Boosting Generative Reward Models via Self-Reflection within a Unified Judgment Framework","primary_cat":"cs.AI","submitted_at":"2026-04-08T18:46:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ReflectRM improves generative reward models by adding self-reflection on analysis quality within a unified training setup for response and analysis preferences, yielding accuracy gains and reduced positional bias on 
benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07484","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ConsistRM: Improving Generative Reward Models via Consistency-Aware Self-Training","primary_cat":"cs.AI","submitted_at":"2026-04-08T18:25:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ConsistRM improves generative reward models via consistency-aware self-training, outperforming vanilla RFT by 1.5% on average across five benchmarks and four base models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05517","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"UniCreative: Unifying Long-form Logic and Short-form Sparkle via Reference-Free Reinforcement Learning","primary_cat":"cs.AI","submitted_at":"2026-04-07T07:15:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UniCreative uses reference-free RL with an adaptive constraint-aware reward model to unify long-form coherence and short-form creativity in AI writing, producing an emergent ability to switch between planning and direct generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16335","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Beyond Verifiable Rewards: Rubric-Based GRM for Reinforced Fine-Tuning SWE Agents","primary_cat":"cs.LG","submitted_at":"2026-03-13T02:23:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A rubric-based generative reward model improves reinforced fine-tuning of SWE agents by supplying richer behavioral guidance 
than binary terminal rewards alone.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.24235","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PaTaRM: Bridging Pairwise and Pointwise Signals via Preference-Aware Task-Adaptive Reward Modeling","primary_cat":"cs.LG","submitted_at":"2025-10-28T09:43:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PaTaRM converts pairwise preference data into pointwise reward signals via a novel PAR mechanism and task-adaptive rubrics, reporting 8.7% gains on RewardBench/RMBench and 13.6% relative RLHF improvement.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.14232","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models","primary_cat":"cs.LG","submitted_at":"2025-10-16T02:19:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GenCluster scales test-time compute via large-scale generation, behavioral clustering, ranking, and round-robin submission to achieve IOI gold medal performance with the open-weight gpt-oss-120b model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.00084","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning to Refine: Self-Refinement of Parallel Reasoning in LLMs","primary_cat":"cs.LG","submitted_at":"2025-08-27T06:51:48+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GSR jointly trains LLMs to generate candidate solutions and refine a superior final answer from them, achieving 
state-of-the-art performance on five mathematical benchmarks while transferring across model scales.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.01937","ref_index":60,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RewardBench 2: Advancing Reward Model Evaluation","primary_cat":"cs.CL","submitted_at":"2025-06-02T17:54:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RewardBench 2 is a new benchmark that supplies challenging fresh human prompts for reward model evaluation, yielding lower average scores but higher correlation with downstream best-of-N sampling and RLHF training performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.07062","ref_index":87,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Seed1.5-VL Technical Report","primary_cat":"cs.CV","submitted_at":"2025-05-11T17:28:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[85] Ruilin Luo, Zhuofan Zheng, Yifan Wang, Yiyao Yu, Xinzhe Ni, Zicheng Lin, Jin Zeng, and Yujiu Yang. Ursa: Understanding and verifying chain-of-thought reasoning in multimodal mathematics. arXiv preprint arXiv:2501.04686, 2025. [86] Dakota Mahan, Duy Van Phung, Rafael Rafailov, Chase Blagden, Nathan Lile, Louis Castricato, Jan-Philipp Fränken, Chelsea Finn, and Alon Albalak. Generative reward models. arXiv preprint arXiv:2410.12832, 2024. [87] Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. 
Egoschema: A diagnostic benchmark for very long-form video language understanding. Advances in Neural Information Processing Systems, 36:46212-46244, 2023. [88] Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning."}],"limit":50,"offset":0}