{"total":15,"items":[{"citing_arxiv_id":"2606.03980","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill","primary_cat":"cs.LG","submitted_at":"2026-06-02T17:56:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Skill-RM unifies heterogeneous reward criteria by modeling reward computation as dynamic execution of a reusable Reward-Evaluation Skill within an agent framework.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01091","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Deep Research as Rubric for Reinforcement Learning","primary_cat":"cs.CL","submitted_at":"2026-05-31T08:25:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DR-rubric is a two-stage framework using iterative agentic search to generate atomic verifiable constraints for GRPO-based RL, achieving competitive performance on 6 benchmarks with 1K-3K examples via bootstrap or frontier-model rubrics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30244","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Reinforcement Learning with Robust Rubric Rewards","primary_cat":"cs.CV","submitted_at":"2026-05-28T17:11:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"RLR³ extends RLVR to criterion-level rubric verification via dual execution paths, minimal exposure masking, hierarchical aggregation, and saturation mitigation, delivering 4.7-point gains over base on 15 benchmarks with 
Qwen3-VL-30B-A3B.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29156","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains","primary_cat":"cs.LG","submitted_at":"2026-05-27T22:46:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RUBRIC-ARROW is an alternating rubric generator and judge framework that uses probability-based scoring and pairwise preferences to improve pointwise reward modeling accuracy for LLM post-training in non-verifiable domains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27865","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MERIT: Matching Expertise via Rubric-Informed Training for Reviewer Assignment","primary_cat":"cs.CL","submitted_at":"2026-05-27T02:26:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MERIT trains a small reviewer assessor via rubric-guided RL with LLM rewards and distills it to a SOTA embedding retriever for paper-reviewer matching on LR-Bench and CMU Gold.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28882","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GrowLoop: Self-Evolving Conversation Evaluation Seeded by Human","primary_cat":"cs.CL","submitted_at":"2026-05-26T16:53:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GrowLoop proposes a human-seeded self-evolving framework that co-evolves rubrics and cases to evaluate conversational human-likeness with 
differentiated agreement rules.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23590","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents","primary_cat":"cs.AI","submitted_at":"2026-05-22T12:59:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Co-ReAct adds step-level rubric guidance to ReAct agents via a GRPO-trained generator using list-wise ranking rewards, yielding consistent gains on DeepResearchBench and SQA-CS-V2.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20164","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR","primary_cat":"cs.AI","submitted_at":"2026-05-19T17:50:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"POW3R adapts rubric criterion weights via rollout contrast in RLVR to improve mean reward, strict completion rates, and training speed over static rubric aggregation on multimodal and text tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13641","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization","primary_cat":"cs.LG","submitted_at":"2026-05-13T15:05:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RDPO applies magnitude-aware quantile normalization and Mahalanobis whitening to decorrelate heterogeneous rewards in multi-objective RL, improving instruction 
following and writing quality on LongCat-Flash post-training while staying competitive on reasoning and coding.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09269","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification","primary_cat":"cs.CL","submitted_at":"2026-05-10T02:32:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DeltaRubric decomposes multimodal preference evaluation into self-generated planning and verification steps within a single model, producing large accuracy improvements on VL-RewardBench via multi-role reinforcement learning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Stable and efficient single-rollout rl for multimodal reasoning. arXiv preprint arXiv:2512.18215, 2025. [24] Rui Liu, Dian Yu, Tong Zheng, Runpeng Dai, Zongxia Li, Wenhao Yu, Zhenwen Liang, Linfeng Song, Haitao Mi, Pratap Tokekar, et al. Vogue: Guiding exploration with visual uncertainty improves multimodal reasoning. arXiv preprint arXiv:2510.01444, 2025. [25] Tianci Liu, Ran Xu, Tony Yu, Ilgee Hong, Carl Yang, Tuo Zhao, and Haoyu Wang. Openrubrics: Towards scalable synthetic rubric generation for reward modeling and llm alignment. arXiv preprint arXiv:2510.07743, 2025. [26] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. 
In International Conference on Learning Representations, 2019."},{"citing_arxiv_id":"2605.07461","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance","primary_cat":"cs.CL","submitted_at":"2026-05-08T09:08:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Think-with-Rubrics has LLMs generate rubrics internally before responding, outperforming external rubric-as-reward baselines by 3.87 points on average across benchmarks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"By decomposing task quality into a structured set of interpretable criteria, this approach provides finer-grained supervisory signals and enhanced interpretability compared to standard, holistic LLM-as-a-judge methods. Benchmarks such as JudgeBench [16] and FireBench [25] inherently utilize rubrics to assess model performance. Concurrently, works including Openrubrics [6] and RubricHub [4] concentrate on designing automated rubric generation frameworks, facilitating the construction of corresponding prompt-rubric paired datasets. Our research systematically builds upon the rubric paradigms and datasets established by these preceding efforts to further advance this domain. 
Rubric-based reward. Building upon rubric-based evaluation, an emerging line of work transforms"},{"citing_arxiv_id":"2605.07396","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rubric-based On-policy Distillation","primary_cat":"cs.LG","submitted_at":"2026-05-08T07:52:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance and up to 10x better sample efficiency than logit-based approaches.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"has achieved significant breakthroughs in reasoning [19], yet its reliance on binary outcomes often restricts it to deterministic domains. To bridge this gap, structured rubrics have been introduced to decompose quality into fine-grained dimensions for open-ended tasks. While RaR [23] and OpenRubrics [33] focused on formalizing instance-specific rewards, Rubicon [34] addressed the \"seesaw effects\" between conflicting criteria. More recent works like RLER [35] and SibylSense [36] have pioneered evolving rubrics grounded in search evidence or adversarial memory to capture emergent behaviors. While prior work treats rubrics as evaluation instruments, ROPD repurposes them as a dynamic distillation interface. 
6 Limitation and Future Work"},{"citing_arxiv_id":"2604.13602","ref_index":146,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges","primary_cat":"cs.LG","submitted_at":"2026-04-15T08:11:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"language model post-training. arXiv preprint arXiv:2509.21500, 2025. [145] Yun He, Wenzhe Li, Hejia Zhang, Songlin Li, Karishma Mandyam, Sopan Khosla, Yuanhao Xiong, Nanshu Wang, Xiaoliang Peng, Beibin Li, et al. Advancedif: Rubric-based benchmarking and reinforcement learning for advancing llm instruction following. arXiv preprint arXiv:2511.10507, 2025. [146] Kwangwook Seo and Dongha Lee. P-check: Advancing personalized reward model via learning to generate dynamic checklist, 2026. URL https://arxiv.org/abs/2601.02986. [147] Manasi Sharma, Chen Bo Calvin Zhang, Chaithanya Bandi, Clinton Wang, Ankit Aich, Huy Nghiem, Tahseen Rabbani, Ye Htet, Brian Jang, Sumana Basu, et al. 
Researchrubrics: A benchmark of prompts and rubrics for"},{"citing_arxiv_id":"2604.13029","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Visual Preference Optimization with Rubric Rewards","primary_cat":"cs.CV","submitted_at":"2026-04-14T17:58:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"rDPO uses offline-built rubrics to generate on-policy preference data for DPO, raising benchmark scores in visual tasks over outcome-based filtering and style baselines.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"in LLM evaluation and alignment. For evaluation, customized rubrics have proven essential for complex, multi-dimensional tasks [35, 36, 44, 45, 46]. In post-training, frameworks such as Rubrics-as-Rewards [47], Rubicon [48], and CARMO [37] use rubrics to provide structured reward signals. To generate discriminative rubrics, methods like RLCF [38], OpenRubrics [49], and Rubrichub [50] employ contrastive generation. Dynamic refinement processes are introduced by OnlineRubrics [51], Auto-Rubric [52], and Rubric-ARM [53] to mitigate reward over-optimization. RuscaRL [39] further introduces rubrics for both reward modeling and hint-guided rollout generation. 
In the multimodal domain, rubric-based alignment remains in its early stages."},{"citing_arxiv_id":"2603.27977","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology","primary_cat":"cs.AI","submitted_at":"2026-03-30T02:54:48+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}