pith. sign in

arxiv: 2606.03968 · v1 · pith:MGJIUH32new · submitted 2026-06-02 · 💻 cs.CL · cs.AI

QUBRIC: Co-Designing Queries and Rubrics for RL Beyond Verifiable Rewards

Pith reviewed 2026-06-28 10:25 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords QUBRICrubric-based RLquery co-designreinforcement learningopen-ended queriesArenaHardGRPO traininginstruction following
0
0 comments X

The pith

Co-designing queries and rubrics enables rubric-based RL to work on open-ended tasks without verifiable rewards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies that rubric quality in reinforcement learning is limited by query structure: open-ended queries produce vague rubrics, and attempts to narrow them often insert fabricated references that no response can satisfy, yielding zero reward signal. QUBRIC addresses this bottleneck through joint design, using teacher-derived key points to rewrite queries as concrete scenario-based questions, contrastive rubric generation to convert teacher-policy gaps into specific criteria, and learnability filtering to keep only informative pairs for GRPO training. When trained solely on instruction-following data, the resulting model improves ArenaHard by 5.5 points over the supervised baseline and transfers to three held-out benchmarks in legal, moral, and narrative reasoning with a 6.3-point average gain concentrated in reasoning dimensions. This demonstrates that query-rubric co-design can make rubric-based RL a practical route beyond strictly verifiable tasks.

Core claim

QUBRIC is a framework that co-designs queries and rubrics for rubric-based RL. It grounds query rewriting in teacher-derived key points to create scenario-based, evaluable questions, applies contrastive rubric generation to turn gaps between teacher and policy responses into query-level criteria, and uses learnability filtering to retain only informative query-rubric pairs for GRPO training. Applied to instruction-following data, this produces a 5.5-point gain on ArenaHard over the SFT baseline and average 6.3-point gains on three held-out benchmarks spanning legal, moral, and narrative reasoning, with improvements concentrated in reasoning-related dimensions.

What carries the argument

The QUBRIC co-design process that rewrites open-ended queries using teacher-derived key points, generates contrastive rubrics from teacher-policy gaps, and applies learnability filtering before GRPO training.

If this is right

  • Rubric-based RL becomes practical for open-ended instruction-following tasks that lack strict verifiability.
  • Training only on instruction-following data produces transferable gains on held-out domains such as legal and moral reasoning.
  • Performance gains concentrate in reasoning-related evaluation dimensions.
  • Query structure must be optimized jointly with rubrics to avoid training signals that collapse to zero.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The co-design principle could reduce dependence on hand-crafted verifiable rewards in broader RL pipelines.
  • Similar query rewriting and filtering steps might be tested with other policy optimization methods besides GRPO.
  • Automating extraction of the teacher key points could lower the requirement for a strong teacher model.

Load-bearing premise

Teacher-derived key points can rewrite open-ended queries into scenario-based questions without introducing fabricated references that no model response can verify.

What would settle it

Observing that the rewritten queries contain fabricated or unverifiable references, resulting in zero reward for all responses during training.

Figures

Figures reproduced from arXiv: 2606.03968 by Bing Yin, Chao Zhang, Jingfeng Yang, Priyanka Nigam, Qingyu Yin, Rongzhi Zhang, Rui Feng, Tuo Zhao, Xin Liu, Zhihan Zhang, Zixuan Zhang.

Figure 1
Figure 1. Figure 1: Overview of QUBRIC. Module 1: Open-ended queries are rewritten into scenario￾grounded, evaluable questions via key-point extraction from multiple teacher responses and implicit-reasoning scenario construction. Module 2: Contrastive rubric generation compares teacher and student (policy) responses to extract query-level criteria. Module 3: An LLM judge grades each rollout against both query-level and global… view at source ↗
Figure 2
Figure 2. Figure 2: Training dynamics under three query-rubric strategies. [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
read the original abstract

Rubric-based RL is a promising route for extending reinforcement learning beyond verifiable rewards, yet existing methods optimize rubrics while treating the query distribution as fixed. We identify a structural bottleneck: rubric quality is constrained by query structure. Open-ended queries yield vague rubrics; naively narrowing them introduces fabricated references that no model can verify, so all responses fail and training receives no reward signal. We present QUBRIC, a framework that co-designs queries and rubrics. Teacher-derived key points ground the rewriting of open-ended queries into scenario-based, evaluable questions. Contrastive rubric generation then turns teacher-policy gaps into query-level criteria, and learnability filtering retains only informative query-rubric pairs for GRPO training. QUBRIC achieves a +5.5 point gain on ArenaHard over the SFT baseline. Trained only on instruction-following data, it further transfers to three held-out benchmarks spanning legal, moral, and narrative reasoning (+6.3 points on average), with improvements concentrated in reasoning-related dimensions. These results provide evidence that co-designing queries and rubrics can make rubric-based RL a practical complement to RLVR beyond strictly verifiable tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces QUBRIC, a co-design framework for queries and rubrics to extend RL beyond verifiable rewards. It diagnoses that open-ended queries produce vague rubrics while naive narrowing introduces unverifiable fabricated references (yielding zero-reward signals). The proposed solution rewrites queries into scenario-based forms using teacher-derived key points, generates contrastive rubrics from teacher-policy gaps, applies learnability filtering, and trains with GRPO. Empirical claims include a +5.5 point gain on ArenaHard over SFT and +6.3 average transfer to three held-out benchmarks (legal, moral, narrative), with gains concentrated in reasoning dimensions.

Significance. If the reported gains prove robust and the rewriting step reliably yields evaluable queries with usable non-zero rubric scores, the work would provide a practical route to rubric-based RL on open-ended reasoning tasks, complementing RLVR. The co-design insight directly targets a structural bottleneck identified in the abstract.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (Query-Rubric Co-Design): The central claim that teacher-derived key points plus learnability filtering produce evaluable scenario-based questions without fabricated references (ensuring non-zero reward signals) is asserted but not supported by any explicit verification, fraction of retained queries, or example showing that at least some responses receive positive rubric scores. This assumption is load-bearing for the +5.5 and +6.3 gains, as zero-reward collapse would invalidate the GRPO training results.
  2. [§4] §4 (Experiments): The reported +5.5 ArenaHard gain and +6.3 transfer average are presented without details on number of runs, standard deviations, statistical significance tests, or controls isolating the contribution of the rewriting step versus rubric generation. This makes it impossible to assess whether the improvements are robust or sensitive to post-hoc choices in query filtering.
minor comments (1)
  1. [Abstract] The abstract mentions 'learnability filtering' but provides no pseudocode or threshold definition; a short methods subsection or appendix would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments identifying the need for explicit verification of non-zero rewards and for greater statistical rigor in the experiments. We address each point below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (Query-Rubric Co-Design): The central claim that teacher-derived key points plus learnability filtering produce evaluable scenario-based questions without fabricated references (ensuring non-zero reward signals) is asserted but not supported by any explicit verification, fraction of retained queries, or example showing that at least some responses receive positive rubric scores. This assumption is load-bearing for the +5.5 and +6.3 gains, as zero-reward collapse would invalidate the GRPO training results.

    Authors: We agree that the manuscript would benefit from explicit verification of this load-bearing assumption. In the revision we will add the fraction of queries retained after learnability filtering together with concrete examples of query-rubric pairs and the distribution of rubric scores obtained on model responses, thereby demonstrating that a non-trivial fraction of responses receive positive scores and that zero-reward collapse is avoided. revision: yes

  2. Referee: [§4] §4 (Experiments): The reported +5.5 ArenaHard gain and +6.3 transfer average are presented without details on number of runs, standard deviations, statistical significance tests, or controls isolating the contribution of the rewriting step versus rubric generation. This makes it impossible to assess whether the improvements are robust or sensitive to post-hoc choices in query filtering.

    Authors: We acknowledge the absence of these details. The revised §4 will report the number of independent runs, standard deviations, results of statistical significance tests, and ablation experiments that isolate the query-rewriting component from rubric generation, allowing readers to evaluate robustness and sensitivity to filtering choices. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method with no derivations or self-referential reductions

full rationale

The paper presents QUBRIC as a procedural framework involving teacher key points for query rewriting, contrastive rubric generation, learnability filtering, and GRPO training. Reported gains (+5.5 on ArenaHard, +6.3 average transfer) are framed as experimental outcomes from training on instruction-following data. No equations, mathematical derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or description. The central claims rest on benchmark results rather than any chain that reduces by construction to its own inputs, satisfying the criteria for a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the method implicitly assumes reliable teacher key points and that contrastive gaps produce learnable criteria, but these are not formalized.

pith-pipeline@v0.9.1-grok · 5768 in / 1175 out tokens · 20716 ms · 2026-06-28T10:25:39.197114+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

37 extracted references · 14 canonical work pages · 7 internal anchors

  1. [1]

    Self-Instruct: Aligning Language Models with Self-Generated Instructions

    Self-Instruct: Aligning language models with self-generated instructions , author=. arXiv preprint arXiv:2212.10560 , year=

  2. [2]

    Xu, Can and Sun, Qingfeng and Zheng, Kai and Geng, Xiubo and Zhao, Pu and Feng, Jiazhan and Tao, Chongyang and Jiang, Daxin , journal=

  3. [3]

    Advances in Neural Information Processing Systems , volume=

    Training language models to follow instructions with human feedback , author=. Advances in Neural Information Processing Systems , volume=

  4. [4]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Deepseekmath: Pushing the limits of mathematical reasoning in open language models , author=. arXiv preprint arXiv:2402.03300 , year=

  5. [5]

    arXiv preprint arXiv:2501.12948 , year=

  6. [6]

    Second Conference on Language Modeling , year=

    Tulu 3: Pushing Frontiers in Open Language Model Post-Training , author=. Second Conference on Language Modeling , year=

  7. [7]

    International Conference on Learning Representations , year=

    The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models , author=. International Conference on Learning Representations , year=

  8. [8]

    Advances in Neural Information Processing Systems , volume=

    Defining and characterizing reward hacking , author=. Advances in Neural Information Processing Systems , volume=

  9. [9]

    Advances in Neural Information Processing Systems , volume=

    Deep reinforcement learning from human preferences , author=. Advances in Neural Information Processing Systems , volume=

  10. [10]

    Proximal Policy Optimization Algorithms

    Proximal policy optimization algorithms , author=. arXiv preprint arXiv:1707.06347 , year=

  11. [11]

    Checklists are all you need for

    Gao, Yifan and Zheng, Zhi and Yu, Xinyan and Mishra, Swaroop and Ren, Xiang , journal=. Checklists are all you need for

  12. [12]

    Advances in Neural Information Processing Systems , volume=

    Rule based rewards for language model safety , author=. Advances in Neural Information Processing Systems , volume=

  13. [13]

    Yuan, Hongyi and Yuan, Zheng and Tan, Chuanqi and Wang, Wei and Huang, Songfang and Huang, Fei , journal=

  14. [14]

    Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains , author=

  15. [15]

    arXiv preprint , year=

  16. [16]

    and Zhang, Hao and Gonzalez, Joseph E

    Zheng, Lianmin and Chiang, Wei-Lin and Sheng, Ying and Zhuang, Siyuan and Wu, Zhanghao and Zhuang, Yonghao and Lin, Zi and Li, Zhuohan and Li, Dacheng and Xing, Eric P. and Zhang, Hao and Gonzalez, Joseph E. and Stoica, Ion , booktitle=. Judging

  17. [17]

    Instruction-Following Evaluation for Large Language Models

    Instruction-following evaluation for large language models , author=. arXiv preprint arXiv:2311.07911 , year=

  18. [18]

    From Crowdsourced Data to High-quality Benchmarks: Arena-Hard and Benchbuilder Pipeline , booktitle =

    Tianle Li and Wei. From Crowdsourced Data to High-quality Benchmarks: Arena-Hard and Benchbuilder Pipeline , booktitle =. 2025 , url =

  19. [19]

    He, Yun and Li, Wenzhe and Zhang, Hejia and Li, Songlin and Mandyam, Karishma and Khosla, Sopan and Xiong, Yuanhao and Wang, Nanshu and Peng, Xiaoliang and Li, Beibin and others , journal=

  20. [20]

    Chiu, Yu Ying and Lee, Michael S. and Calcott, Rachel and Handoko, Brandon and de Font-Reaulx, Paul and Rodriguez, Paula and Zhang, Chen Bo Calvin and Han, Ziwen and Sehwag, Udari Madhushani and Maurya, Yash and others , journal=

  21. [21]

    Shi, Yuzhen and Liu, Huanghai and Hu, Yiran and Song, Gaojie and Xu, Xinran and others , journal=

  22. [22]

    Sprague, Zayne and Ye, Xi and Bostrom, Kaj and Chaudhuri, Swarat and Durrett, Greg , booktitle=

  23. [23]

    Advances in Neural Information Processing Systems , volume=

    Direct preference optimization: Your language model is secretly a reward model , author=. Advances in Neural Information Processing Systems , volume=

  24. [24]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Training a helpful and harmless assistant with reinforcement learning from human feedback , author=. arXiv preprint arXiv:2204.05862 , year=

  25. [25]

    The Fourteenth International Conference on Learning Representations , year=

    Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy , author=. The Fourteenth International Conference on Learning Representations , year=

  26. [26]

    The Thirteenth International Conference on Learning Representations,

    Guanting Dong and Keming Lu and Chengpeng Li and Tingyu Xia and Bowen Yu and Chang Zhou and Jingren Zhou , title =. The Thirteenth International Conference on Learning Representations,. 2025 , url =

  27. [27]

    The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

    Checklists Are Better Than Reward Models For Aligning Language Models , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

  28. [28]

    arXiv preprint arXiv:2508.16949 , year=

    Breaking the exploration bottleneck: Rubric-scaffolded reinforcement learning for general llm reasoning , author=. arXiv preprint arXiv:2508.16949 , year=

  29. [29]

    arXiv preprint arXiv:2602.14069 , year=

    Open Rubric System: Scaling reinforcement learning with pairwise adaptive rubric , author=. arXiv preprint arXiv:2602.14069 , year=

  30. [30]

    Alternating reinforcement learning for rubric-based reward modeling in non-verifiable

    Xu, Ran and Liu, Tianci and Dong, Zihan and Yu, Tony and Hong, Ilgee and Yang, Carl and Zhang, Linjun and Zhao, Tuo and Wang, Haoyu , journal=. Alternating reinforcement learning for rubric-based reward modeling in non-verifiable

  31. [31]

    DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research

    Dr tulu: Reinforcement learning with evolving rubrics for deep research , author=. arXiv preprint arXiv:2511.19399 , year=

  32. [32]

    arXiv preprint arXiv:2510.27044 , year=

    Limits of generalization in rlvr: Two case studies in mathematical reasoning , author=. arXiv preprint arXiv:2510.27044 , year=

  33. [33]

    arXiv preprint arXiv:2601.22975 , year=

    Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text , author=. arXiv preprint arXiv:2601.22975 , year=

  34. [34]

    arXiv preprint arXiv:2601.18533 , year=

    From Verifiable Dot to Reward Chain: Harnessing Verifiable Reference-based Rewards for Reinforcement Learning of Open-ended Generation , author=. arXiv preprint arXiv:2601.18533 , year=

  35. [35]

    arXiv preprint arXiv:2509.21500 , year=

    Chasing the tail: Effective rubric-based reward modeling for large language model post-training , author=. arXiv preprint arXiv:2509.21500 , year=

  36. [36]

    Preprint, arXiv:2508.12790

    Reinforcement learning with rubric anchors , author=. arXiv preprint arXiv:2508.12790 , year=

  37. [37]

    Findings of the Association for Computational Linguistics: ACL 2025 , pages=

    Multichallenge: A realistic multi-turn conversation evaluation benchmark challenging to frontier llms , author=. Findings of the Association for Computational Linguistics: ACL 2025 , pages=