pith. machine review for the scientific record.

arxiv: 2605.07461 · v1 · submitted 2026-05-08 · 💻 cs.CL

Recognition: no theorem link

Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance

Jiachen Yu, Junjie Wang, Yujiu Yang, Zhihao Xu

Pith reviewed 2026-05-11 01:47 UTC · model grok-4.3

classification 💻 cs.CL
keywords rubric generation · LLM reasoning · instruction following · joint supervision · self-generated rubrics · golden rubrics · verifier supervision · internal guidance

The pith

Making LLMs generate rubrics first as part of their reasoning trace, then supervising consistency with both self-created and golden rubrics, turns external evaluation into internal guidance and raises benchmark scores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Think-with-Rubrics, a training approach that embeds rubric creation directly into the model's generation sequence for instruction-following tasks. The LLM produces a rubric before its response, after which a verifier supplies joint signals on how well the response aligns with both the model's own rubric and a golden reference rubric. This design moves rubrics from post-hoc scoring tools into active elements that shape the reasoning process itself. Experiments show the method outperforms the baseline of using rubrics solely as external rewards by an average of 3.87 points across benchmarks. The authors further attribute gains to improved quality of self-generated rubrics and greater internal consistency in the final outputs.
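As a concrete reading of that training signal, the sketch below treats the model's trace as a <rubric> block followed by an <answer> block (the format the paper's figures show) and computes a joint consistency reward. The verifier.score(answer, rubric) interface, the tag parsing, and the equal weighting are assumptions for illustration, not the paper's specification.

```python
# Minimal sketch of the joint rubric-consistency reward described above.
# Assumptions: the policy emits <rubric>...</rubric><answer>...</answer>,
# and `verifier` exposes score(answer, rubric) -> float in [0, 1].
# The 0.5/0.5 weighting is illustrative, not taken from the paper.
import re

def split_trace(trace: str) -> tuple[str, str]:
    """Extract the self-generated rubric and the answer from one generation."""
    rubric = re.search(r"<rubric>(.*?)</rubric>", trace, re.S)
    answer = re.search(r"<answer>(.*?)</answer>", trace, re.S)
    return (rubric.group(1).strip() if rubric else "",
            answer.group(1).strip() if answer else "")

def joint_reward(trace: str, golden_rubric: str, verifier,
                 w_self: float = 0.5, w_gold: float = 0.5) -> float:
    """Joint supervision: consistency with the self-generated and golden rubrics."""
    self_rubric, answer = split_trace(trace)
    if not self_rubric or not answer:
        return 0.0  # malformed trace earns no reward
    r_self = verifier.score(answer, self_rubric)    # own-rubric consistency
    r_gold = verifier.score(answer, golden_rubric)  # golden-rubric consistency
    return w_self * r_self + w_gold * r_gold
```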

Core claim

Think-with-Rubrics has the LLM sequentially generate a rubric followed by a response; a trained rubric verifier then provides joint supervision by scoring the consistency of the response against both the self-generated rubric and the golden rubric. This integration converts the rubric from a disconnected external artifact into an internal component of the generation process, yielding measurable gains over standard rubric-as-reward training.

What carries the argument

The sequential generation of a rubric before the response, combined with joint verifier supervision on consistency with both the self-generated and golden rubrics, supplies the training signal that guides internal reasoning.

If this is right

  • Responses become more internally consistent with the evaluation criteria the model itself produced.
  • Self-generated rubrics increase in quality when the model receives supervision from both them and golden rubrics.
  • Golden-rubric supervision primarily refines the model's ability to produce useful rubrics, while self-rubric supervision strengthens response coherence.
  • The approach reduces reliance on purely post-generation reward signals for open-ended instruction tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models could apply the same upfront-rubric pattern at inference time to produce more self-aligned answers without any external verifier.
  • The joint-supervision pattern might generalize to other internal structures, such as explicit step-by-step plans or uncertainty estimates generated before the final answer.
  • Over repeated training cycles the model might internalize rubric creation so thoroughly that the explicit rubric step becomes optional while still retaining the consistency benefits.
  • This internal-guidance design could complement or reduce the need for separate reward models in reinforcement-learning pipelines for open-ended tasks.

Load-bearing premise

The trained rubric verifier must supply reliable consistency signals that actually improve the quality of the model's internal reasoning trace rather than merely correlating with better outputs.
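One way to probe that premise is to measure how often the verifier's consistency judgments agree with human labels on held-out (answer, rubric) pairs. A minimal sketch under assumed data shapes; the 0.5 decision threshold and the field layout are illustrative, not reported by the paper.

```python
# Sketch of the verifier-reliability check the premise calls for: agreement
# between verifier consistency scores and binary human judgments.
# The (answer, rubric, human_label) layout and the threshold are assumptions.

def verifier_human_agreement(pairs, verifier, threshold: float = 0.5) -> float:
    """pairs: iterable of (answer, rubric, human_label), human_label in {0, 1}."""
    hits = total = 0
    for answer, rubric, human_label in pairs:
        pred = 1 if verifier.score(answer, rubric) >= threshold else 0
        hits += int(pred == human_label)
        total += 1
    return hits / total if total else 0.0
```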

What would settle it

Replace the trained verifier with a random or fixed inconsistent scorer during training; if performance nonetheless stays near the full Think-with-Rubrics level instead of falling back toward the Rubric-as-Reward baseline, the claim that joint supervision drives the gains would be falsified.
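That ablation amounts to swapping the trained verifier for a degenerate scorer while leaving the rest of the training loop untouched. RandomVerifier and ConstantVerifier below are hypothetical stand-ins with the same assumed score(answer, rubric) interface; they are not components of the paper.

```python
# Hypothetical degenerate verifiers for the falsification test above: if training
# against these still matches the trained verifier, the gains are coming from
# something other than the joint consistency signal.
import random

class RandomVerifier:
    """Scores every (answer, rubric) pair with uniform noise."""
    def __init__(self, seed: int = 0):
        self.rng = random.Random(seed)

    def score(self, answer: str, rubric: str) -> float:
        return self.rng.random()

class ConstantVerifier:
    """Returns the same score regardless of actual consistency."""
    def __init__(self, value: float = 0.5):
        self.value = value

    def score(self, answer: str, rubric: str) -> float:
        return self.value
```

Either object can be dropped in wherever the trained verifier is called in the reward sketch above.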

Figures

Figures reproduced from arXiv: 2605.07461 by Jiachen Yu, Junjie Wang, Yujiu Yang, Zhihao Xu.

Figure 1
Figure 1. THINK-WITH-RUBRICS transforms rubrics from mere external evaluation metrics into an intrinsic part of the reasoning process. By dual supervision derived from both golden and self-generated rubrics, it effectively enhances the model's instruction-following capabilities.
Figure 2
Figure 2. Overview of THINK-WITH-RUBRICS model training. Training is conducted in two phases: SFT warm-up and RL. SFT involves distilling Gemini-2.5-Flash for format adaptation, while RL utilizes a combination of golden and self-generated rubric consistency rewards for joint supervision.
Figure 3
Figure 3. A test case in IFEval that compares simple Rubric-as-Reward and THINK-WITH-RUBRICS.
Figure 4
Figure 4. The prompt format used for the Rubric-as-Reward baseline with CoT reasoning.
Figure 5
Figure 5. The prompt format used for THINK-WITH-RUBRICS models. The model is explicitly required to first enumerate hard rules and quality principles as a structured rubric before generating the final answer, mirroring the <rubric> → <answer> generation paradigm enforced during training.
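To make the <rubric> → <answer> paradigm of Figures 3 and 5 concrete, the sketch below assembles a trace in that format. The [Hard Rule] and [Principle] markers and the example rules echo the IFEval test case shown in Figure 3; the helper function and the exact wording around it are illustrative assumptions.

```python
# Illustrative trace in the <rubric> -> <answer> format of Figures 3 and 5.
# The rules echo the Figure 3 IFEval example; the helper itself is hypothetical.
RUBRIC_FIRST_TEMPLATE = """\
<rubric>
The response must be in English. [Hard Rule]
All words in the response must be capitalized. [Hard Rule]
The response must be no longer than 30 words. [Hard Rule]
The response should be concise and to the point. [Principle]
</rubric>
<answer>
{answer}
</answer>"""

def format_trace(answer: str) -> str:
    """Render a rubric-first trace for a given answer."""
    return RUBRIC_FIRST_TEMPLATE.format(answer=answer)
```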
read the original abstract

Rubrics have been extensively utilized for evaluating unverifiable, open-ended tasks, with recent research incorporating them into reward systems for reinforcement learning. However, existing frameworks typically treat rubrics only as external evaluator disjointed from the policy's primary reasoning trace. Such design confines rubrics to post-hoc measurement, leaving them unable to actively guide the model's generation process. In this work, we introduce Think-with-Rubrics, a novel paradigm for instruction following tasks. Think-with-Rubrics integrates rubric generation into the reasoning context, transforming the rubric from an independent artifact into an internal guidance of LLM's generation. During training, LLM sequentially generates a rubric followed by a response, while a trained rubric verifier provides joint supervision by evaluating the consistency between the answer and the self-generated / golden rubrics. Experiments across multiple benchmarks demonstrate that Think-with-Rubrics consistently outperforms the Rubric-as-Reward baseline supervised by golden rubrics by an average of 3.87 points. We have also discussed the mechanism by which Think-with-Rubrics enhances model performance. Experimental results demonstrate that supervision from golden rubrics and self-generated rubrics enhances the performance of Think-with-Rubrics by improving the quality of self-generated rubrics and increasing the internal consistency of responses respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces Think-with-Rubrics, a paradigm for instruction-following tasks in which an LLM sequentially generates a rubric followed by a response as part of its internal reasoning trace. A trained rubric verifier supplies joint supervision by checking consistency between the response and both the self-generated rubric and golden rubrics. Experiments across multiple benchmarks report that this approach outperforms the Rubric-as-Reward baseline (supervised only by golden rubrics) by an average of 3.87 points, with additional discussion of mechanisms by which golden-rubric and self-rubric supervision improve rubric quality and internal response consistency, respectively.

Significance. If the results hold after proper validation, the work offers a concrete way to move rubrics from external post-hoc evaluators into the model's generative process, which could improve reasoning quality and alignment on open-ended tasks. The reported gains and mechanistic discussion provide a starting point for this shift, though the current evidence base is too thin to assess generalizability or the precise source of the improvement.

major comments (3)
  1. [Abstract] Abstract: the central claim of a consistent 3.87-point average improvement is stated without identifying the benchmarks, number of runs, statistical tests, model scales, or training hyperparameters, leaving the empirical support for the paradigm shift only moderately substantiated.
  2. [Method] The description of the trained rubric verifier (and its role in joint supervision): no accuracy metrics, human agreement scores, or ablation studies isolating verifier errors are reported. Because the method replaces external evaluation with verifier-mediated internal guidance, the absence of verifier validation makes it impossible to rule out that the 3.87-point margin arises from policy exploitation of verifier idiosyncrasies rather than improved reasoning traces.
  3. [Abstract] Abstract / Experiments: the mechanistic claim that golden-rubric supervision improves self-generated rubric quality while self-rubric supervision increases internal consistency is asserted without quantitative breakdowns, controls, or ablations separating the two contributions, undermining attribution of the overall gain to the proposed internal-guidance mechanism.
minor comments (1)
  1. [Abstract] The abstract would benefit from a brief statement distinguishing the training-time sequential generation plus verifier supervision from any inference-time usage of the method.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments identify key areas where additional details and analyses would strengthen the presentation of our empirical results and mechanistic insights. We address each major comment below and have revised the manuscript to incorporate the suggested improvements.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim of a consistent 3.87-point average improvement is stated without identifying the benchmarks, number of runs, statistical tests, model scales, or training hyperparameters, leaving the empirical support for the paradigm shift only moderately substantiated.

    Authors: We agree that the abstract would benefit from greater specificity to substantiate the reported improvement. In the revised manuscript, we have expanded the abstract to explicitly name the benchmarks, report the number of runs, describe the statistical tests performed, indicate the model scales evaluated, and summarize the key training hyperparameters. These additions provide readers with the necessary context while preserving the abstract's conciseness. revision: yes

  2. Referee: [Method] The description of the trained rubric verifier (and its role in joint supervision): no accuracy metrics, human agreement scores, or ablation studies isolating verifier errors are reported. Because the method replaces external evaluation with verifier-mediated internal guidance, the absence of verifier validation makes it impossible to rule out that the 3.87-point margin arises from policy exploitation of verifier idiosyncrasies rather than improved reasoning traces.

    Authors: We acknowledge the importance of validating the rubric verifier, as its reliability is central to attributing gains to internal guidance rather than verifier artifacts. The original submission focused on the overall paradigm and did not include these metrics. In the revision, we have added the verifier's accuracy on held-out data, human agreement scores, and ablation experiments that quantify the impact of verifier errors. These results support that the observed improvements arise from enhanced reasoning traces. revision: yes

  3. Referee: [Abstract] Abstract / Experiments: the mechanistic claim that golden-rubric supervision improves self-generated rubric quality while self-rubric supervision increases internal consistency is asserted without quantitative breakdowns, controls, or ablations separating the two contributions, undermining attribution of the overall gain to the proposed internal-guidance mechanism.

    Authors: We agree that stronger quantitative evidence is needed to attribute the gains specifically to the internal-guidance mechanism. The original manuscript presented these mechanisms through discussion and overall results. We have revised the experiments section to include targeted ablations, quantitative breakdowns of rubric quality and response consistency metrics, and controls that isolate the effects of golden-rubric versus self-rubric supervision. These additions clarify the distinct contributions of each supervision signal. revision: yes

Circularity Check

0 steps flagged

No circularity detected in empirical claims

full rationale

The paper advances an empirical training paradigm in which an LLM generates rubrics then responses, supervised by a separately trained verifier on consistency with self-generated and golden rubrics. Its central result is a measured 3.87-point average gain over the Rubric-as-Reward baseline across benchmarks, presented as direct experimental comparison rather than any first-principles derivation, fitted-parameter prediction, or self-referential equation. No load-bearing steps reduce by construction to the method's own inputs; the verifier is treated as an external trained component, and performance is evaluated against an independent baseline. This structure is self-contained against external benchmarks and contains none of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Based solely on the abstract, the approach relies on standard LLM training assumptions; no explicit free parameters, invented entities, or non-standard axioms are detailed.

axioms (2)
  • domain assumption LLMs are capable of generating task-appropriate rubrics as part of reasoning
    Invoked in the description of the model sequentially generating rubric followed by response.
  • domain assumption A trained rubric verifier can accurately measure consistency between responses and rubrics
    Central to the joint supervision process described.

pith-pipeline@v0.9.0 · 5527 in / 1294 out tokens · 57345 ms · 2026-05-11T01:47:12.656276+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

28 extracted references · 9 internal anchors

  1. [1]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025

  2. [2]

    Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains

    Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains.arXiv preprint arXiv:2507.17746, 2025

  3. [3]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  4. [4]

    Rubrichub: A comprehensive and highly discriminative rubric dataset via automated coarse-to-fine generation

    Sunzhu Li, Jiale Zhao, Miteto Wei, Huimin Ren, Yang Zhou, Jingwen Yang, Shunyu Liu, Kaike Zhang, and Wei Chen. Rubrichub: A comprehensive and highly discriminative rubric dataset via automated coarse-to-fine generation.arXiv preprint arXiv:2601.08430, 2026

  5. [5]

    Skywork-reward-v2: Scaling preference data curation via human-ai synergy

    Chris Yuhao Liu, Liang Zeng, Yuzhen Xiao, Jujie He, Jiacai Liu, Chaojie Wang, Rui Yan, Wei Shen, Fuxiang Zhang, Jiacheng Xu, et al. Skywork-reward-v2: Scaling preference data curation via human-ai synergy.arXiv preprint arXiv:2507.01352, 2025

  6. [6]

    Openrubrics: Towards scalable synthetic rubric generation for reward modeling and llm alignment

    Tianci Liu, Ran Xu, Tony Yu, Ilgee Hong, Carl Yang, Tuo Zhao, and Haoyu Wang. Openrubrics: Towards scalable synthetic rubric generation for reward modeling and llm alignment.arXiv preprint arXiv:2510.07743, 2025

  7. [7]

    Self-refine: Iterative refinement with self-feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in neural information processing systems, 36:46534–46594, 2023

  8. [8]

    RewardBench 2: Advancing Reward Model Evaluation

    Saumya Malik, Valentina Pyatkin, Sander Land, Jacob Morrison, Noah A Smith, Hannaneh Hajishirzi, and Nathan Lambert. Rewardbench 2: Advancing reward model evaluation.arXiv preprint arXiv:2506.01937, 2025

  9. [9]

    Generalizing verifiable instruction following

    Valentina Pyatkin, Saumya Malik, Victoria Graf, Hamish Ivison, Shengyi Huang, Pradeep Dasigi, Nathan Lambert, and Hannaneh Hajishirzi. Generalizing verifiable instruction following.arXiv preprint arXiv:2507.02833, 2025

  10. [10]

    Infobench: Evaluating instruction following ability in large language models

    Yiwei Qin, Kaiqiang Song, Yebowen Hu, Wenlin Yao, Sangwoo Cho, Xiaoyang Wang, Xuansheng Wu, Fei Liu, Pengfei Liu, and Dong Yu. Infobench: Evaluating instruction following ability in large language models. InFindings of the Association for Computational Linguistics: ACL 2024, pages 13025–13048, 2024

  11. [11]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  12. [12]

    Hybridflow: A flexible and efficient rlhf framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. InProceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297, 2025

  13. [13]

    Reflexion: Language agents with verbal reinforcement learning

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

  14. [14]

    Sciif: Benchmarking scientific instruction following towards rigorous scientific intelligence

    Encheng Su, Jianyu Wu, Chen Tang, Lintao Wang, Pengze Li, Aoran Wang, Jinouwen Zhang, Yizhou Wang, Yuan Meng, Xinzhu Ma, et al. Sciif: Benchmarking scientific instruction following towards rigorous scientific intelligence.arXiv preprint arXiv:2601.04770, 2026

  15. [15]

    S2j: Bridging the gap between solving and judging ability in generative reward models

    Shaoning Sun, Jiachen Yu, Zongqi Wang, Xuewei Yang, Tianle Gu, and Yujiu Yang. S2j: Bridging the gap between solving and judging ability in generative reward models.arXiv preprint arXiv:2509.22099, 2025

  16. [16]

    Judgebench: A benchmark for evaluating llm-based judges

    Sijun Tan, Siyuan Zhuang, Kyle Montgomery, William Y Tang, Alejandro Cuadron, Chenguang Wang, Raluca Ada Popa, and Ion Stoica. Judgebench: A benchmark for evaluating llm-based judges. arXiv preprint arXiv:2410.12784, 2024

  17. [17]

    Checklists are better than reward models for aligning language models

    Vijay Viswanathan, Yanchao Sun, Shuang Ma, Xiang Kong, Meng Cao, Graham Neubig, and Tongshuang Wu. Checklists are better than reward models for aligning language models. arXiv preprint arXiv:2507.18624, 2025

  18. [18]

    Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models

    Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, and Ee-Peng Lim. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. In Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers), pages 2609–2634, 2023

  19. [19]

    Helpsteer 2: Open-source dataset for training top-performing reward models

    Zhilin Wang, Yi Dong, Olivier Delalleau, Jiaqi Zeng, Gerald Shen, Daniel Egert, Jimmy J Zhang, Makesh N Sreedhar, and Oleksii Kuchaiev. Helpsteer 2: Open-source dataset for training top-performing reward models.Advances in Neural Information Processing Systems, 37:1474–1501, 2024

  20. [20]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  21. [21]

    Learning llm-as-a-judge for preference alignment

    Ziyi Ye, Xiangsheng Li, Qiuchi Li, Qingyao Ai, Yujia Zhou, Wei Shen, Dong Yan, and Yiqun Liu. Learning llm-as-a-judge for preference alignment. InThe Thirteenth International Conference on Learning Representations, 2025

  22. [22]

    Improve llm-as-a-judge ability as a general ability

    Jiachen Yu, Shaoning Sun, Xiaohui Hu, Jiaxu Yan, Kaidong Yu, and Xuelong Li. Improve llm-as-a-judge ability as a general ability. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 14110–14126, 2025

  23. [23]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

  24. [24]

    Compassjudger-2: Towards generalist judge model via verifiable rewards

    Taolin Zhang, Maosong Cao, Alexander Lam, Songyang Zhang, and Kai Chen. Compassjudger-2: Towards generalist judge model via verifiable rewards.arXiv preprint arXiv:2507.09104, 2025

  25. [25]

    Firebench: Evaluating instruction following in enterprise and api-driven llm applications

    Yunfan Zhang, Yijie Bei, Jetashree Ravi, and Pawel Garbacki. Firebench: Evaluating instruction following in enterprise and api-driven llm applications.arXiv preprint arXiv:2603.04857, 2026

  26. [26]

    Llamafactory: Unified efficient fine-tuning of 100+ language models

    Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, and Zheyan Luo. Llamafactory: Unified efficient fine-tuning of 100+ language models. InProceedings of the 62nd annual meeting of the association for computational linguistics (volume 3: system demonstrations), pages 400–410, 2024

  27. [27]

    Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

    Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. Least-to-most prompting enables complex reasoning in large language models.arXiv preprint arXiv:2205.10625, 2022

  28. [28]

    Instruction-Following Evaluation for Large Language Models

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023

  29. [29]

    Tone: Academic, reflective

    Analyze the Request: Topic: Reflective essay on life changes since starting college. Tone: Academic, reflective. Length: 30 words or less. Format: English, capitalized, brief final answer

  30. [30]

    Keep it concise

    Drafting Content: Focus on personal growth, independence, and academic challenges. Keep it concise

  31. [31]

    Refining for Length: College has transformed my life from a sheltered student into an independent adult. I have learned to manage my time, navigate complex social dynamics, and develop a deeper sense of self-awareness, all of which have fundamentally reshaped my perspective on the world

  32. [32]

    Is it in English? Yes

    Final Check: Is it 30 words or less? Yes. Is it in English? Yes. Is it capitalized? Yes. Is it brief? Yes. <answer>College has transformed my life from a sheltered student into an independent adult. I have learned to manage my time, navigate complex social dynamics, and develop a deeper sense of self-awareness, all of which have fundamentally reshaped my ...

  33. [33]

    [Hard Rule]

    The response must be in English. [Hard Rule]

  34. [34]

    [Hard Rule]

    All words in the response must be capitalized. [Hard Rule]

  35. [35]

    [Hard Rule]

    The response must be no longer than 30 words. [Hard Rule]

  36. [36]

    [Hard Rule]

    The response must provide recommendations for writing a reflective essay. [Hard Rule]

  37. [37]

    [Principle]

    The response should be concise and to the point. [Principle]

  38. [38]

    [Principle] </rubric> <answer> START WITH A CLARIFYING QUESTION

    The response should directly address the user’s request. [Principle] </rubric> <answer> START WITH A CLARIFYING QUESTION. IDENTIFY KEY CHANGES. REFLECT ON CAUSES AND IMPACTS. CONCLUDE WITH INSIGHTS. </answer> Figure 3: A test case in IFEval that compares simple Rubric-as-Reward and THINK-WITH-RUBRICS. Under the THINK-WITH-RUBRICSsetting, the model generat...

  39. [39]

    [Principle] </rubric> <answer> [Insert Answer here] </answer> Figure 5: The prompt format used for THINK-WITH-RUBRICSmodels. The model is explicitly required to first enumerate hard rules and quality principles as a structured rubric before generating the final answer, mirroring the<rubric>→<answer>generation paradigm enforced during training. 15