arxiv: 2603.04738 · v2 · submitted 2026-03-05 · 💻 cs.CL

Recognition: no theorem link

IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation

Bosi Wen , Yilin Niu , Cunxiang Wang , Xiaoying Ling , Ying Zhang , Pei Ke , Hongning Wang , Minlie Huang

Authors on Pith no claims yet

Pith reviewed 2026-05-15 17:02 UTC · model grok-4.3

classification 💻 cs.CL

keywords instruction-followingjudge modelsmeta-evaluation benchmarkpreference graphslistwise evaluationLLM alignmentreward modeling

0 comments

The pith

A new benchmark using preference graphs exposes deficiencies in current judge models for instruction-following and correlates more strongly with downstream task performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Instruction-following in large language models relies on judge models to provide scalable feedback during alignment training. Existing meta-evaluation benchmarks suffer from narrow coverage of instruction types and rely on simple pairwise comparisons that do not match how models are optimized in practice. IF-RewardBench addresses these gaps by constructing, for each instruction, a full preference graph that encodes all pairwise preferences among multiple responses according to instruction-following quality. This structure supports listwise ranking tests that better reflect real-world use of judges. Experiments on the benchmark identify clear weaknesses in existing judge models while showing tighter alignment with actual downstream performance gains.

Core claim

IF-RewardBench constructs a preference graph for each instruction containing all pairwise preferences among multiple responses based on instruction-following quality, enabling a listwise evaluation paradigm that assesses judge models' ranking capabilities essential for guiding model alignment; extensive experiments reveal significant deficiencies in current judge models and demonstrate that the benchmark achieves a stronger positive correlation with downstream task performance compared to existing benchmarks.

What carries the argument

Preference graphs that encode every pairwise preference among multiple responses for a given instruction, enabling listwise rather than pairwise evaluation of judge models.

If this is right

Alignment procedures for large language models can use listwise feedback from improved judges to achieve more reliable instruction-following gains.
Meta-evaluation benchmarks should shift toward full preference graphs to match the ranking demands of actual model optimization.
Current judge models require targeted improvements to handle diverse constraint types without systematic errors.
Stronger benchmark correlation implies that progress measured on IF-RewardBench will translate more directly into measurable task performance.
Developers can diagnose specific judge weaknesses across instruction categories and select or fine-tune models accordingly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Graph-based listwise evaluation could extend to other model capabilities such as multi-step reasoning or safety constraints.
If the correlation holds, alignment research may reduce reliance on costly human preference collection by trusting automated listwise judges more.
The approach highlights a path toward training judge models directly on listwise ranking objectives rather than pairwise signals.
Automated generation of preference graphs without manual sampling could further scale the benchmark to new domains.

Load-bearing premise

The constructed preference graphs accurately reflect true instruction-following quality without systematic bias introduced by the graph-building process or response sampling.

What would settle it

An experiment showing that judge models ranking highly on IF-RewardBench produce no improvement or even worse results on downstream instruction-following tasks would falsify the stronger correlation claim.

Figures

Figures reproduced from arXiv: 2603.04738 by Bosi Wen, Cunxiang Wang, Hongning Wang, Minlie Huang, Pei Ke, Xiaoying Ling, Yilin Niu, Ying Zhang.

**Figure 2.** Figure 2: Overall framework of IF-RewardBench. Left: Collect instructions and responses from diverse sources. Center: Curate preference graphs via multi-stage annotation and verification. Right: Assess various judge models based on different evaluation paradigms. J contains the ground truth following judgement j ∗ ik for each response yi to each constraint ck, providing the basis for assessing Verification. And the… view at source ↗

**Figure 3.** Figure 3: The distribution of constraint categories and [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: The performance of judge models in verifica [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Various factors that influence the performance of judge models in ranking. "CA" and "OA" denote [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Pairwise accuracy of judge models in various [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Somers’ D correlation between the perfor [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

read the original abstract

Instruction-following is a foundational capability of large language models (LLMs), with its improvement hinging on scalable and accurate feedback from judge models. However, the reliability of current judge models in instruction-following remains underexplored due to several deficiencies of existing meta-evaluation benchmarks, such as their insufficient data coverage and oversimplified pairwise evaluation paradigms that misalign with model optimization scenarios. To this end, we propose IF-RewardBench, a comprehensive meta-evaluation benchmark for instruction-following that covers diverse instruction and constraint types. For each instruction, we construct a preference graph containing all pairwise preferences among multiple responses based on instruction-following quality. This design enables a listwise evaluation paradigm that assesses the capabilities of judge models to rank multiple responses, which is essential in guiding model alignment. Extensive experiments on IF-RewardBench reveal significant deficiencies in current judge models and demonstrate that our benchmark achieves a stronger positive correlation with downstream task performance compared to existing benchmarks. Our codes and data are available at https://github.com/thu-coai/IF-RewardBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes IF-RewardBench, a meta-evaluation benchmark for instruction-following judge models. For each instruction it builds a preference graph containing all pairwise preferences among multiple responses according to instruction-following quality, enabling listwise rather than pairwise evaluation. Experiments on the benchmark are said to expose significant deficiencies in current judge models and to show stronger positive correlation with downstream task performance than existing benchmarks.

Significance. If the preference graphs constitute unbiased ground truth, the benchmark would address documented gaps in coverage and evaluation paradigm of prior meta-evaluation suites and could improve the reliability of feedback signals used in LLM alignment. The reported stronger downstream correlation, if robust, would constitute a concrete advance in meta-evaluation methodology.

major comments (2)

[Abstract and Section 3] Abstract and Section 3 (Benchmark Construction): the claim that the preference graphs provide reliable ground truth for instruction-following quality rests on unspecified response sampling strategy, generation models, and label source (human, model, or hybrid). Any correlation between these choices and surface features (length, style, source model) would render the reported judge deficiencies and correlation gains artifacts of graph construction rather than genuine improvements.
[Experiments] Experiments section: no description is given of statistical controls, variance estimation, or multiple-testing correction for the downstream-task correlation comparisons. Without these, the assertion that IF-RewardBench achieves a stronger positive correlation cannot be evaluated for robustness.

minor comments (2)

[Abstract] The GitHub link is provided but the manuscript does not state the exact commit or data version used for the reported results; reproducibility would benefit from an explicit version tag.
[Section 4] Notation for listwise ranking metrics is introduced without an explicit equation or pseudocode; a small table or equation block would clarify the evaluation protocol.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate clarifications and additional statistical details.

read point-by-point responses

Referee: [Abstract and Section 3] Abstract and Section 3 (Benchmark Construction): the claim that the preference graphs provide reliable ground truth for instruction-following quality rests on unspecified response sampling strategy, generation models, and label source (human, model, or hybrid). Any correlation between these choices and surface features (length, style, source model) would render the reported judge deficiencies and correlation gains artifacts of graph construction rather than genuine improvements.

Authors: We agree that the current description in Section 3 leaves the sampling strategy, generation models, and label source insufficiently specified, which could raise questions about potential artifacts. In the revised version we will add explicit details: responses were sampled from a diverse set of open-source and proprietary models with controlled temperature and length constraints; labels were obtained via human annotation following a standardized rubric for instruction-following quality; and we will include an analysis showing that preference edges do not correlate strongly with surface features such as length or model identity. These additions will allow readers to evaluate the ground-truth reliability directly. revision: yes
Referee: [Experiments] Experiments section: no description is given of statistical controls, variance estimation, or multiple-testing correction for the downstream-task correlation comparisons. Without these, the assertion that IF-RewardBench achieves a stronger positive correlation cannot be evaluated for robustness.

Authors: We acknowledge the absence of explicit statistical controls in the current Experiments section. While the reported correlations were computed across multiple downstream tasks using both Pearson and Spearman coefficients, we did not document variance estimation or multiplicity adjustments. In the revision we will insert a dedicated statistical analysis paragraph that reports bootstrap standard errors for the correlation coefficients and applies FDR correction across the set of benchmark comparisons, thereby allowing readers to assess the robustness of the claimed improvement. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation or claims

full rationale

The paper constructs IF-RewardBench via preference graphs over multiple responses per instruction and reports empirical results on judge model deficiencies plus correlations to downstream tasks. These outcomes are obtained through external evaluation and comparison rather than any self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation that reduces the central claims to the benchmark inputs by construction. The abstract and description contain no equations or derivations that equate outputs to inputs; the listwise ranking paradigm and correlation measurements are independent measurements against held-out data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review based on abstract only; no explicit free parameters, axioms, or invented entities are identifiable from the provided text. Benchmark construction choices (instruction sampling, response generation, preference definition) are implicit but not detailed.

pith-pipeline@v0.9.0 · 5501 in / 973 out tokens · 37688 ms · 2026-05-15T17:02:59.638260+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SEIF: Self-Evolving Reinforcement Learning for Instruction Following
cs.CL 2026-05 conditional novelty 6.0

SEIF creates a self-reinforcing loop in which an LLM alternately generates increasingly difficult instructions and learns to follow them better using reinforcement learning signals from its own judgments.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · cited by 1 Pith paper · 4 internal anchors

[1]

ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

How to evaluate reward models for RLHF. In The Thirteenth International Conference on Learning Representations. Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chen- hui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, and 1 others. 2024. Chatglm: A family of large language models from glm-130b to glm-4 all tools.arXiv preprint arXiv:2406.12793....

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

Kimi K2: Open Agentic Intelligence

Kimi k2: Open agentic intelligence.arXiv preprint arXiv:2507.20534. Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations. Bosi Wen, Pei Ke, Xiaotao...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Qwen3 Technical Report

Qwen3 technical report.arXiv preprint arXiv:2505.09388. An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Day- iheng Liu, Fei Huang, Haoran Wei, and 1 others. 2024a. Qwen2.5 technical report.arXiv preprint arXiv:2412.15115. An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu,...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Instruction-Following Evaluation for Large Language Models

Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information pro- cessing systems, 36:46595–46623. Enyu Zhou, Guodong Zheng, Binghai Wang, Zhiheng Xi, Shihan Dou, Rong Bao, Wei Shen, Limao Xiong, Jessica Fan, Yurong Mou, Rui Zheng, Tao Gui, Qi Zhang, and Xuanjing Huang. 2025. RMB: Com- prehensively benchmarking reward models in LL...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

LLMBar" denotes

framework. G.3 Error Analysis in Instruction Hierarchy The instruction hierarchy, which establishes a prior- ity order from system prompts to user prompts, is essential for ensuring consistent and safe behavior of LLMs (Zhang et al., 2025d; Qin et al., 2025a). However, we observe that when conflicts arise be- tween system and user prompts, judge models of...

work page 2024
[6]

Your answer should contain exactly 4 paragraphs separated by 3 asterisks: ***

work page
[7]

Chosen Response (Response B) Writing is a craft that thrives on habit and observation

In your response, words with all capital letters should appear at least 2 times. Chosen Response (Response B) Writing is a craft that thrives on habit and observation. By dedicating time each day to the act of putting thoughts on paper, you lay the groundwork for continuous growth. *** * Establish a daily writing habit, even if it’s just a few minutes. PR...

work page
[8]

List symbols: System requires using *, User requires using -

work page
[9]

List quantity: System requires 3, User requires 5

work page
[10]

Paragraph quantity: System requires 4 paragraphs, User requires 2 paragraphs

work page
[11]

Paragraph separators: System requires using ***, User requires using —

work page
[12]

Comparison and analysis:

Uppercase letter usage: System requires at least using 2 full uppercase words, User requires must not use full uppercase words. Comparison and analysis:

work page
[13]

It provided 5 list items starting with -, divided into 2 parts, and used — to separate

Assistant A completely followed the last round of user prompt. It provided 5 list items starting with -, divided into 2 parts, and used — to separate. At the same time, it completely did not use uppercase letters (even for safety, adopted full lowercase), precisely satisfying the user’s format requirements for this blog post

work page
[14]

PROACTIVE

Assistant B completely followed the system prompt, while ignoring the format modifications explicitly proposed by the user prompt. It provided 3 list items starting with *, divided into 4 parts, and used *** to separate, also used "PROACTIVE" and "FOCUS", two full uppercase words. In handling instructions where conflicts exist, Assistant A chose to follow...

work page
[15]

Assistant A: Completely satisfied user’s format and content constraints, output structure consistent with requirements

work page
[16]

Assistant B: Format and wording multiple places not matching (wrong separators and item symbols, and contained uppercase words), deviated from instructions

work page
[17]

- " item symbols, and content was divided into using

Judgment reason: With instruction following as priority, Assistant A on key constraints no violations, obviously better. *** Summarizing above mentioned, choose Assistant A. *** [[A]]] Overall Assessment of GLM-4.6 Assistant A’s response strictly followed the last round user prompt’s all format requirements: it contained exactly 5 starting with "- " item ...

work page
[18]

Your response should contain fewer than 2 sentences

work page
[19]

Algiers offers a perfect blend of Mediterranean beaches, rich history, vibrant culture, and delicious cuisine that creates an unforgettable vacation experience

Your response should contain at least 1 sentence. Chosen Response (Response A) "Algiers offers a perfect blend of Mediterranean beaches, rich history, vibrant culture, and delicious cuisine that creates an unforgettable vacation experience." Golden Judgements of Chosen Response [1, 1, 1] Rejected Response (Response B) ’Algiers offers a unique blend of Med...

work page
[20]

satisfy all constraints mentioned above

Vague Constraints: The checklist includes ambiguous constraints, such as "satisfy all constraints mentioned above" or "your response must be correct"

work page
[21]

Omissions: The checklist misses mandatory constraints present in the system prompt, conversation history, or user instruction

work page
[22]

Expired Constraints: The checklist includes outdated constraints from the history or system prompt that no longer apply to the current round

work page
[23]

Fabricated Constraints: The checklist contains constraints not presented in the system prompt, history, or user instruction

work page
[24]

list the key points in an ordered list,

Incorrect Paraphrasing: The checklist inaccurately interprets a constraint (e.g., the original instruction states "list the key points in an ordered list," but the checklist incorrectly paraphrases it as "list the key points in an unordered list"). Step 2: Checklist Modification If you select "Incorrect" in Step 1, modify the checklist to make it correct....

work page
[25]

Source Integrity: All checklist items must originate from the provided information

work page
[26]

Completeness and Accuracy: Do not omit necessary constraints, duplicate existing ones, or fabricate new ones

work page
[27]

Followed

Atomicity: Each constraint must be atomic yet possess complete semantics, ensuring it can be fully understood without referencing other constraints. {examples} [System Prompt] {system_prompt} [Conversation History] {history} [Final Round User Instruction] {user_prompt} [Constraint Checklist] {checklist} Your judgment of the checklist quality: {option} A. ...

work page
[28]

Please analyze whether the response follows each constraint listed in the given checklist, providing a judgment for each constraint respectively

work page
[29]

Followed

Your judgments must be strict. Only responses that fully satisfy a constraint can be judged as "Followed". If there is any omission or error regarding a constraint, it must be judged as "Not Followed"

work page
[30]

It is unnecessary to consider whether the response follows any other constraints beyond the checklist

Please focus exclusively on the constraints within the given checklist. It is unnecessary to consider whether the response follows any other constraints beyond the checklist

work page
[31]

When judging the following of each constraint, your judgement should consider the complete context of the instructions, rather than interpreting the constraint in isolation. {in-context examples} [System Prompt] {system_prompt} [Coversation History] {history} [Final Round User Instruction] {user_prompt} [Model Response] {model_response} [Constraint Checkl...

work page
[34]

Your task is to evaluate the quality of the final round user instruction based on the following criteria

The final round user instruction. Your task is to evaluate the quality of the final round user instruction based on the following criteria. ## Evaluation Criteria: 1.Low Quality: The user instruction contains significant issues, such as inconsistent, incomplete, or ambiguous information, making it impossible to determine the intent behind the instruction....

work page
[35]

You do not need to answer the user instruction, but only need to output the evaluation result

work page
[36]

"" Analysis: . . . Prompt Quality:Low Quality/Medium Quality/High Quality

If the user instruction requires additional retrieval or the use of tools to obtain an answer, it should be evaluated asLow Quality. Here are some examples and the user instruction to be evaluated: [The Start of Examples] {in_context_examples} [The End of Examples] Below are the system prompt, conversation history, and final round user instruction: [The S...

work page
[40]

Score: [[5]]

A checklist: Lists all constraints that the AI assistant needs to satisfy when generating a response to the final round user instruction. Your task is to evaluate the difficulty of the final round user instruction. This instruction contains multiple con- straints that need to be followed. Your evaluation of its difficulty needs to comprehensively consider...

work page
[43]

The final round user instruction. Your task is to extract all constraints that need to be followed when generating a response to the final round user instruction, from the system prompt, conversation history, and the final round user instructions.Constraintsrefer to all instructional content, excluding auxiliary information such asbackground knowledge,tex...

work page
[44]

To reiterate:Constraintsrefer to all instructional content, excluding auxiliary information such asbackground knowledge,text materials, andin-context examples

You must extract and output all constraints in the order in which they appear in the instruction, without omitting any constraints or inventing constraints not present in the instruction. To reiterate:Constraintsrefer to all instructional content, excluding auxiliary information such asbackground knowledge,text materials, andin-context examples

work page
[45]

Each portion of the given instruction should appear in at most one constraint, and must not be repeated across multiple constraints

Do not output duplicate constraints. Each portion of the given instruction should appear in at most one constraint, and must not be repeated across multiple constraints

work page
[46]

Multiple dependent constraints should be merged into a single constraint

Each constraint should be atomic and independent, without inclusion or dependency relationships. Multiple dependent constraints should be merged into a single constraint. At the same time, ensure appropriate granularity of constraints: neither too fine nor too coarse. Each constraint should have complete semantics and be understandable on its own, without...

work page
[47]

satisfy the following requirements

Vague or general phrases in the instructions, such as "satisfy the following requirements" or "complete the following tasks", should not be considered as constraints and should be excluded from extraction. Below are the system prompt, conversation history, and final round user instruction: [The Start of System Prompt] {system_prompt} [The End of System Pr...

work page
[48]

A system prompt (which may be empty): Specifies the response rules or behavioral guidelines that the AI assistant needs to satisfy throughout the dialogue

work page
[49]

Conversation history between the user and the AI assistant (which may be empty): Presents the dialogue process between the user and the AI assistant, which consists of multiple rounds of user instructions and AI assistant responses

work page
[50]

The final round user instruction

work page
[51]

The final round assistant response

work page
[52]

"" Remember the following points: (1) Your judgment should be as strict as possible. You should only output

A checklist: Lists all constraints that the AI assistant needs to satisfy when generating a response to the final round user instruction. Your task is to think carefully and provide a detailed analysis of whether the final round assistant response follows each constraint on the checklist. You need to analyze and judge every constraint in the checklist. Do...

work page