MCJudgeBench: A Benchmark for Constraint-Level Judge Evaluation in Multi-Constraint Instruction Following
Pith reviewed 2026-05-07 16:19 UTC · model grok-4.3
The pith
LLM judges' overall accuracy does not guarantee reliable per-constraint detection in multi-constraint tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that judge reliability has multiple dimensions: strong overall performance does not guarantee equally reliable detection across label categories, especially for rarer partial and no cases. Judges with higher correctness do not always have lower inconsistency. Evaluation with reasoning improves correctness but does not uniformly improve stability. This motivates evaluating LLM judges at the constraint level to study these failure modes.
What carries the argument
MCJudgeBench, which supplies explicit per-constraint gold labels in yes/partial/no and controlled response-side perturbations to measure both correctness and inconsistency under prompt and response changes.
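As a rough sketch of the instance structure described above (field names here are illustrative, not the benchmark's actual released schema):

```python
from dataclasses import dataclass, field

Label = str  # one of "yes", "partial", "no"

@dataclass
class MCJudgeInstance:
    """Hypothetical schema mirroring the per-instance structure described
    in the abstract; names are our own, not the paper's."""
    instruction: str                # the multi-constraint instruction
    response: str                   # candidate response to be judged
    constraints: list[str]          # explicit per-constraint list
    gold_labels: list[Label]        # one gold label per constraint
    perturbations: list[str] = field(default_factory=list)  # response-side variants

    def __post_init__(self):
        # One gold label per constraint, drawn from the three-way label set.
        assert len(self.constraints) == len(self.gold_labels)
        assert all(l in {"yes", "partial", "no"} for l in self.gold_labels)

inst = MCJudgeInstance(
    instruction="Write a haiku about rain that mentions umbrellas.",
    response="Rain taps the window / umbrellas bloom on the street / spring sighs at dusk",
    constraints=["Is the text a haiku?", "Does it mention umbrellas?"],
    gold_labels=["yes", "yes"],
)
```

The point of the schema is that correctness is scored per constraint, not per response: a judge is asked each constraint question separately and compared against its own gold label.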
If this is right
- Constraint-level evaluation is required to detect specific weaknesses in judge performance across yes, partial, and no categories.
- Adding reasoning to judge prompts increases correctness but requires separate checks for stability.
- Overall response judgments alone are insufficient for reliable multi-constraint assessment.
- Inconsistency should be measured under both stochastic decoding and prompt/response perturbations.
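The two inconsistency flavors in the last point can both be sketched as disagreement rates over repeated per-constraint judgments (our construction for illustration, not the paper's exact formulas):

```python
from collections import Counter

def disagreement_rate(labels):
    """Fraction of repeated judgments that deviate from the majority label.

    0.0 means all repeats agree; higher values mean more instability.
    """
    if not labels:
        return 0.0
    majority_count = Counter(labels).most_common(1)[0][1]
    return 1.0 - majority_count / len(labels)

# Intrinsic inconsistency: same prompt, repeated stochastic decodes.
intrinsic = disagreement_rate(["yes", "yes", "partial", "yes"])

# Procedural inconsistency: one decode per label-preserving prompt/response
# perturbation, where the gold label should not move.
procedural = disagreement_rate(["yes", "no", "yes", "yes"])

print(intrinsic, procedural)  # both 0.25 for these toy judgments
```

Separating the two matters because a judge can be stable under resampling yet flip its answer the moment the constraint list is reformatted, or vice versa.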
Where Pith is reading between the lines
- Applications using LLM judges for tasks like content moderation or instruction verification may need additional safeguards for partial satisfaction cases.
- Training data for judges could benefit from more examples of partial constraint satisfaction to improve balance across categories.
- Similar per-component labeling and perturbation methods might improve evaluation benchmarks for other complex judgment tasks.
Load-bearing premise
The per-constraint gold labels are accurate and the controlled response-side perturbations isolate judge behavior without introducing unintended biases or changing the underlying constraint satisfaction in unmeasured ways.
What would settle it
An evaluation where high overall correctness in judges consistently predicts equally high correctness on partial and no cases across diverse tests would challenge the multi-dimensional reliability claim.
Original abstract
Multi-constraint instruction following requires verifying whether a response satisfies multiple individual requirements, yet LLM judges are often assessed only through overall-response judgments. We introduce MCJudgeBench, a benchmark for constraint-level judge evaluation in multi-constraint instruction following. Each instance includes an instruction, a candidate response, an explicit constraint list, per-constraint gold labels in {yes, partial, no}, and controlled response-side perturbations. The evaluation protocol further includes evaluation prompt variants to test judge stability. We evaluate proprietary and open-source LLM judges using both correctness and inconsistency metrics, distinguishing intrinsic inconsistency under stochastic decoding from procedural inconsistency under prompt and response perturbations. Our results show that judge reliability has multiple dimensions: strong overall performance does not guarantee equally reliable detection across label categories, especially for rarer partial and no cases. Judges with higher correctness do not always have lower inconsistency. Evaluation with reasoning improves correctness but does not uniformly improve stability. These findings motivate evaluating LLM judges at the constraint level to study these failure modes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MCJudgeBench, a benchmark for constraint-level evaluation of LLM judges in multi-constraint instruction following. Each instance consists of an instruction, candidate response, explicit constraint list, per-constraint gold labels in {yes, partial, no}, and controlled response-side perturbations. The evaluation protocol tests judge stability via prompt variants and distinguishes intrinsic inconsistency (stochastic decoding) from procedural inconsistency (prompt/response perturbations). Empirical results on proprietary and open-source judges indicate that reliability is multi-dimensional: high overall correctness does not ensure reliable detection on rarer partial/no labels, correctness and inconsistency are not always correlated, and reasoning in evaluation prompts improves correctness without uniformly improving stability.
Significance. If the per-constraint labels and perturbation isolation hold, the work offers a valuable methodological contribution by moving beyond aggregate response-level judgments to granular analysis of judge behavior. The controlled perturbations and dual inconsistency metrics provide a concrete framework for diagnosing failure modes that overall metrics obscure, which could inform more reliable judge design for applications requiring precise multi-constraint verification.
Major comments (1)
- [Benchmark Construction and Evaluation Protocol] The headline claims about multi-dimensional reliability, weaker detection on partial/no cases, and the dissociation between correctness and inconsistency rest on the fidelity of the per-constraint gold labels and the isolation property of the response perturbations. The manuscript provides no inter-annotator agreement figures, label-source validation, or empirical checks confirming that perturbations alter only the targeted constraint without side effects on other constraints or overall semantics (see benchmark construction and evaluation protocol descriptions).
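The missing isolation check could, in a minimal form, re-annotate each perturbed response and flag any non-targeted constraint whose label moved; this is our sketch of such a check, not the paper's protocol:

```python
def isolation_violations(gold, relabeled, targeted_idx):
    """Indices of non-targeted constraints whose label changed after a
    perturbation. Any hit means the perturbation was not isolated to
    the constraint it was supposed to affect."""
    return [i for i, (g, r) in enumerate(zip(gold, relabeled))
            if i != targeted_idx and g != r]

gold      = ["yes", "partial", "no", "yes"]
# Hypothetical re-annotation after perturbing constraint 2 only:
relabeled = ["yes", "partial", "no", "partial"]

print(isolation_violations(gold, relabeled, targeted_idx=2))  # [3]
```

A benchmark-wide violation rate near zero would directly support the isolation property the headline claims depend on.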
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the potential value of MCJudgeBench in providing a more granular evaluation framework. We address the single major comment below and will incorporate revisions to strengthen the supporting evidence for our claims.
Point-by-point responses
Referee: The headline claims about multi-dimensional reliability, weaker detection on partial/no cases, and the dissociation between correctness and inconsistency rest on the fidelity of the per-constraint gold labels and the isolation property of the response perturbations. The manuscript provides no inter-annotator agreement figures, label-source validation, or empirical checks confirming that perturbations alter only the targeted constraint without side effects on other constraints or overall semantics (see benchmark construction and evaluation protocol descriptions).
Authors: We agree that the reliability claims depend on the accuracy of the per-constraint gold labels and the targeted nature of the perturbations. The benchmark construction section describes the labels as human-generated per-constraint judgments in {yes, partial, no} and the perturbations as controlled, response-side modifications designed to affect only one constraint at a time. However, the manuscript indeed lacks quantitative inter-annotator agreement statistics, explicit label-source validation details, and empirical checks (e.g., semantic similarity or secondary annotations) confirming isolation with no unintended side effects on other constraints or overall response semantics. To address this directly, we will revise the benchmark construction and evaluation protocol sections to include: (1) details on the annotation process and label sourcing, (2) inter-annotator agreement figures from the original labeling, and (3) new empirical analysis verifying perturbation isolation (e.g., via automated metrics and spot-check human validation showing minimal impact on non-targeted constraints). These additions will substantiate the headline claims while leaving the core experimental results and conclusions unchanged.
Revision: yes
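The promised agreement figures could take the standard form of Cohen's kappa over per-constraint labels; the annotator data below is illustrative, not from the paper:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences of equal length."""
    assert len(a) == len(b) and a
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n        # observed agreement
    ca, cb = Counter(a), Counter(b)
    labels = set(a) | set(b)
    p_exp = sum(ca[l] * cb[l] for l in labels) / (n * n)  # chance agreement
    if p_exp == 1.0:
        return 1.0  # degenerate case: both annotators constant and identical
    return (p_obs - p_exp) / (1.0 - p_exp)

# Hypothetical per-constraint labels from two annotators.
ann1 = ["yes", "yes", "partial", "no", "yes", "no"]
ann2 = ["yes", "partial", "partial", "no", "yes", "yes"]

k = cohens_kappa(ann1, ann2)  # 11/23, about 0.478 for this toy data
```

Reporting kappa per label category (yes vs. partial vs. no) would also directly speak to the claim that the rarer partial/no labels are the hard cases.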
Circularity Check
No circularity: empirical benchmark with no derivation or fitting
Full rationale
The paper introduces MCJudgeBench as an empirical evaluation resource consisting of instructions, responses, per-constraint gold labels in {yes, partial, no}, and controlled perturbations. It reports experimental results on judge correctness and inconsistency metrics across prompt variants but contains no equations, parameter fitting, predictions derived from inputs, or self-citation chains that reduce claims to their own definitions. All reported findings (category-specific reliability gaps, correctness-inconsistency dissociation, reasoning effects) are direct outcomes of running existing LLM judges on the constructed benchmark rather than any self-referential or fitted construction. The work is therefore self-contained with no load-bearing steps that collapse by construction.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Per-constraint gold labels in {yes, partial, no} provide a valid ground truth for evaluating judge correctness and inconsistency.
Reference graph
Works this paper leans on
-
[1]
FollowBench: A multi-level fine-grained constraints following benchmark for large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4747–4768. Association for Computational Linguistics.
-
[2]
WildIFEval: Instruction following in the wild. arXiv preprint arXiv:2503.06573.
-
[3]
Self-preference bias in LLM-as-a-judge. arXiv preprint arXiv:2410.21819.
-
[4–19]
Internal anchors into the paper itself rather than external works: the label-preserving perturbation prompt (preserve every constraint's adherence label; do not add, remove, or change anything that would alter a label; allow only local paraphrase; return the rewritten response or NO_SAFE_PERTURBATION), the constraint-formatting and template-reordering prompt variants, and the appendix examples in Figures 4 and 5 of cases where no label-preserving perturbation is feasible (a tightly constrained ASCII floor plan, and a minimal output that must remain exactly "R-rated").