Ask, Don't Judge: Binary Questions for Interpretable LLM Evaluation and Self-Improvement

Chenyang Zhu; Kushal Chawla; Pengshan Cai; Sambit Sahu; Sangwoo Cho; Shi-Xiong Zhang; Zefang Liu

arxiv: 2606.27226 · v1 · pith:RXRJ7RYUnew · submitted 2026-06-25 · 💻 cs.AI · cs.CL

Pith reviewed 2026-06-26 04:34 UTC · model grok-4.3

keywords LLM evaluation · binary questions · interpretable evaluation · prompt optimization · factual consistency · self-improvement · SummEval · QAGS

The pith

Decomposing LLM evaluation into generated binary questions produces interpretable scores that match human judgments and enable iterative prompt improvement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that evaluation criteria can be broken into atomic yes-no questions generated by a meta-prompt, with an LLM answering each independently before aggregation into multi-dimensional scores. A sympathetic reader cares because holistic LLM judges often yield opaque numbers that are hard to inspect or debug, while this decomposition supplies transparent question-level feedback. The method is shown to match or exceed baselines like UniEval and G-Eval on SummEval, Topical-Chat, and especially QAGS, while producing score distributions closer to humans and avoiding ceiling effects that collapse distinctions between outputs. The same question answers also drive iterative optimization of both evaluator and generation prompts under self-update and cross-model conditions.

Core claim

BINEVAL decomposes any task's evaluation criteria into a set of atomic binary questions via meta-prompt, obtains independent LLM verdicts on each question for a given output, and aggregates those verdicts into calibrated multi-dimensional scores together with the raw question-level answers; this yields evaluation that is more inspectable, better calibrated to human distributions, and directly usable as feedback for prompt optimization without training.
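The pipeline in this claim can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `BinaryQuestion` type, the `evaluate` function, the toy judge, and the choice to score each dimension as the fraction of "yes" verdicts are all assumptions made here, since the abstract does not specify the aggregation rule.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class BinaryQuestion:
    dimension: str  # e.g. "consistency" or "fluency"
    text: str       # a yes/no question about the output


def evaluate(
    output: str,
    questions: List[BinaryQuestion],
    answer_fn: Callable[[str, str], bool],
) -> Dict[str, object]:
    """Answer every question independently, then aggregate per dimension."""
    # Each question is answered on its own, with no shared context.
    verdicts = [(q, answer_fn(q.text, output)) for q in questions]

    by_dim: Dict[str, List[bool]] = {}
    for q, verdict in verdicts:
        by_dim.setdefault(q.dimension, []).append(verdict)

    # Assumption: a dimension's score is its fraction of "yes" verdicts.
    scores = {dim: sum(v) / len(v) for dim, v in by_dim.items()}
    # Keep the raw question-level answers next to the scores.
    return {"scores": scores, "answers": [(q.text, v) for q, v in verdicts]}


def toy_answer_fn(question: str, output: str) -> bool:
    # Hypothetical stand-in for the LLM judge: fails the date question
    # whenever the output mentions "2024", passes everything else.
    return "2024" not in output if "dates" in question else True


questions = [
    BinaryQuestion("consistency", "Are all dates supported by the source?"),
    BinaryQuestion("consistency", "Are all named entities supported by the source?"),
    BinaryQuestion("fluency", "Is the summary free of grammatical errors?"),
]
result = evaluate("The match was played in 2024.", questions, toy_answer_fn)
```

In a real setting `answer_fn` would wrap one LLM call per question; returning the per-question answers alongside the scores is what makes a low score traceable to the specific question that failed.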

What carries the argument

The meta-prompt that turns task criteria into a covering set of atomic binary questions whose independent answers are aggregated into scores.

If this is right

BINEVAL matches or outperforms UniEval and G-Eval across SummEval, Topical-Chat, and QAGS while better matching human score distributions.
The question-level answers avoid ceiling effects and discriminate more clearly between borderline and clearly flawed outputs.
The same question feedback supports iterative prompt optimization on summarization and instruction-following tasks under both self-update and cross-model settings.
The framework is task-agnostic and training-free.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Binary-question feedback could be applied to domains such as code generation where holistic scores currently obscure which specific properties fail.
The decomposition might reduce certain forms of position or length bias that affect scalar LLM judges.
Aggregated binary answers could serve as a lightweight reward signal in reinforcement learning loops that would otherwise rely on human feedback.

Load-bearing premise

A meta-prompt can reliably produce a set of atomic, unbiased binary questions whose coverage of the intended criteria is complete enough that independent answers aggregate into calibrated scores.

What would settle it

Running the framework on the QAGS factual-consistency benchmark and finding that the aggregated scores have lower rank correlation with human ratings than a holistic LLM judge such as G-Eval.
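The test described here comes down to comparing two rank correlations against the same human ratings. A minimal, dependency-free sketch of Spearman's rank correlation follows; the function names and all score values are made up for illustration and are not taken from the paper.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for four summaries (illustrative only).
human = [0.2, 0.9, 0.5, 0.7]
decomposed = [0.1, 0.8, 0.6, 0.7]   # preserves the human ordering
saturated = [1.0, 1.0, 0.5, 1.0]    # ceiling effect: many ties at the top

rho_decomposed = spearman(human, decomposed)
rho_saturated = spearman(human, saturated)
```

A judge whose scores pile up at the maximum produces many tied ranks, which is one way a ceiling effect shows up as a lower rank correlation.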

Figures

Figures reproduced from arXiv: 2606.27226 by Chenyang Zhu, Kushal Chawla, Pengshan Cai, Sambit Sahu, Sangwoo Cho, Shi-Xiong Zhang, Zefang Liu.

**Figure 1.** Per-dimension score distributions on SummEval. BINEVAL shows its strongest correlation on consistency. Its distribution is closest to the human shape while still preserving useful spread; it also remains competitive on coherence and fluency, even when its calibration is slightly more conservative than human ratings.

**Figure 2.** Per-system average-score distributions on SummEval. Across the 16 summarization systems, BINEVAL (Claude) best tracks the relative ordering of systems, while the weaker baselines produce flatter and less discriminative score patterns. […] Spearman correlation (0.620), and even BINEVAL (gpt-oss) substantially outperforms G-Eval (gpt-oss), whose binary prompt produces too little score granularity for reliable ra…

**Figure 3.** Pairwise phi-coefficient correlation matrices within each SummEval dimension. Low off-diagonal values indicate questions capture distinct aspects of the dimension. Mean off-diagonal ϕ across all dimensions is 0.38. See Appendix E for question definitions. […] with many simpler ones, mirroring the benefits of task decomposition in prompting (Zhou et al., 2022; Khot et al., 2022). A question like “Are all n…

**Figure 4.** Illustrative SummEval consistency example. The summary contains subtle factual errors (underlined) that holistic scoring methods miss. BINEVAL decomposes consistency into seven binary questions, each targeting a specific error type, producing a score closely aligned with the human judgment. […] both G-Eval and UniEval assign a perfect consistency score of 5.0, because the summary is surface-plausible: it names …

**Figure 5.** Four illustrative SummEval examples, one per evaluation dimension. In each case, BINEVAL’s question decomposition produces scores closely aligned with human judgments by independently assessing multiple quality facets. Holistic methods (G-Eval, UniEval with gpt-oss) collapse to extreme scores on edge cases (short-but-correct summaries, partially readable text, or concise one-liners) because a single judgment…

**Figure 6.** Score comparisons for four illustrative SummEval examples, one per dimension. Dashed line marks the human reference. BINEVAL (Claude) consistently tracks human scores across all dimensions. G-Eval (GPT-4) and UniEval (T5), the published baselines, perform reasonably, but when their evaluation formats are applied to gpt-oss without Monte Carlo sampling or fine-tuning, scores collapse on edge cases.

**Figure 7.** Per-dimension score distributions on Topical-Chat. BINEVAL (Claude) most closely tracks the human distributions across naturalness, coherence, engagingness, and groundedness. UniEval (T5) exhibits clear ceiling effects, especially outside engagingness; BINEVAL (gpt-oss) remains more discriminative than the other gpt-oss-based baselines; and UniEval (gpt-oss) is nearly flat across dimensions.

**Figure 8.** Per-system score distributions on Topical-Chat. BINEVAL (Claude) best preserves the human ordering of systems while maintaining realistic within-system spread. BINEVAL (gpt-oss) follows the broad ranking but is more conservative in absolute score level, G-Eval (gpt-oss) compresses low- and mid-performing systems, and UniEval (gpt-oss) is nearly uninformative because its scores are almost constant across sy…

**Figure 9.** Figure 9 supports the same conclusion in distributional form. For both CNN/DM and XSum, the human ratings are distinctly bimodal, with substantial mass near both 0 and 1. BINEVAL (Claude) is the method that most clearly preserves this structure: it keeps broad support across the full range instead of collapsing toward the top of the scale. BINEVAL (gpt-oss) is somewhat more conservative but still retains visible sp…

**Figure 10.** Per-summary scatter plots against human consistency scores on QAGS. […] E. Binary Questions for SummEval: Tables 9–12 list the binary questions auto-generated by BINEVAL for each SummEval evaluation dimension. These are the questions referenced in Section 5.6 and …
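The phi coefficient in Figure 3 measures association between two binary variables, here the yes/no answers to two questions across the same set of outputs. A minimal sketch of the standard 2×2 formula is given below; the function name, the sample answers, and the convention of returning 0.0 when a question's answers are constant are assumptions for illustration, not details from the paper.

```python
from math import sqrt
from typing import Sequence


def phi_coefficient(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Phi coefficient between two equally long yes/no answer vectors."""
    if len(a) != len(b):
        raise ValueError("answer vectors must have the same length")
    # Cells of the 2x2 contingency table.
    n11 = sum(1 for x, y in zip(a, b) if x and y)
    n10 = sum(1 for x, y in zip(a, b) if x and not y)
    n01 = sum(1 for x, y in zip(a, b) if not x and y)
    n00 = sum(1 for x, y in zip(a, b) if not x and not y)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        # Assumed convention: a question with constant answers carries
        # no association signal, so report 0.0 instead of dividing by zero.
        return 0.0
    return (n11 * n00 - n10 * n01) / denom


# Hypothetical answers to three questions over four summaries.
q1 = [True, True, False, False]
q2 = [True, True, False, False]   # redundant with q1
q3 = [True, False, True, False]   # unrelated to q1

phi_redundant = phi_coefficient(q1, q2)
phi_distinct = phi_coefficient(q1, q3)
```

Values near 1 suggest two questions are probing the same property, while values near 0 suggest they capture distinct aspects, which is the reading the caption gives to low off-diagonal entries.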

Original abstract

Evaluating LLM outputs remains a major bottleneck in NLP: human evaluation is expensive and slow, lexical metrics correlate poorly with human judgments on open-ended generation, and holistic LLM judges often produce opaque scores that are hard to debug. We propose BINEVAL, a framework that decomposes evaluation criteria into atomic binary questions and aggregates the resulting verdicts into interpretable, multi-dimensional scores. Given a task prompt, a meta-prompt generates fine-grained evaluation questions, and an LLM answers them independently for each output, yielding transparent question-level feedback together with calibrated overall scores. This decomposition makes evaluation easier to inspect, easier to diagnose, and directly usable for prompt improvement. Across SummEval, Topical-Chat, and QAGS, BINEVAL matches or outperforms strong baselines including UniEval and G-Eval, with especially strong results on factual consistency benchmarks such as QAGS. Beyond competitive correlation with human judgments, BINEVAL better matches human score distributions and avoids the ceiling effects common in prior LLM judges, leading to better discrimination between borderline and clearly flawed outputs. We further show that the same question-level feedback supports iterative prompt optimization, improving evaluator prompts on summarization and generation prompts on IFBench under both self-update and cross-model update settings. Overall, BINEVAL provides a task-agnostic, training-free, and interpretable evaluation framework that combines strong empirical performance with practical diagnostic and optimization value.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

BINEVAL turns evaluation into inspectable binary questions that also drive prompt tweaks, but the meta-prompt step that creates those questions gets no direct validation.

BINEVAL replaces a single holistic LLM score with a set of atomic binary questions generated by a meta-prompt, then aggregates the yes/no answers into multi-dimensional scores. The same answers are fed back to improve the original task prompt or the evaluator prompt itself.

The practical part that stands out is the self-improvement loop. On IFBench they run iterative updates under both self and cross-model conditions and report gains. That moves the work beyond pure measurement into something usable for prompt engineering.
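One way such a loop could be wired up is sketched below. This is a hedged illustration under stated assumptions, not the procedure from the paper: `optimize_prompt`, its three callables, the greedy accept-if-better rule, and the early stop are all hypothetical choices, and the toy stand-ins replace real model calls.

```python
from typing import Callable, List, Tuple

Feedback = List[Tuple[str, bool]]  # (question text, verdict)


def optimize_prompt(
    prompt: str,
    generate: Callable[[str], str],
    evaluate: Callable[[str], Feedback],
    rewrite: Callable[[str, List[str]], str],
    max_iters: int = 3,
) -> Tuple[str, float]:
    """Greedy loop: rewrite the prompt from the failed questions,
    keep the candidate only if its pass rate improves."""

    def score(p: str) -> Tuple[float, List[str]]:
        feedback = evaluate(generate(p))  # assumes at least one question
        failed = [q for q, ok in feedback if not ok]
        return 1 - len(failed) / len(feedback), failed

    best_prompt = prompt
    best_score, failed = score(prompt)
    for _ in range(max_iters):
        if not failed:
            break  # every question passed
        candidate = rewrite(best_prompt, failed)
        cand_score, cand_failed = score(candidate)
        if cand_score <= best_score:
            break  # assumed stopping rule: stop at the first non-improvement
        best_prompt, best_score, failed = candidate, cand_score, cand_failed
    return best_prompt, best_score


# Toy stand-ins so the sketch runs without any model.
def toy_generate(p: str) -> str:
    return p  # the "output" simply echoes the prompt


def toy_evaluate(out: str) -> Feedback:
    return [
        ("Does the output mention sources?", "sources" in out),
        ("Is the output marked concise?", "concise" in out),
    ]


def toy_rewrite(p: str, failed: List[str]) -> str:
    # Addresses only the first failed question per iteration.
    hint = "sources" if "sources" in failed[0] else "concise"
    return p + " " + hint


best, best_score = optimize_prompt(
    "Summarize the article.", toy_generate, toy_evaluate, toy_rewrite
)
```

In this sketch, using the same model behind `generate` and `rewrite` would roughly correspond to a self-update setting and using different models to a cross-model setting; that mapping is an assumption, since the abstract does not define the two settings in detail.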

On the benchmarks the abstract cites—SummEval, Topical-Chat, QAGS—BINEVAL matches or beats UniEval and G-Eval, with stronger results on factual consistency and better alignment to human score distributions. The claim of reduced ceiling effects is plausible given how binary questions force finer distinctions.

The soft spot is exactly where the stress-test note points: nothing in the provided description shows that the meta-prompt produces questions that are atomic, unbiased, and exhaustive. No human audit, no coverage check, no ablation on question quality. If the generated questions systematically miss error types or embed the meta-prompt’s own priors, the downstream advantages over holistic judges become harder to trust. The abstract gives no statistical tests or experimental details either, so the correlations remain unverified for now.

This is for people who already run LLM judges and want something they can read and debug without extra training. It is not a foundational result, but the framing is straightforward enough that a serious referee could check the missing validation steps and the full experimental controls.

I would send it to peer review. The core idea is workable and the optimization use case is concrete; the gaps are fixable with added analysis rather than fatal.

Referee Report

2 major / 2 minor

Summary. The paper introduces BINEVAL, a framework that decomposes LLM evaluation criteria into atomic binary questions generated via a meta-prompt. An LLM answers these questions independently for each output, and verdicts are aggregated into interpretable multi-dimensional scores. Experiments on SummEval, Topical-Chat, and QAGS report that BINEVAL matches or outperforms baselines including UniEval and G-Eval (especially on factual consistency), better matches human score distributions, avoids ceiling effects, and that the question-level feedback enables iterative prompt optimization under self-update and cross-model settings.

Significance. If the empirical results hold, BINEVAL offers a training-free, task-agnostic, and interpretable alternative to holistic LLM judges, with added practical value for diagnosis and prompt self-improvement. The use of public benchmarks and the optimization experiments are clear strengths.

major comments (2)

Abstract: The claim that the meta-prompt generates 'fine-grained evaluation questions' that are atomic, unbiased, and collectively cover the target criteria (leading to calibrated scores and better discrimination) is load-bearing for all superiority claims over G-Eval and UniEval. The manuscript provides no human audit, inter-rater agreement, coverage analysis, or ablation on question quality; if the generated questions systematically under-cover error modes or embed meta-prompt priors, the subsequent independent answers and aggregation cannot deliver the reported calibration or performance gains.
Experiments section: The statements of 'competitive or superior correlations' and 'especially strong results on QAGS' require reported statistical tests, variance across runs, and ablations on the aggregation function to be load-bearing; without them the central empirical claims cannot be verified and post-hoc data handling cannot be ruled out.

minor comments (2)

Abstract: The aggregation procedure that converts independent binary verdicts into overall scores is described only at high level; a concise formal definition or pseudocode would improve reproducibility.
The paper would benefit from an explicit limitations paragraph addressing potential biases inherited from the meta-prompt LLM.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and will incorporate revisions to strengthen the empirical grounding and transparency of the claims.

Referee: Abstract: The claim that the meta-prompt generates 'fine-grained evaluation questions' that are atomic, unbiased, and collectively cover the target criteria is load-bearing. The manuscript provides no human audit, inter-rater agreement, coverage analysis, or ablation on question quality.

Authors: We agree this validation is important and currently missing. The manuscript relies on downstream task performance as indirect support. In revision we will add a dedicated analysis subsection with: (i) human evaluation of 200 sampled questions for atomicity and coverage of the original criteria, (ii) inter-annotator agreement (Cohen's kappa), and (iii) qualitative review for meta-prompt bias or under-covered error modes. This directly addresses the concern.
revision: yes

Referee: Experiments section: Statements of competitive or superior correlations and especially strong results on QAGS require reported statistical tests, variance across runs, and ablations on the aggregation function.

Authors: We accept the point. The current version reports mean correlations without significance testing or variance. We will revise the experimental results to include: (i) paired statistical significance tests (Wilcoxon signed-rank) against baselines, (ii) standard deviation across five independent runs with different random seeds, and (iii) an ablation comparing mean, majority-vote, and weighted aggregation functions. These additions will be placed in the main results tables and appendix.
revision: yes
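The three aggregation functions named in the proposed ablation could look like the following. This is a sketch only: the function names, the strict-majority tie rule, and the example weights are assumptions, and nothing here reflects how the paper actually aggregates verdicts.

```python
from typing import Sequence


def mean_score(verdicts: Sequence[bool]) -> float:
    """Fraction of questions answered "yes"."""
    return sum(verdicts) / len(verdicts)


def majority_vote(verdicts: Sequence[bool]) -> float:
    """1.0 if strictly more than half the answers are "yes", else 0.0.
    Ties count as 0.0 (an assumed convention)."""
    return 1.0 if 2 * sum(verdicts) > len(verdicts) else 0.0


def weighted_score(verdicts: Sequence[bool], weights: Sequence[float]) -> float:
    """Weighted fraction of "yes" answers; weights are hypothetical
    per-question importances and must not sum to zero."""
    if len(verdicts) != len(weights):
        raise ValueError("one weight per verdict is required")
    passed = sum(w for v, w in zip(verdicts, weights) if v)
    return passed / sum(weights)


# Illustrative verdicts for five questions on one output.
verdicts = [True, True, False, False, False]
weights = [3.0, 1.0, 1.0, 1.0, 1.0]  # first question treated as most important

scores = {
    "mean": mean_score(verdicts),
    "majority": majority_vote(verdicts),
    "weighted": weighted_score(verdicts, weights),
}
```

The example shows why the choice matters: the same five verdicts yield three different scores, and majority voting collapses to a single bit per dimension, which would remove the graded spread the paper credits for better discrimination.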

Circularity Check

0 steps flagged

No circularity: empirical results validated against external human judgments on public benchmarks

The paper presents an empirical framework (BINEVAL) whose central claims rest on measured correlations with human annotations on established external benchmarks (SummEval, Topical-Chat, QAGS). No equations, derivations, fitted parameters, or predictions are defined in terms of the method's own outputs. The meta-prompt step for question generation is an unverified modeling assumption rather than a self-referential loop; performance is assessed by direct comparison to independent human scores, satisfying the criterion for non-circularity. No self-citation chains or renamings of known results are load-bearing for the reported superiority.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that LLMs can both generate and answer fine-grained binary evaluation questions in a way that produces human-aligned verdicts; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption: LLMs can generate appropriate atomic binary evaluation questions via a meta-prompt and can answer those questions independently and accurately enough to support calibrated aggregate scores.
This premise is required for the entire BINEVAL pipeline described in the abstract.

pith-pipeline@v0.9.1-grok · 5801 in / 1380 out tokens · 41491 ms · 2026-06-26T04:34:14.171112+00:00 · methodology

