Cherry-pick Override: Unsafe Directional Commitment in LLM Judges under Mixed Evidence

Haoran Xu

arXiv: 2606.07834 · v1 · pith:LVSTAX4Snew · submitted 2026-06-05 · 💻 cs.SE · cs.AI · cs.CL · cs.MA

Pith reviewed 2026-06-27 20:59 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.CL · cs.MA

keywords LLM judges · mixed evidence · conflicting claims · Cherry-pick Override · directional commitment · AVeriTeC · fact verification · verdict safety

The pith

LLM judges return directional verdicts on more than 84% of mixed-evidence claims even when CONFLICTING is the authorized option.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines Cherry-pick Override as the unauthorized choice of SUPPORTS or REFUTES on claims with both supporting and refuting sources when the schema lists CONFLICTING as the non-directional verdict. On AVeriTeC's Conflicting subset of 150 examples, three-option judges exhibit this behavior in over 84% of cases. The author applies a same-denominator diagnostic with matched-coverage bootstrap and a random-veto null, then tests an intervention ladder of typed vocabulary, panel aggregation, confidence thresholding, and validator filtering. Each fix leaves a distinct residual failure, which motivates the argument for an external commitment-control layer that treats structural evidence and confidence as orthogonal channels.
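To make the idea of a commitment-control layer concrete, here is a minimal sketch of a controller that keeps the two channels separate. It is not the paper's reference probe: the `Verdict` fields, the `authorize` function, the source-count representation of structural evidence, and the 0.7 threshold are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    label: str         # "SUPPORTS", "REFUTES", or "CONFLICTING"
    confidence: float  # judge-reported confidence in [0, 1]
    n_supporting: int  # sources marked as supporting the claim (assumed input)
    n_refuting: int    # sources marked as refuting the claim (assumed input)


def authorize(v: Verdict, min_confidence: float = 0.7) -> str:
    """Decide what the system commits to. The threshold is a hypothetical value."""
    if v.label == "CONFLICTING":
        return "CONFLICTING"
    # Structural channel: evidence on both sides vetoes a directional commit,
    # regardless of how confident the judge is.
    if v.n_supporting > 0 and v.n_refuting > 0:
        return "CONFLICTING"
    # Confidence channel: a weak directional verdict is routed, not committed.
    if v.confidence < min_confidence:
        return "NO-COMMIT"
    return v.label


confident_but_mixed = Verdict("SUPPORTS", 0.95, n_supporting=2, n_refuting=1)
weak_but_pure = Verdict("REFUTES", 0.55, n_supporting=0, n_refuting=3)
strong_and_pure = Verdict("SUPPORTS", 0.90, n_supporting=3, n_refuting=0)

for v in (confident_but_mixed, weak_but_pure, strong_and_pure):
    print(v.label, "->", authorize(v))
```

The design point is that neither check subsumes the other: the first example is stopped only by the structural channel, the second only by the confidence channel.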

Core claim

Under an explicit task contract that exposes CONFLICTING as the authorized non-directional verdict, three-option LLM judges return a directional verdict on more than 84% of mixed-evidence claims in AVeriTeC's Conflicting subset. Three-judge majority voting raises the rate from 0.840 to 0.887 on AVeriTeC, an amplification that does not replicate on VitaminC-Mixed. Panel aggregation suppresses single-judge CONFLICTING dissent in 48% of Cherry-pick Override (CCO) cases, the panel's calibration on pure cases prevents confidence from isolating CCO, and validator filtering nearly halves pure-evidence accuracy. A minimal two-channel reference probe achieves targeted promotion to CONFLICTING under the random-veto null on AVeriTeC with empirical p < 1/2001.

What carries the argument

Cherry-pick Override (CCO): the return of SUPPORTS or REFUTES on mixed evidence when CONFLICTING is the schema-authorized non-directional response.
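As a minimal sketch of how the headline rate could be computed from this definition, the snippet below counts directional verdicts over a fixed set of mixed-evidence claims. The function name `direction_on_conflict` and the verdict list are illustrative assumptions; the counts are synthetic and merely chosen so that the result lands at 0.840 on 150 claims.

```python
from collections import Counter

DIRECTIONAL = {"SUPPORTS", "REFUTES"}


def direction_on_conflict(verdicts):
    """Fraction of mixed-evidence claims that received a directional verdict."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    directional = sum(counts[label] for label in DIRECTIONAL)
    # The denominator is every claim in the subset, not only the committed ones.
    return directional / len(verdicts)


# Synthetic verdicts, not the paper's data: 150 claims, 126 of them directional.
verdicts = ["SUPPORTS"] * 70 + ["REFUTES"] * 56 + ["CONFLICTING"] * 24
rate = direction_on_conflict(verdicts)
print(f"direction-on-conflict = {rate:.3f}")
```

Keeping the full subset as the denominator is what makes rates from different judges or interventions directly comparable.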

If this is right

Three-judge majority voting increases direction-on-conflict from 0.840 to 0.887 on AVeriTeC but does not replicate on VitaminC-Mixed.
Panel aggregation suppresses single-judge CONFLICTING dissent in 48% of CCO cases.
The panel's ECE of 0.07 on pure-S/R cases means confidence thresholding cannot separate CCO from correct directional commits.
Validator-as-classifier nearly halves accuracy on pure-evidence cases.
A two-channel reference probe promotes to CONFLICTING in a structurally targeted manner under the random-veto null on AVeriTeC.
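The panel-aggregation consequences above can be illustrated with a small sketch of three-judge majority voting. The tie-breaking rule (a three-way tie falls back to CONFLICTING) and the reading of "suppressed dissent" as "the panel commits to a direction although at least one judge said CONFLICTING" are assumptions made here, not details confirmed by the paper. The four panels are synthetic.

```python
from collections import Counter

DIRECTIONAL = {"SUPPORTS", "REFUTES"}


def majority_vote(votes):
    """Most common label; ties fall back to CONFLICTING (an assumption)."""
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "CONFLICTING"
    return ranked[0][0]


def dissent_suppressed(votes):
    """True when the panel commits to a direction despite a CONFLICTING vote."""
    return majority_vote(votes) in DIRECTIONAL and "CONFLICTING" in votes


panels = [
    ["SUPPORTS", "SUPPORTS", "CONFLICTING"],     # dissent is outvoted
    ["REFUTES", "REFUTES", "REFUTES"],           # unanimous direction
    ["CONFLICTING", "CONFLICTING", "SUPPORTS"],  # panel returns CONFLICTING
    ["SUPPORTS", "REFUTES", "CONFLICTING"],      # three-way tie
]
panel_verdicts = [majority_vote(p) for p in panels]
cco_panels = [p for p in panels if majority_vote(p) in DIRECTIONAL]
suppressed = sum(dissent_suppressed(p) for p in cco_panels)
print(panel_verdicts)
print(f"{suppressed} of {len(cco_panels)} directional panels hid a CONFLICTING vote")
```

The first panel shows the mechanism: a single judge that used the authorized verdict disappears from the output once votes are aggregated.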

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Deployed fact-verification pipelines may need an explicit external authorization step before accepting any directional verdict.
The replication gap between AVeriTeC and VitaminC-Mixed suggests CCO severity depends on how evidence conflicts are distributed in a dataset.
The same override pattern could appear in LLM decision systems asked to take sides on contested policy or legal questions.
Training data that penalizes directional answers on mixed-evidence examples could be tested as a direct mitigation.

Load-bearing premise

The evaluation schema explicitly designates CONFLICTING as the only authorized non-directional verdict for mixed evidence, so any SUPPORTS or REFUTES counts as an unauthorized directional commitment.

What would settle it

Re-running the three-option judges on the identical AVeriTeC Conflicting subset after adding an explicit prompt instruction that directional answers are forbidden on mixed evidence and measuring whether the directional rate falls below 50%.

Figures

Figures reproduced from arXiv: 2606.07834 by Haoran Xu.

**Figure 1.** Apples-to-apples random Stage-1 control on both datasets. The controller’s Stage-1 promotes k of L3’s directional commits to CONFLICTING (k=16 on AVeriTeC, k=10 on VitaminC-Mixed); the null draws 2000 random subsets of the same size from L3’s commits and promotes them to CONFLICTING as well. (An earlier control that demoted commits to NO-COMMIT cannot change RecC mechanically; this fair control removes tha…
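The control described in the caption can be sketched as a permutation-style test. The code below is an illustration under stated assumptions: the statistic (how many promoted commits are truly mixed-evidence claims), the add-one p-value convention, and the synthetic pool of 200 commits are not taken from the paper. Only k=16 and the 2000 draws come from the caption.

```python
import random


def random_promotion_null(is_conflicting, promoted, n_draws=2000, seed=0):
    """
    is_conflicting: one bool per directional commit (True if the gold label
                    is CONFLICTING).
    promoted:       indices the controller promoted to CONFLICTING.
    Returns (observed hits, empirical p-value).
    """
    rng = random.Random(seed)
    k = len(promoted)
    observed = sum(is_conflicting[i] for i in promoted)
    indices = range(len(is_conflicting))
    at_least = 0
    for _ in range(n_draws):
        # Null: promote a random subset of the same size from the same commits.
        draw = rng.sample(indices, k)
        if sum(is_conflicting[i] for i in draw) >= observed:
            at_least += 1
    # Add-one convention (an assumption): the floor is 1 / (n_draws + 1).
    p_value = (1 + at_least) / (1 + n_draws)
    return observed, p_value


# Synthetic pool: 200 directional commits, 40 of them on truly mixed evidence.
pool_size, n_conflicting, k = 200, 40, 16
is_conflicting = [i < n_conflicting for i in range(pool_size)]
promoted = list(range(k))  # a perfectly targeted controller, for illustration
observed, p_value = random_promotion_null(is_conflicting, promoted)
print(observed, p_value)
```

Because the null promotes the same number of commits to the same label, any gap between the controller and the random draws reflects which commits were chosen, not how many.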

Original abstract

LLM judges increasingly turn verdicts into system commitments. Under mixed evidence (claims with both supporting and refuting sources) this is unsafe: when the schema exposes CONFLICTING as the authorized non-directional verdict, returning SUPPORTS/REFUTES is an unauthorized directional commitment, a failure we name Cherry-pick Override (CCO). We define CCO under an explicit task contract and report it with a same-denominator diagnostic protocol paired with matched-coverage bootstrap and an apples-to-apples random-veto null. On AVeriTeC's Conflicting subset (N_C = 150), three-option judges return a directional verdict on more than 84% of mixed-evidence claims; under the typed schema, three-judge majority voting amplifies direction-on-conflict on AVeriTeC (0.887 vs. 0.840; 95% CI [+0.013, +0.080]) but does not replicate on VitaminC-Mixed. Walking an intervention ladder of common single-channel fixes (typed vocabulary, panel aggregation, confidence thresholding, validator-only filtering), each leaves a distinct residual failure: panel aggregation suppresses single-judge CONFLICTING dissent in 48% of CCO cases; the panel is well-calibrated for direction (ECE = 0.07 on pure-S/R) so confidence cannot operationally separate CCO from correct directional commits; validator-as-classifier nearly halves pure-evidence accuracy. A minimal two-channel reference probe reaches operating points neither single channel reaches; under the random-veto null its promotion to CONFLICTING is structurally targeted on AVeriTeC (empirical p < 1/2001) and weaker but in the same direction on VitaminC-Mixed, a selectivity result rather than a magnitude one. We argue for an external commitment-control layer that separates verdict generation from commitment authorization, using structural evidence and confidence as orthogonal channels and NO-COMMIT as a routed controller state.
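The comparison of panel and single-judge rates on the same claims can be sketched with a paired percentile bootstrap. This is a plain textbook variant, and the paper's matched-coverage bootstrap may differ in ways not visible from the abstract. The per-claim flags are synthetic, chosen only so that the two rates come out at 0.840 and roughly 0.887 on 150 claims; the resulting interval is not expected to reproduce the reported one.

```python
import random


def paired_bootstrap_diff(single, panel, n_boot=2000, alpha=0.05, seed=0):
    """single, panel: per-claim 0/1 directional flags for the same claims, same order."""
    if len(single) != len(panel):
        raise ValueError("both systems must be scored on the same claims")
    rng = random.Random(seed)
    n = len(single)
    diffs = []
    for _ in range(n_boot):
        # Resample claims, keeping each claim's two flags together (paired).
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(panel[i] - single[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    point = (sum(panel) - sum(single)) / n
    return point, lo, hi


# Synthetic flags: the panel turns 7 non-directional verdicts into directional ones.
single = [1] * 126 + [0] * 24
panel = [1] * 133 + [0] * 17
point, lo, hi = paired_bootstrap_diff(single, panel)
print(f"difference = {point:+.3f}, 95% CI [{lo:+.3f}, {hi:+.3f}]")
```

Resampling claims rather than the two systems independently is what keeps the denominator identical across both rates in every replicate.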

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows LLM judges pick a side on mixed evidence over 84% of the time in one dataset and tests why common fixes fall short, but the 'unauthorized commitment' framing depends on an unshown prompt contract.

The main point is that three-option LLM judges return SUPPORTS or REFUTES on more than 84% of AVeriTeC conflicting claims instead of CONFLICTING, and majority voting can increase that rate while a two-channel probe shows some selectivity under a random-veto null. The work defines Cherry-pick Override under an explicit task contract and pairs it with a same-denominator protocol, matched bootstrap, and intervention ladder that leaves measurable residuals like 48% suppression of CONFLICTING dissent under panels.

What stands out as new is the named failure mode plus the apples-to-apples null model that lets them claim a selectivity result rather than just a magnitude one. The calibration check on pure evidence and the point that confidence thresholding cannot separate CCO from correct directional calls are practical observations. The paper also notes the result does not replicate on VitaminC-Mixed for the voting amplification.
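For readers unfamiliar with the calibration check mentioned here, the following is a minimal sketch of expected calibration error (ECE) with equal-width bins. The bin count of 10 and the toy confidences are assumptions; the abstract does not describe how the paper bins its confidences.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean gap between average confidence and accuracy per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; confidence 1.0 goes into the last bin.
        b = min(int(conf * n_bins), n_bins - 1)
        bins[b].append((conf, ok))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(conf for conf, _ in items) / len(items)
        accuracy = sum(ok for _, ok in items) / len(items)
        ece += len(items) / n * abs(avg_conf - accuracy)
    return ece


# Toy data: ten verdicts at 0.9 confidence (9 correct), ten at 0.6 (5 correct).
confidences = [0.9] * 10 + [0.6] * 10
correct = [1] * 9 + [0] + [1] * 5 + [0] * 5
ece = expected_calibration_error(confidences, correct)
print(f"ECE = {ece:.3f}")
```

A low ECE on pure-evidence cases says the confidence scores are trustworthy there, which is why a confident directional verdict on mixed evidence is hard to tell apart from a confident correct one by confidence alone.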

The soft spot is the load-bearing assumption that CONFLICTING is the authorized non-directional option under the actual prompts. The abstract gives no evidence that the task-contract language appears in the model instructions; if the prompt only lists the three options without usage rules, directional outputs are not contract violations. That weakens the safety-gap interpretation. Soundness is also hard to assess from the abstract alone: no methods, splits, or code are visible, so the 84% figure and the cross-dataset claims cannot be checked directly.

This is for teams running LLM judges in fact-checking or verification pipelines who need to handle conflicting sources. Readers focused on prompt engineering or multi-channel control might find the diagnostics useful. It deserves a serious referee because the empirical pattern on AVeriTeC is concrete enough to warrant checking, even if the contract framing needs tightening in revision.

Referee Report

2 major / 1 minor

Summary. The paper defines Cherry-pick Override (CCO) as the unauthorized return of directional verdicts (SUPPORTS/REFUTES) by three-option LLM judges on mixed-evidence claims when CONFLICTING is the authorized non-directional option under an explicit task contract. It reports >84% directional commitment on AVeriTeC's Conflicting subset (N_C=150), shows that majority voting amplifies this on AVeriTeC but not VitaminC-Mixed, evaluates an intervention ladder (typed vocabulary, panel aggregation, confidence thresholding, validator filtering), and proposes a two-channel reference probe plus external commitment-control layer using structural evidence and NO-COMMIT states.

Significance. If the task-contract assumption holds, the work identifies a systematic safety issue in LLM judges for evidence-based tasks, with the bootstrap resampling, random-veto null model, cross-dataset comparison, and intervention-ladder analysis providing concrete, falsifiable measurements. The selectivity result under the null model (p<1/2001 on AVeriTeC) is a notable strength.

major comments (2)

[Task Contract Definition and Prompt Construction] The central claim that directional outputs constitute CCO (unauthorized commitment) is load-bearing on the premise that the three-option prompts explicitly instruct selection of CONFLICTING precisely when evidence is mixed. The manuscript must include the exact prompt templates (likely Section 3 or Appendix A) showing the task-contract language; absent this, the 84% rate on the Conflicting subset measures prompt underspecification rather than contract violation.
[Empirical Results on Panel Aggregation] Results paragraph on majority voting: the reported amplification (0.887 vs. 0.840, 95% CI [+0.013, +0.080]) on AVeriTeC but non-replication on VitaminC-Mixed requires explicit comparison of how the Conflicting subsets are constructed and labeled in each dataset to establish that the difference is not an artifact of sampling or annotation protocol.

minor comments (1)

[Abstract] The abstract introduces terms such as 'typed schema' and 'validator-only filtering' without a one-sentence gloss; a brief parenthetical definition would improve accessibility before the intervention ladder is detailed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for identifying two areas where additional detail will strengthen the manuscript. We address each major comment below and commit to revisions that directly respond to the concerns raised.

Referee: [Task Contract Definition and Prompt Construction] The central claim that directional outputs constitute CCO (unauthorized commitment) is load-bearing on the premise that the three-option prompts explicitly instruct selection of CONFLICTING precisely when evidence is mixed. The manuscript must include the exact prompt templates (likely Section 3 or Appendix A) showing the task-contract language; absent this, the 84% rate on the Conflicting subset measures prompt underspecification rather than contract violation.

Authors: We agree that the exact prompt templates are required to substantiate the task-contract premise. The current manuscript defines CCO under an explicit contract but does not reproduce the full three-option prompts. In the revised manuscript we will add the complete prompt templates to Appendix A, with the CONFLICTING instruction language highlighted. This addition will demonstrate that the reported directional commitment rate measures violation of the stated contract rather than underspecification. revision: yes
Referee: [Empirical Results on Panel Aggregation] Results paragraph on majority voting: the reported amplification (0.887 vs. 0.840, 95% CI [+0.013, +0.080]) on AVeriTeC but non-replication on VitaminC-Mixed requires explicit comparison of how the Conflicting subsets are constructed and labeled in each dataset to establish that the difference is not an artifact of sampling or annotation protocol.

Authors: We accept that the cross-dataset comparison would benefit from an explicit side-by-side account of subset construction. Although the manuscript already reports the differing majority-voting outcomes and notes the datasets' distinct origins, it does not provide a dedicated comparison of labeling protocols. In the revision we will add a short subsection (or table) detailing the Conflicting-subset identification and annotation procedures for both AVeriTeC and VitaminC-Mixed, including sampling frames and any protocol differences. This will allow readers to evaluate whether the observed amplification difference is attributable to dataset construction. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical measurement against independent null

The paper defines CCO via an explicit task contract and reports rates on public datasets using bootstrap resampling plus an independently constructed random-veto null model. No equations, fitted parameters, or self-citations reduce the reported percentages or intervention results to the inputs by construction; the protocol remains externally falsifiable and self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that CONFLICTING is the authorized non-directional option and on empirical counts from two specific datasets; no free parameters are reported as fitted to produce the headline percentages.

axioms (1)

domain assumption: The task schema exposes CONFLICTING as the authorized non-directional verdict for mixed evidence.
Invoked to define directional returns as unauthorized commitments.

invented entities (1)

Cherry-pick Override (CCO): no independent evidence
purpose: To label the specific failure of directional commitment on mixed-evidence claims.
Conceptual naming of observed behavior; no independent falsifiable prediction supplied beyond the reported measurements.

pith-pipeline@v0.9.1-grok · 5890 in / 1353 out tokens · 27170 ms · 2026-06-27T20:59:59.872214+00:00 · methodology
