Cherry-pick Override: Unsafe Directional Commitment in LLM Judges under Mixed Evidence
Pith reviewed 2026-06-27 20:59 UTC · model grok-4.3
The pith
LLM judges return directional verdicts on more than 84% of mixed-evidence claims even when CONFLICTING is the authorized option.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under an explicit task contract that exposes CONFLICTING as the authorized non-directional verdict, three-option LLM judges return a directional verdict on more than 84% of mixed-evidence claims in AVeriTeC's Conflicting subset. Three-judge majority voting raises the rate from 0.840 to 0.887 on AVeriTeC, an amplification that does not replicate on VitaminC-Mixed. Each common single-channel fix leaves a residual failure: panel aggregation suppresses single-judge CONFLICTING dissent in 48% of Cherry-pick Override (CCO) cases, the panel's good calibration on pure cases means confidence cannot isolate CCO, and validator filtering nearly halves pure-evidence accuracy. A minimal two-channel reference probe achieves targeted promotion to CONFLICTING under the random-veto null on AVeriTeC, with empirical p < 1/2001.
What carries the argument
Cherry-pick Override (CCO): the return of SUPPORTS or REFUTES on mixed evidence when CONFLICTING is the schema-authorized non-directional response.
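The definition above reduces to a simple rate over mixed-evidence claims. A minimal sketch, assuming verdicts are plain strings and that every claim in the input has a gold label of CONFLICTING; the function name and the toy data are illustrative and not taken from the paper:

```python
# Labels that count as a directional commitment under the three-option schema.
DIRECTIONAL = {"SUPPORTS", "REFUTES"}


def direction_on_conflict(verdicts):
    """Fraction of mixed-evidence claims that received SUPPORTS or REFUTES.

    Assumes every verdict in `verdicts` was issued on a claim whose gold
    label is CONFLICTING, so each directional verdict is a CCO instance.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    return sum(v in DIRECTIONAL for v in verdicts) / len(verdicts)


# Made-up verdicts for five mixed-evidence claims (not data from the paper).
toy = ["SUPPORTS", "REFUTES", "CONFLICTING", "SUPPORTS", "REFUTES"]
print(direction_on_conflict(toy))  # 0.8
```

Keeping the denominator fixed to the mixed-evidence subset is what makes rates from different judges or interventions comparable.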
If this is right
- Three-judge majority voting increases direction-on-conflict from 0.840 to 0.887 on AVeriTeC but does not replicate on VitaminC-Mixed.
- Panel aggregation suppresses single-judge CONFLICTING dissent in 48% of CCO cases.
- The panel's ECE of 0.07 on pure-S/R cases means confidence thresholding cannot separate CCO from correct directional commits.
- Validator-as-classifier nearly halves accuracy on pure-evidence cases.
- A two-channel reference probe promotes to CONFLICTING in a structurally targeted manner under the random-veto null on AVeriTeC.
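The panel-aggregation points above can be illustrated with a small sketch. It assumes three-judge panels evaluated on mixed-evidence claims, a strict-majority rule, and a fallback to CONFLICTING when no label has a majority; the tie-break, the function names, and the toy panels are assumptions rather than the paper's protocol:

```python
from collections import Counter

DIRECTIONAL = {"SUPPORTS", "REFUTES"}


def majority_vote(votes):
    """Return the strict-majority label, else CONFLICTING (assumed tie-break)."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else "CONFLICTING"


def suppressed_dissent_share(panels):
    """Among panels whose majority verdict is directional (CCO at panel level),
    the share in which at least one judge voted CONFLICTING."""
    cco = [p for p in panels if majority_vote(p) in DIRECTIONAL]
    if not cco:
        return 0.0
    return sum("CONFLICTING" in p for p in cco) / len(cco)


# Made-up panels, all assumed to be on mixed-evidence claims.
panels = [
    ("SUPPORTS", "SUPPORTS", "CONFLICTING"),     # directional majority, dissent outvoted
    ("REFUTES", "REFUTES", "REFUTES"),           # directional, unanimous
    ("CONFLICTING", "CONFLICTING", "SUPPORTS"),  # panel returns CONFLICTING
    ("SUPPORTS", "REFUTES", "CONFLICTING"),      # no strict majority, falls back
]
print(suppressed_dissent_share(panels))  # 0.5
```

In the first toy panel, the one judge that chose the authorized verdict is outvoted, which is the suppression pattern the 48% figure refers to.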
Where Pith is reading between the lines
- Deployed fact-verification pipelines may need an explicit external authorization step before accepting any directional verdict.
- The replication gap between AVeriTeC and VitaminC-Mixed suggests CCO severity depends on how evidence conflicts are distributed in a dataset.
- The same override pattern could appear in LLM decision systems asked to take sides on contested policy or legal questions.
- Training data that penalizes directional answers on mixed-evidence examples could be tested as a direct mitigation.
Load-bearing premise
The evaluation schema explicitly designates CONFLICTING as the only authorized non-directional verdict for mixed evidence, so any SUPPORTS or REFUTES counts as an unauthorized directional commitment.
What would settle it
Re-running the three-option judges on the identical AVeriTeC Conflicting subset after adding an explicit prompt instruction that directional answers are forbidden on mixed evidence and measuring whether the directional rate falls below 50%.
Original abstract
LLM judges increasingly turn verdicts into system commitments. Under mixed evidence (claims with both supporting and refuting sources) this is unsafe: when the schema exposes CONFLICTING as the authorized non-directional verdict, returning SUPPORTS/REFUTES is an unauthorized directional commitment, a failure we name Cherry-pick Override (CCO). We define CCO under an explicit task contract and report it with a same-denominator diagnostic protocol paired with matched-coverage bootstrap and an apples-to-apples random-veto null. On AVeriTeC's Conflicting subset (N_C = 150), three-option judges return a directional verdict on more than 84% of mixed-evidence claims; under the typed schema, three-judge majority voting amplifies direction-on-conflict on AVeriTeC (0.887 vs. 0.840; 95% CI [+0.013, +0.080]) but does not replicate on VitaminC-Mixed. Walking an intervention ladder of common single-channel fixes (typed vocabulary, panel aggregation, confidence thresholding, validator-only filtering), each leaves a distinct residual failure: panel aggregation suppresses single-judge CONFLICTING dissent in 48% of CCO cases; the panel is well-calibrated for direction (ECE = 0.07 on pure-S/R) so confidence cannot operationally separate CCO from correct directional commits; validator-as-classifier nearly halves pure-evidence accuracy. A minimal two-channel reference probe reaches operating points neither single channel reaches; under the random-veto null its promotion to CONFLICTING is structurally targeted on AVeriTeC (empirical p < 1/2001) and weaker but in the same direction on VitaminC-Mixed, a selectivity result rather than a magnitude one. We argue for an external commitment-control layer that separates verdict generation from commitment authorization, using structural evidence and confidence as orthogonal channels and NO-COMMIT as a routed controller state.
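The closing argument of the abstract, separating verdict generation from commitment authorization, can be sketched as a small routing function. This is an illustration under stated assumptions, not the paper's reference probe: the `JudgeOutput` type, the `authorize` function, the stance counts, and the 0.8 threshold are all hypothetical, and the stance counts are assumed to come from an upstream evidence labeller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeOutput:
    verdict: str       # "SUPPORTS", "REFUTES", or "CONFLICTING"
    confidence: float  # judge-reported confidence in [0, 1]


def authorize(output, n_supporting, n_refuting, min_confidence=0.8):
    """Decide whether a generated verdict may become a system commitment.

    Two channels are checked independently (both are assumptions here):
    - structural: does the evidence contain both stances?
    - confidence: is the judge confident enough in its verdict?
    """
    if output.verdict == "CONFLICTING":
        return "CONFLICTING"  # non-directional verdicts pass through
    if n_supporting > 0 and n_refuting > 0:
        return "NO-COMMIT"    # structural channel blocks a directional commit
    if output.confidence < min_confidence:
        return "NO-COMMIT"    # confidence channel blocks a directional commit
    return output.verdict


# A confident SUPPORTS on mixed evidence is still not authorized.
print(authorize(JudgeOutput("SUPPORTS", 0.95), n_supporting=3, n_refuting=1))
```

The point of the design is that high confidence alone never authorizes a directional verdict when the structural channel reports mixed evidence, which matches the abstract's observation that confidence cannot separate CCO from correct commits.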
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper defines Cherry-pick Override (CCO) as the unauthorized return of directional verdicts (SUPPORTS/REFUTES) by three-option LLM judges on mixed-evidence claims when CONFLICTING is the authorized non-directional option under an explicit task contract. It reports >84% directional commitment on AVeriTeC's Conflicting subset (N_C=150), shows that majority voting amplifies this on AVeriTeC but not VitaminC-Mixed, evaluates an intervention ladder (typed vocabulary, panel aggregation, confidence thresholding, validator filtering), and proposes a two-channel reference probe plus external commitment-control layer using structural evidence and NO-COMMIT states.
Significance. If the task-contract assumption holds, the work identifies a systematic safety issue in LLM judges for evidence-based tasks, with the bootstrap resampling, random-veto null model, cross-dataset comparison, and intervention-ladder analysis providing concrete, falsifiable measurements. The selectivity result under the null model (p<1/2001 on AVeriTeC) is a notable strength.
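To make the selectivity claim concrete, a random-veto null can be sketched as a resampling test: compare how many of the probe's vetoes land on mixed-evidence claims with how many land there when the same number of vetoes is placed at random. The sketch below is one plausible reading, not the paper's procedure; the function name, the add-one p-value estimator, the default of 2000 draws, and the toy data are assumptions.

```python
import random


def random_veto_p_value(is_mixed, vetoed, n_draws=2000, seed=0):
    """Resampling p-value for 'vetoes hit mixed-evidence claims more than chance'.

    is_mixed: list of bools, one per directional verdict in the pool.
    vetoed:   indices the probe promoted to CONFLICTING.
    Null:     the same number of vetoes placed uniformly at random.
    """
    rng = random.Random(seed)
    k = len(vetoed)
    observed = sum(is_mixed[i] for i in vetoed)
    indices = range(len(is_mixed))
    at_least = 0
    for _ in range(n_draws):
        sample = rng.sample(indices, k)
        if sum(is_mixed[i] for i in sample) >= observed:
            at_least += 1
    # Add-one estimator: the smallest reportable value is 1 / (n_draws + 1).
    return observed, (1 + at_least) / (1 + n_draws)


# Made-up pool: 10 mixed-evidence claims among 100 directional verdicts.
is_mixed = [True] * 10 + [False] * 90
vetoed = list(range(8))  # a perfectly targeted toy probe
observed, p = random_veto_p_value(is_mixed, vetoed)
print(observed, p)
```

Because the null uses the same veto budget as the probe, a small p-value says the vetoes are well placed, not that there are many of them, which is the sense in which the result is about selectivity rather than magnitude.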
major comments (2)
- [Task Contract Definition and Prompt Construction] The central claim that directional outputs constitute CCO (unauthorized commitment) is load-bearing on the premise that the three-option prompts explicitly instruct selection of CONFLICTING precisely when evidence is mixed. The manuscript must include the exact prompt templates (likely Section 3 or Appendix A) showing the task-contract language; absent this, the 84% rate on the Conflicting subset measures prompt underspecification rather than contract violation.
- [Empirical Results on Panel Aggregation] Results paragraph on majority voting: the reported amplification (0.887 vs. 0.840, 95% CI [+0.013, +0.080]) on AVeriTeC but non-replication on VitaminC-Mixed requires explicit comparison of how the Conflicting subsets are constructed and labeled in each dataset to establish that the difference is not an artifact of sampling or annotation protocol.
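The interval quoted in the second major comment is the kind of quantity a paired bootstrap over claims produces. A minimal sketch, assuming per-claim 0/1 flags (1 = directional verdict) for a single judge and for the panel on the same claims, and a percentile interval; the function name and the per-claim pairing in the toy data are invented, with only the counts chosen to match the reported rates of 0.840 and 0.887 at N_C = 150:

```python
import random


def paired_bootstrap_ci(single, panel, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the difference in direction-on-conflict (panel - single).

    Claims are resampled with replacement, and both systems are scored on the
    same resampled claims, so the pairing is preserved.
    """
    if len(single) != len(panel):
        raise ValueError("inputs must cover the same claims")
    rng = random.Random(seed)
    n = len(single)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(panel[i] - single[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[max(0, round((alpha / 2) * n_boot))]
    hi = diffs[min(n_boot - 1, round((1 - alpha / 2) * n_boot) - 1)]
    point = (sum(panel) - sum(single)) / n
    return point, lo, hi


# Toy flags: 126/150 directional for the single judge, 133/150 for the panel.
single = [1] * 126 + [0] * 24
panel = [1] * 133 + [0] * 17
point, lo, hi = paired_bootstrap_ci(single, panel)
print(round(point, 3), lo, hi)
```

Running the same function on each dataset's Conflicting subset, with the subset construction documented side by side, would let a reader judge whether the non-replication on VitaminC-Mixed reflects the judges or the sampling.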
minor comments (1)
- [Abstract] The abstract introduces terms such as 'typed schema' and 'validator-only filtering' without a one-sentence gloss; a brief parenthetical definition would improve accessibility before the intervention ladder is detailed.
Simulated Author's Rebuttal
We thank the referee for the constructive review and for identifying two areas where additional detail will strengthen the manuscript. We address each major comment below and commit to revisions that directly respond to the concerns raised.
- Referee: [Task Contract Definition and Prompt Construction] The central claim that directional outputs constitute CCO (unauthorized commitment) is load-bearing on the premise that the three-option prompts explicitly instruct selection of CONFLICTING precisely when evidence is mixed. The manuscript must include the exact prompt templates (likely Section 3 or Appendix A) showing the task-contract language; absent this, the 84% rate on the Conflicting subset measures prompt underspecification rather than contract violation.
  Authors: We agree that the exact prompt templates are required to substantiate the task-contract premise. The current manuscript defines CCO under an explicit contract but does not reproduce the full three-option prompts. In the revised manuscript we will add the complete prompt templates to Appendix A, with the CONFLICTING instruction language highlighted. This addition will demonstrate that the reported directional commitment rate measures violation of the stated contract rather than underspecification.
  Revision: yes
- Referee: [Empirical Results on Panel Aggregation] Results paragraph on majority voting: the reported amplification (0.887 vs. 0.840, 95% CI [+0.013, +0.080]) on AVeriTeC but non-replication on VitaminC-Mixed requires explicit comparison of how the Conflicting subsets are constructed and labeled in each dataset to establish that the difference is not an artifact of sampling or annotation protocol.
  Authors: We accept that the cross-dataset comparison would benefit from an explicit side-by-side account of subset construction. Although the manuscript already reports the differing majority-voting outcomes and notes the datasets' distinct origins, it does not provide a dedicated comparison of labeling protocols. In the revision we will add a short subsection (or table) detailing the Conflicting-subset identification and annotation procedures for both AVeriTeC and VitaminC-Mixed, including sampling frames and any protocol differences. This will allow readers to evaluate whether the observed amplification difference is attributable to dataset construction.
  Revision: yes
Circularity Check
No circularity; empirical measurement against independent null
The paper defines CCO via an explicit task contract and reports rates on public datasets using bootstrap resampling plus an independently constructed random-veto null model. No equations, fitted parameters, or self-citations reduce the reported percentages or intervention results to the inputs by construction; the protocol remains externally falsifiable and self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The task schema exposes CONFLICTING as the authorized non-directional verdict for mixed evidence.
invented entities (1)
- Cherry-pick Override (CCO): no independent evidence