pith. machine review for the scientific record.

arxiv: 2605.04083 · v1 · submitted 2026-04-15 · 💻 cs.LG · cs.AI

Recognition: unknown

AsymmetryZero: A Framework for Operationalizing Human Expert Preferences as Semantic Evals

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 13:05 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords semantic evals · evaluation contracts · model juries · expert preferences · post-training · RL evaluation · compact juries · frontier juries

The pith

Human expert preferences can be operationalized as explicit semantic evaluation contracts where compact model juries achieve high agreement with frontier juries at much lower cost and latency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework for representing tasks as stable evaluation contracts that specify what gets graded, how each criterion is judged, and how criterion decisions combine into an overall outcome. These contracts enable consistent scoring across different model juries for use in post-training evaluations. The authors compare a frontier jury of five advanced models against a compact jury of five smaller models on four solvers while keeping contracts fixed. They observe criterion-level agreement between 75.9 and 89.6 percent and higher dissent within the compact jury, but at dramatically lower cost and latency, with task-level outcomes remaining stable. This matters for building evaluations that capture subjective expert requirements in reinforcement learning without relying on vague preference signals.
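
The contract format itself is not reproduced in the material above. As a rough illustration of the kind of structure being described, the Python sketch below assumes a contract that names each criterion, records the rubric a juror applies to it, and labels an aggregation rule; every field name here is a hypothetical stand-in, not the authors' schema.

```python
# Hypothetical sketch of an evaluation contract record. Field names and
# structure are illustrative assumptions, not the paper's actual schema.
from dataclasses import dataclass, field


@dataclass
class Criterion:
    criterion_id: str              # what is being graded
    rubric: str                    # how a juror should judge this criterion
    require_evidence: bool = True  # whether the juror must cite supporting evidence


@dataclass
class EvaluationContract:
    task_id: str
    description: str                        # the task handed to the solver
    criteria: list[Criterion] = field(default_factory=list)
    aggregation: str = "all_criteria_pass"  # named rule for combining criterion verdicts


# A tiny two-criterion contract, purely for illustration.
contract = EvaluationContract(
    task_id="summarize-earnings-call",
    description="Summarize the call transcript for an equity analyst.",
    criteria=[
        Criterion("covers-guidance", "Pass only if the revised guidance figures are stated."),
        Criterion("no-speculation", "Fail if the summary asserts claims absent from the transcript."),
    ],
)
```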

Core claim

By fixing the evaluation contracts and testing frontier versus compact juries, the work shows that criterion-level agreement reaches 75.9 to 89.6 percent and strict agreement 77.8 to 92.1 percent, with compact juries using 4.2 to 5.6 percent of the cost and 21.7 to 27.1 percent of the latency while keeping aggregated outcomes stable. This supports the idea that expert preferences can be faithfully encoded into semantic evals via explicit contracts.
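
The review above does not show how these percentages are computed. One plausible reading, sketched below, treats criterion-level agreement as the share of criteria on which the two juries' verdicts coincide, strict common-subset agreement as the same quantity restricted to criteria where both juries return a usable verdict, and the 3-2 split rate as the fraction of criteria on which a five-member jury divides three against two. The function names and exact definitions are assumptions, not the paper's.

```python
# Hypothetical reconstruction of the jury-comparison metrics; the exact
# definitions used in the paper may differ.
from collections import Counter


def jury_verdict(votes: list[bool]) -> bool | None:
    """Majority verdict of one jury on one criterion; None if the jury
    failed to return a usable verdict (e.g. a tie or missing grades)."""
    if not votes:
        return None
    tally = Counter(votes)
    if tally[True] == tally[False]:
        return None
    return tally[True] > tally[False]


def criterion_agreement(frontier: list[bool | None],
                        compact: list[bool | None]) -> float:
    """Share of all criteria on which the two juries' verdicts coincide."""
    return sum(f == c for f, c in zip(frontier, compact)) / len(frontier)


def strict_common_subset_agreement(frontier: list[bool | None],
                                   compact: list[bool | None]) -> float:
    """Agreement restricted to criteria where both juries produced a verdict."""
    decided = [(f, c) for f, c in zip(frontier, compact)
               if f is not None and c is not None]
    return sum(f == c for f, c in decided) / len(decided)


def three_two_split_rate(jury_votes: list[list[bool]]) -> float:
    """Fraction of criteria on which a five-member jury splits 3-2."""
    splits = sum(1 for votes in jury_votes
                 if len(votes) == 5 and min(Counter(votes).values()) == 2)
    return splits / len(jury_votes)
```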

What carries the argument

The evaluation contract, which explicitly defines the grading criteria, the judgment process for each, and the rules for aggregating criterion-level decisions into a task outcome, serves as the fixed structure executed by model juries.
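
How criterion-level decisions roll up into the task outcome is part of that fixed structure, but the specific rule is not given in the material above. The sketch below uses one illustrative choice, a per-criterion jury majority followed by a conjunctive task verdict; the rule and the vote encoding are assumptions, not the authors' method.

```python
# One illustrative aggregation rule (an assumption, not the paper's): each
# criterion passes on a jury majority, and the task passes only if every
# criterion does.
def criterion_verdict(juror_votes: list[bool]) -> bool:
    """True when a strict majority of jurors vote pass."""
    return sum(juror_votes) * 2 > len(juror_votes)


def task_outcome(votes_by_criterion: dict[str, list[bool]]) -> bool:
    """Conjunctive roll-up: the task passes only if all criteria pass."""
    return all(criterion_verdict(v) for v in votes_by_criterion.values())


# Three criteria, each judged by a five-model jury.
votes = {
    "covers-guidance": [True, True, True, False, True],
    "no-speculation":  [True, True, False, True, True],
    "cites-sources":   [True, False, True, True, True],
}
print(task_outcome(votes))  # True: every criterion clears a 4-1 majority
```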

If this is right

  • Frontier and compact juries agree on 75.9 to 89.6 percent of criterion judgments.
  • Compact juries display higher internal dissent with 28.7 to 32.4 percent three-to-two splits versus 6.1 to 11.5 percent for frontier juries.
  • Per-criterion judging costs for compact juries fall to 4.2 to 5.6 percent of frontier jury costs.
  • Latency for compact juries is reduced to 21.7 to 27.1 percent of frontier levels.
  • Task-level outcomes remain comparatively stable even with jury differences.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Designing stable contracts upfront could decouple evaluation quality from the scale of the judging models.
  • Disagreements within compact juries might serve as signals for refining contract criteria in future iterations.
  • Similar contract-based approaches could apply to other fields requiring consistent application of expert judgment, such as legal review or quality assurance.

Load-bearing premise

Human expert preferences can be faithfully captured in stable, explicit evaluation contracts without significant information loss or introduction of new biases.

What would settle it

Finding that human experts, when applying the contracts themselves to sample tasks, produce judgments that systematically differ from those of either jury on a large fraction of criteria would falsify the claim of faithful capture.
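
One concrete form such a test could take (not a protocol the paper runs) is to have experts apply the fixed contracts to a sample of tasks and measure how often their criterion verdicts depart from each jury's majority; a large divergence concentrated on particular criteria would be the falsifying signal. A minimal sketch, with all names hypothetical:

```python
# Hypothetical falsification check, not an experiment reported in the paper:
# compare expert criterion verdicts against a jury's majority verdicts.
from collections import Counter, defaultdict


def jury_majority(votes: list[bool]) -> bool:
    return sum(votes) * 2 > len(votes)


def human_vs_jury_disagreement(records):
    """`records` holds (criterion_id, expert_verdict, jury_votes) triples.

    Returns the overall disagreement rate plus a per-criterion breakdown,
    so systematic divergence on particular criteria is visible.
    """
    total = disagreements = 0
    per_criterion = defaultdict(Counter)
    for criterion_id, expert_verdict, jury_votes in records:
        total += 1
        mismatch = expert_verdict != jury_majority(jury_votes)
        disagreements += mismatch
        per_criterion[criterion_id]["disagree" if mismatch else "agree"] += 1
    return disagreements / total, dict(per_criterion)
```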

Figures

Figures reproduced from arXiv: 2605.04083 by Kyle Waters, Lucas Nuzzi, Steven Dillmann, Tadhg Looram.

Figure 1
Figure 1. A shallow ontology of LLM-as-a-Judge: judgment interface (rating vs. ranking), evaluation contract (free-form vs. rubric-based), judge architecture (single vs. ensemble), and robustness strategies (bias auditing, protocol design).
Figure 2
Figure 2. Measuring failure modes: hybrid metrics and systematic biases.
Figure 3
Figure 3. Architecture flow for AsymmetryZero. The same evaluation contract is applied to both model-only (Inspect) and agentic (Harbor) executions, yielding comparable scores plus auditable grading artifacts.
Original abstract

Much of the focus in RL today is on evaluation design: building meaningful evals that serve simultaneously as benchmarks and as well-defined reward signals for post-training. Yet, many real-world tasks are governed by subjective, procedural, and domain-specific requirements that are difficult to encode as exact-match targets or open-ended preference judgments frequently used in RL pipelines today. In this work, we present AsymmetryZero, a framework for operationalizing human expert preferences as semantic evals. AsymmetryZero represents each task as a stable evaluation contract that makes grading criteria explicit: what is being graded, how each criterion is judged, and how criterion-level decisions are aggregated into a task outcome. The same contract can be executed using Inspect for model-only evaluations, as well as the Harbor Framework for agentic evaluations, enabling comparable scores and shared audit artifacts across both settings. We argue that the central challenge in post-training today is the faithful encoding of expert requirements into the evaluation itself. To that end, we present a study using Harbor that holds task contracts fixed and compares a five-model frontier jury against a five-model compact jury across four frontier-class solvers (Claude Opus 4.6, GPT-5.4, Grok-4.20, Gemini-3.1-Pro). We find that criterion-level frontier-vs-compact agreement ranges from $75.9\%$ to $89.6\%$ (strict common-subset agreement: $77.8\%$ to $92.1\%$), while compact juries exhibit substantially higher internal dissent (3--2 split rate $28.7\%$--$32.4\%$) than frontier juries ($6.1\%$--$11.5\%$). Verifier traces further show that compact juries reduce per-criterion judging cost to roughly $4.2\%$--$5.6\%$ of frontier and latency to roughly $21.7\%$--$27.1\%$, even as aggregated task-level outcomes often remain comparatively stable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces AsymmetryZero, a framework that encodes tasks as explicit evaluation contracts specifying grading criteria, judgment procedures, and aggregation rules to operationalize human expert preferences as semantic evaluations usable with both model-only (Inspect) and agentic (Harbor) setups. It reports an empirical study holding contracts fixed while comparing a five-model frontier jury to a five-model compact jury across four frontier solvers, finding criterion-level agreement of 75.9–89.6% (strict common-subset 77.8–92.1%), higher internal dissent in compact juries (28.7–32.4% 3–2 splits vs. 6.1–11.5%), and compact-jury cost/latency reductions to roughly 4.2–5.6% and 21.7–27.1% of frontier values, with task-level outcomes remaining comparatively stable.

Significance. If the contracts prove to be faithful encodings of expert preferences, the framework offers a route to more auditable, reproducible, and cost-efficient semantic evaluations that could serve as both benchmarks and reward signals in post-training. The reported cost and latency savings, together with the fixed-contract design that enables direct jury comparisons, are concrete strengths that would be valuable if the fidelity claim is substantiated.

major comments (2)
  1. [Abstract and empirical study] Abstract and the empirical study: the central claim that evaluation contracts faithfully operationalize human expert preferences without significant information loss or new biases is not directly tested. The study measures only inter-model agreement and cost/latency deltas between frontier and compact juries on fixed contracts; it reports no human-expert ratings of the same contracts, no inter-rater reliability between humans and contract-derived judgments, and no ablation of contract wording against original expert intent. This leaves the proxy validity of model-jury agreement unestablished.
  2. [Abstract] Abstract: specific numerical claims (agreement ranges, 3–2 split rates, cost reductions to 4.2–5.6% and latency to 21.7–27.1%) are presented without accompanying details on contract construction process, data exclusion rules, statistical significance testing, or full methodology, which undercuts the ability to assess the robustness of the reported agreement and stability results.
minor comments (1)
  1. The paper would benefit from an explicit section or appendix detailing the contract template, example contracts, and the exact aggregation rule used to produce task-level outcomes from criterion decisions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major point below, clarifying the scope of the empirical study and committing to revisions where appropriate to improve clarity and precision.

Point-by-point responses
  1. Referee: [Abstract and empirical study] Abstract and the empirical study: the central claim that evaluation contracts faithfully operationalize human expert preferences without significant information loss or new biases is not directly tested. The study measures only inter-model agreement and cost/latency deltas between frontier and compact juries on fixed contracts; it reports no human-expert ratings of the same contracts, no inter-rater reliability between humans and contract-derived judgments, and no ablation of contract wording against original expert intent. This leaves the proxy validity of model-jury agreement unestablished.

    Authors: We agree with this assessment. The empirical study is specifically designed to hold evaluation contracts fixed and compare the behavior of frontier versus compact model juries in terms of agreement rates, internal consistency, and computational cost. It does not include direct human validation or ablation studies against original expert intent. The claim that the framework operationalizes human preferences is supported by the explicit contract design and its executability in both model-only and agentic settings, but the reported numbers reflect model-model agreement rather than human-model fidelity. We will revise the abstract to explicitly state that the study evaluates inter-jury agreement and efficiency under fixed contracts, without claiming direct empirical validation of preference encoding fidelity. This will better align the abstract with the study's actual contributions. revision: yes

  2. Referee: [Abstract] Abstract: specific numerical claims (agreement ranges, 3–2 split rates, cost reductions to 4.2–5.6% and latency to 21.7–27.1%) are presented without accompanying details on contract construction process, data exclusion rules, statistical significance testing, or full methodology, which undercuts the ability to assess the robustness of the reported agreement and stability results.

    Authors: The abstract is intended as a high-level summary, with full methodological details—including contract construction, model specifications, aggregation procedures, and any data handling—provided in the dedicated Experimental Setup and Results sections of the manuscript. We did not perform formal statistical significance testing on the agreement metrics, as the study reports descriptive comparisons across fixed contracts rather than inferential statistics. To improve accessibility, we will revise the abstract to briefly reference the methodology and direct readers to the full paper for details on contract construction and data processing. We can also note the descriptive nature of the results. revision: partial

Circularity Check

0 steps flagged

The framework definition and the empirical jury comparisons are independent measurements: the reported quantities are measured outputs, not restatements of the framework's own inputs.

full rationale

The paper defines AsymmetryZero as an explicit evaluation contract framework and reports direct empirical results from a fixed-contract study: criterion-level agreement rates (75.9–89.6%), internal dissent splits, and cost/latency deltas between frontier and compact model juries. These quantities are measured outputs, not derived from or equivalent to the framework definition. No equations, fitted parameters presented as predictions, self-citations, or ansatzes appear in the load-bearing steps. The central claims rest on observable agreement and resource metrics rather than tautological re-labeling or self-referential justification.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on the domain assumption that expert preferences are decomposable into stable explicit criteria; no free parameters or new physical entities are introduced.

axioms (1)
  • domain assumption Human expert preferences for complex tasks can be operationalized into stable, explicit criteria, judgment rules, and aggregation functions without substantial loss of fidelity.
    This is the foundational premise for the evaluation contract concept described in the abstract.
invented entities (1)
  • Evaluation contract (no independent evidence)
    purpose: To make grading criteria, judgment methods, and aggregation explicit for semantic evaluations
    Core new construct of the AsymmetryZero framework.

pith-pipeline@v0.9.0 · 5681 in / 1355 out tokens · 29888 ms · 2026-05-10T13:05:03.519974+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

25 extracted references · 8 canonical work pages

  1. [1] Chen, Guiming Hardy; Chen, Shunian; Liu, Ziche; Jiang, Feng; Wang, Benyou (2024). Humans or LLMs as the Judge? A Study on Judgement Bias. EMNLP 2024. doi:10.18653/v1/2024.emnlp-main.474
  2. [2] Fu, Jinlan; Ng, See-Kiong; Jiang, Zhengbao; Liu, Pengfei (2024). GPTScore: Evaluate as You Desire. NAACL 2024. doi:10.18653/v1/2024.naacl-long.365
  3. [3] Gu, Jiawei; Jiang, Xuhui; Shi, Zhichao; Tan, Hexiang; Zhai, Xuehao; Xu, Chengjin; Li, Wei; Shen, Yinghan; Ma, Shengjie; Liu, Honghao; Wang, Saizhuo; Zhang, Kun; Wang, Yuanzhuo; Gao, Wen; Ni, Lionel; Guo, Jian (2024). A Survey on LLM-as-a-Judge.
  4. [4] Hashemi, Helia; Eisner, Jason; Rosset, Corby; Van Durme, Benjamin; Kedzie, Chris (2024). LLM-Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts. ACL 2024. doi:10.18653/v1/2024.acl-long.745
  5. [5] Prometheus: Inducing Fine-Grained Evaluation Capability in Language Models. International Conference on Learning Representations.
  6. [6] Large Language Models Are State-of-the-Art Evaluators of Translation Quality. Proceedings of the 24th Annual Conference of the European Association for Machine Translation.
  7. [7] Benchmarking Cognitive Biases in Large Language Models as Evaluators. Findings of the Association for Computational Linguistics: ACL 2024. doi:10.18653/v1/2024.findings-acl.29
  8. [8] Generative Judge for Evaluating Alignment. International Conference on Learning Representations.
  9. [9] Li, Xiaochuan; Wang, Ke; Gouda, Girija Manash; Choudhary, Shubham; Wang, Yaqun; Hu, Linwei; Vaughan, Joel; Lecue, Freddy (2025). Who Judges the Judge?
  10. [10] Lin, Chin-Yew (2004). ROUGE: A Package for Automatic Evaluation of Summaries.
  11. [11] Liu, Yang; Iter, Dan; Xu, Yichong; Wang, Shuohang; Xu, Ruochen; Zhu, Chenguang (2023). G-Eval: NLG Evaluation Using GPT-4 with Better Human Alignment. EMNLP 2023. doi:10.18653/v1/2023.emnlp-main.153
  12. [12] Panickssery, Arjun; Bowman, Samuel R.; Feng, Shi (2024). LLM Evaluators Recognize and Favor Their Own Generations.
  13. [13] Papineni, Kishore; Roukos, Salim; Ward, Todd; Zhu, Wei-Jing (2002). BLEU: A Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. doi:10.3115/1073083.1073135
  14. [14] HybridEval: An Improved Novel Hybrid Metric for Evaluation of Text Summarization (2024). Journal of Informatics and Web Engineering. doi:10.33093/jiwe.2024.3.3.15
  15. [15] Shi, Lin; Ma, Chiyu; Liang, Wenhua; Diao, Xingjian; Ma, Weicheng; Vosoughi, Soroush (2025). Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge.
  16. [16] Spiliopoulou, Evangelia; Fogliato, Riccardo; Burnsky, Hanna; Soliman, Tamer; Ma, Jie; Horwood, Graham; Ballesteros, Miguel (2025). Play Favorites: A Statistical Method to Measure Self-Bias in LLM-as-a-Judge.
  17. [17] Large Language Models are Inconsistent and Biased Evaluators (2024).
  18. [18] Verga, Pat; Hofstatter, Sebastian; Althammer, Sophia; Su, Yixuan; Piktus, Aleksandra; Arkhangorodsky, Arkady; Xu, Minjie; White, Naomi; Lewis, Patrick (2024). Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models.
  19. [19] Vidgen, Bertie; Mann, Austin; Fennelly, Abby; Stanly, John Wright; Rothman, Lucas; Burstein, Marco; Benchek, Julien; Ostrofsky, David; Ravichandran, Anirudh; Sur, Debnil; Venugopal, Neel; Hsia, Alannah; Robinson, Isaac; Huang, Calix; Varones, Olivia; Khan, Daniyal; Haines, Michael; Bridges, Austin; Boy... (2026).
  20. [20] Wang, Yidong; Yu, Zhuohao; Yao, Wenjin; Zeng, Zhengran; Yang, Linyi; Wang, Cunxiang; Chen, Hao; Jiang, Chaoya; Xie, Rui; Wang, Jindong; Xie, Xing; Ye, Wei; Zhang, Shikun; Zhang, Yue (2024).
  21. [21] Ye, Jiayi; Wang, Yanbo; Huang, Yue; Chen, Dongping; Zhang, Qihui; Moniz, Nuno; Gao, Tian; Geyer, Werner; Huang, Chao; Chen, Pin-Yu; Chawla, Nitesh V; Zhang, Xiangliang (2025). Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge.
  22. [22] Yuan, Weizhe; Neubig, Graham; Liu, Pengfei (2021). BARTScore: Evaluating Generated Text as Text Generation.
  23. [23] Zheng, Lianmin; Chiang, Wei-Lin; Sheng, Ying; Zhuang, Siyuan; Wu, Zhanghao; Zhuang, Yonghao; Lin, Zi; Li, Zhuohan; Li, Dacheng; Xing, Eric P.; Zhang, Hao; Gonzalez, Joseph E.; Stoica, Ion (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.
  24. [24] Towards a Unified Multi-Dimensional Evaluator for Text Generation (2022). Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. doi:10.18653/v1/2022.emnlp-main.131
  25. [25] Zhu, Lianghui; Wang, Xinggang; Wang, Xinlong (2025). JudgeLM: Fine-tuned Large Language Models are Scalable Judges.