
arxiv: 2605.27914 · v2 · pith:HB5AJJODnew · submitted 2026-05-27 · 💻 cs.CL · cs.AI

Does Capability Transfer to Subjective Behavior -- and Would Our Instruments Tell Us? A Self-Evolving, Trust-by-Construction Evaluation Paradigm

Pith reviewed 2026-06-29 12:51 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords LLM evaluation, subjective behavior, capability transfer, self-evolving benchmark, advice restraint, scaling dissociation, trust-by-construction, anti-gaming fitness

The pith

Capability that scales on objective benchmarks does not transfer to subjective behaviors in large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether scaling on verifiable tasks like math and code carries over to subjective, human-facing uses such as companionship and emotional support. It builds a self-evolving instrument that generates its own behavioral dimensions under an anti-gaming fitness function and stops when gains cease. The instrument operates under a trust-by-construction approach that establishes three certificates without any human gold standard. Applied across 49 models from 8 families over 24 months, it shows that subjective behaviors form a separate regime: objective scaling does not predict them. The clearest dissociation appears in advice-restraint, which ranks lowest at the frontier and regressed between GPT-4.1 and GPT-5 even as aggregate scores rose.

Core claim

Capability transfer is dissociable. Across 49 models, 8 families, and 24 months, subjective behaviors are where objective-benchmark scaling fails to carry over: the sharpest case, advice-restraint (knowing when not to give advice), is the frontier's universal-lowest dimension, and at gpt-4.1->gpt-5 it ran backwards while the aggregate score hid it -- a regression one instruction recovers. Warm restraint is moved by model generation, not by raw scale, MoE width, inference budget, or reasoning mode; the open-weight Pareto frontier matches closed flagships at ~10-80x lower per-call cost; and four judge families replicate the rubric on held-out human ESConv conversations.

What carries the argument

A self-evolving instrument that selects and authors its own behavioral dimensions under multiplicative anti-gaming fitness, paired with a trust-by-construction paradigm that earns validity through three certificates established without a human gold standard.
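
The abstract does not spell out the fitness function, so the following is only a minimal sketch of why a multiplicative form resists gaming and how a self-halting rule could work. The component names (`discrimination`, `judge_agreement`, `non_saturation`), the example scores, and the `min_gain` threshold are all assumptions made for illustration, not the authors' definitions.

```python
from math import prod

# Hypothetical fitness components in [0, 1]; names and values are illustrative only.
honest = {"discrimination": 0.7, "judge_agreement": 0.7, "non_saturation": 0.7}
gamed = {"discrimination": 1.0, "judge_agreement": 1.0, "non_saturation": 0.2}


def additive_fitness(components):
    """Mean of the component scores: a strong component can offset a weak one."""
    return sum(components.values()) / len(components)


def multiplicative_fitness(components):
    """Product of the component scores: one weak component drags the total down."""
    return prod(components.values())


def rounds_kept(fitness_by_round, min_gain=0.01):
    """Count the rounds retained before the per-round gain drops below min_gain."""
    best = fitness_by_round[0]
    for index, fitness in enumerate(fitness_by_round[1:], start=1):
        if fitness - best < min_gain:
            return index  # halt: this round did not improve enough
        best = fitness
    return len(fitness_by_round)


if __name__ == "__main__":
    print("additive:", additive_fitness(honest), additive_fitness(gamed))
    print("multiplicative:", multiplicative_fitness(honest), multiplicative_fitness(gamed))
    print("rounds kept:", rounds_kept([0.20, 0.30, 0.34, 0.343]))
```

With these made-up numbers the gamed candidate wins under the additive variant but loses under the product, because a weak component cannot be averaged away.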

If this is right

  • Advice-restraint remains the lowest-scoring subjective dimension across the entire frontier.
  • Aggregate capability scores can conceal regressions in specific subjective behaviors that a single targeted instruction can reverse.
  • Warm restraint depends on the particular model generation rather than increases in scale, width, or inference budget.
  • Open-weight models reach the same subjective performance level as closed flagships at substantially lower per-call cost.
  • Multiple independent judge families reproduce the same rubric scores on conversations outside the instrument's training data.
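
A minimal sketch of how a rising aggregate can mask a per-dimension regression. The dimension names are borrowed from the figure captions; the scores are invented for illustration and are not the paper's data.

```python
from statistics import mean

# Hypothetical per-dimension scores (0-10) for an older and a newer model.
older = {
    "advice_restraint": 6.5,
    "emotional_calibration": 7.0,
    "safety_calibration": 7.5,
    "persona_stability": 7.0,
}
newer = {
    "advice_restraint": 5.5,
    "emotional_calibration": 8.0,
    "safety_calibration": 8.5,
    "persona_stability": 8.0,
}


def aggregate_delta(old, new):
    """Change in the mean score across all dimensions."""
    return mean(new.values()) - mean(old.values())


def regressed_dimensions(old, new, tolerance=0.0):
    """Dimensions whose score dropped by more than `tolerance`."""
    return sorted(d for d in old if new[d] - old[d] < -tolerance)


if __name__ == "__main__":
    print("aggregate change:", aggregate_delta(older, newer))
    print("regressed:", regressed_dimensions(older, newer))
```

Reporting the per-dimension deltas alongside the aggregate is what exposes the drop in this toy case.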

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers may need separate scaling laws or training objectives for subjective behaviors rather than relying on objective benchmark gains alone.
  • The observed dissociation raises the possibility that safety and alignment techniques affect subjective restraint more than raw capability measures.
  • The same instrument could be applied to other human-facing domains such as medical advice or educational tutoring to test whether dissociation appears there as well.
  • If the three certificates hold, future evaluations could shift from human correlation to certificate verification for subjective regimes.

Load-bearing premise

The self-evolving instrument under multiplicative anti-gaming fitness and the trust-by-construction paradigm can validly measure subjective behaviors without a human gold standard, despite human raters showing low agreement.

What would settle it

A new model series in which advice-restraint scores rise monotonically with the same scaling factors that improve objective benchmarks, or in which the instrument's output diverges from high-agreement human ratings on the same held-out conversations.

Figures

Figures reproduced from arXiv: 2605.27914 by Junchen Wan, Lei Wang, Pengjie Ding, Yao Liu, Yuming (Rapheal) Huang.

Figure 1: Autonomous evaluation pipeline. (1) Self-evolved rubric: the iterative discrimination-maximization procedure stabilized to a 9-dimension set across rounds; the dimensions themselves were not pre-stipulated (pre-registration applies to the H1–H10 hypotheses and 11 forward predictions, not to the rubric dimensions); (2) multi-turn conversation collection across 30 scenarios per sub-domain × 7 sub-domains × 3…
Figure 2: Cross-family per-dimension scoreboard (Bloom-Benchmarks-style). Six dimensions (columns) × 34 model tiers (rows, grouped into 8 families by color). Each cell: light-gray bar to mean, family-colored dot at mean across N=30 scenarios, ±1 SD error tick, numerical mean to the right of the dot. Per-column header gives the dimension name and a one-sentence description of what it measures. Family colors are held …
Figure 3: OpenAI generation arc on emotional accompaniment. The gpt-4o sideways step (…
Figure 4: Per-family emergence depth. Caveat: families have substantially different tier-ladder ranges (Qwen3.5: 6 tiers spanning 100× total params; Gemini-2.5 and Claude-4-5: 3 tiers each; GPT-5.4: 4 tiers). Differences in emergence-count are informative about where each family currently exposes capability differences via its public tier ladder, not about underlying family capability. Opus-4.7 advice restraint regr…
Figure 5: Per-family adjacent-tier Cliff’s δ at cognitive vs. affective group granularity (decomposition follows Badawi 2026 [8]). Saturated bar = cognitive group; paler bar = affective group; family colors match…
Figure 6: Open-frontier models numerically top closed frontier on all 9 dimensions; gaps within 0.2–0.6 points. Frontier-tier mean score across the self-evolved 9-dim rubric (Slice 2, N=30 scenarios); red border marks the per-dim winner. GLM-5 holds 7 of 9 per-dim wins; Kimi-K2.5 and MiniMax-M2.5 take the other two. Closed-frontier (gpt-5.4-pro, claude-opus-4-5) finish fourth and fifth overall, within 0.4 of GLM-5 …
Figure 7: Judge × judge Spearman ρ matrix on rubric-following across N=297 stratified conversations. Five judges: canonical claude-sonnet-4-6 (Anthropic) plus four cross-family judges. The Qwen3.5-397B–GLM-5 pair shows the highest non-canonical agreement (ρ=0.642); the gpt-5.4–claude-sonnet-4-6 pair the lowest (ρ=0.342). the measurement stack. 3.9 Pre-registered hypothesis verdicts (supports M3) Of 10 pre-registere…
Figure 8: Cross-timeline OpenAI judges × canonical claude-sonnet-4-6 (September 2025). ρ tracks judge capability, not release date per se: gpt-4.1 (April 2025, 5mo back) at ρ=0.62 sits within the 2025–2026 contemporary cohort band (shaded). The gradient is smooth and monotonic, with no discrete cohort jump. canonical Qwen-397B DeepSeek-V3.2 GLM-5 canonical-claude-sonnet-4-6 1.000 0.808 0.749 0.850 Qwen3.5-397B-A17B 0.80…
Figure 9: Judge-human Spearman agreement bucketed by judge-score quartile. Per-bucket N below each bar; error bars are 95% bootstrap CIs (2000 iters); horizontal dashed line marks aggregate ρ ≈ 0.40. Top-bucket within-bucket ρ ≈ 0.62 is approximately 2× the aggregate; bottom-bucket CI crosses zero. Source: ESConv N=64 paired (conversation, supporter-self-rating) cells. finement does not unsaturate them even at five …
Figure 10: Pre-registered hypothesis verdicts. 10 pre-registered hypotheses H1–H10 with verdicts coded by color: Falsified (coral, 6 of 10), Partial (amber, 1), Supported (teal, 2), Deferred (gray, 1). Pre-registration converts mis-located predictions into evidence about where our prior model of emergence was systematically wrong. Per-H prose and numerical detail in Appendix E. between-judge, reduced only by ensembl…
Figure 11: Rubric scores per (model organism, dimension). Stars mark the theoretically-expected top organism per dimension. Match rate: 7/7 top-1 (gold) plus 13/16 discriminating non-gold predictions confirmed (81%). N=10 scenarios per organism, canonical claude-sonnet-4-6 judge. ensemble measures something real is to ask where variance concentrates. If judges disagree uniformly across the score range the ensemble …
Figure 12: Cross-judge vs. within-judge variance as a function of mean score. The downward-U in 5-judge cross-judge std (peak 4.23 at mean ≈ 6.25; floors 1.11 and 0.68) tracks where genuine construct ambiguity exists; the flat within-judge K=2 noise floor (≈ 0.50) holds across the whole range. Together they show the multi-judge ensemble is measuring construct ambiguity, not shared judge bias. 3.14 Pre-registered forw…
Figure 13: Cost-quality Pareto across N=49 tested target models in 8 families; frontier spans DeepSeek, GLM, Kimi, and MiniMax. Each dot is one model; per-call generation cost (log scale, x-axis) from cost log.jsonl averaged over all logged calls; mean rubric score (y-axis) is the per-judgment mean across 9 rubric dimensions, aggregated over all slices in which the model appeared as a target. The bold red line is th…
Figure 14: DeepSeek generation arc: V3 → V3.2 → V4-Flash chat-mode improvement (7.64 → 8.45 → 8.65); R1 reasoning-fork tracks below the contemporary chat-mode peer. Per-dim breakdown of the Opus-4.7 aggregate regression. The Opus-4.7 marginal aggregate regression of −0.12 is concentrated rather than diffuse: advice restraint drops −0.629 and trait contradiction severity drops −0.486 from Opus-4.6, while 5 of 9 other…
Figure 15: GLM (Zhipu) generation arc: GLM-4-9B/32B (Apr’25, mean…
Figure 16: Cross-family reasoning-track timeline. Reasoning models (red triangles) plotted against contem…
Figure 17: Qwen open-weight generation arc (Sep’24 → Apr’26): Qwen2.5-72B-Instruct (5.41) → Qwen3-32B (6.70) → Qwen3.5-397B-A17B (8.40) → Qwen3.6-27B/35B-A3B (8.41, 8.09). Largest cumulative open-weight arc in our roster (+2.99 over 17 months); the Qwen3.5→Qwen3.6 step plateaus or mildly regresses. δ=−0.62, CI [−0.74, −0.49]; GPT-5.4 nano→pro δ=−0.56, CI [−0.69, −0.43]). Benjamini-Hochberg FDR at q=0.05 on cross-ge…
Figure 18: Anthropic Claude generation arc (May’25 → May’26): Sonnet-4 (8.20) → Sonnet-4.5 (8.68) → Sonnet-4.6 (9.33) monotone (+1.13); Opus-4 (8.41) → 4.1 (8.69) → 4.5 (9.10) → 4.6 (9.18) monotone (+0.77); Opus-4.7 (9.06) regresses marginally on aggregate. Claude-3.5 deprecated on Anthropic direct API (unavailable). Highest absolute floor of any family in our roster (≥ 8.20 throughout). restricted to the gpt-4.1→gp…
Figure 19: “Thinking Process:” leak in Qwen3.5 assistant turns across…
Figure 20: Same OpenAI conversations scored by 4 rubric versions through the iterative evolution loop.
Figure 21 reports per-dim Spearman ρ between each non-canonical judge and canonical claude-sonnet-4-6, sorted by the minimum ρ across judges. Reading: a dim with a high minimum is reliable across all 5 judges in the absolute-score sense; a dim with a low minimum is reliable only as a rank-ordering instrument. The polarity-broken trait contradiction severity (last row) fails on all judges and is the reason it is dro…
Figure 22: Within-judge K=2 reliability per dim. Left panel: mean run-to-run std (lower = quieter judge). Right panel: perfect-agreement rate (higher = more reproducible). Two dims (safety calibration, emotional calibration) achieve >90% perfect agreement, partly because they near-ceiling-saturate; three dims (advice restraint, memory recall appropriate, persona stability target) carry 3–5× the run-to-run noise and…
Figure 23: Per-dim reliability indices on the 5-judge ensemble.
Figure 24: Per-judge mean-score difference (canonical…
Figure 25: Cross-timeline OpenAI judges × canonical claude-sonnet-4-6 (September 2025). ρ tracks judge capability, not release date per se: gpt-4.1 (April 2025, 5mo back) at ρ=0.62 sits within the 2025–2026 contemporary cohort band (shaded). The gradient is smooth and monotonic, with no discrete cohort jump. (Same figure also appears as…
Original abstract

Benchmarking is mature where answers are verifiable -- math, code, reasoning -- but the fastest-growing uses of LLMs are subjective and human-facing: companionship, emotional support, counseling. There the default validity test, correlating a metric to human judgment, has no stable anchor: inter-rater agreement is low, structured by annotator identity, barely reproducible, and length-biased. So we cannot answer the question that matters: does capability that scales on objective benchmarks transfer to subjective behavior, and would our instruments even tell us if it did not? We build an instrument for this regime and report what it reveals at the frontier. We contribute, first, a self-evolving instrument that selects and then authors its own behavioral dimensions under a multiplicative anti-gaming fitness, self-halting when it stops improving; second, a trust-by-construction paradigm that earns belief through three certificates established without a human gold standard, where human raters saturate (rho ~ 0.45); and third, the finding it makes visible -- capability transfer is dissociable. Across 49 models, 8 families, and 24 months, subjective behaviors are where objective-benchmark scaling fails to carry over: the sharpest case, advice-restraint (knowing when not to give advice), is the frontier's universal-lowest dimension, and at gpt-4.1->gpt-5 it ran backwards while the aggregate score hid it -- a regression one instruction recovers. Warm restraint is moved by model generation, not by raw scale, MoE width, inference budget, or reasoning mode; the open-weight Pareto frontier matches closed flagships at ~10-80x lower per-call cost; and four judge families replicate the rubric on held-out human ESConv conversations. Data, code, the locked rubric, and judge prompts will be released upon publication.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a self-evolving instrument that authors its own behavioral dimensions under multiplicative anti-gaming fitness and self-halts when improvement stops; a trust-by-construction evaluation paradigm that earns validity through three certificates without a human gold standard (citing low inter-rater rho ~0.45); and reports that objective-benchmark scaling fails to transfer to subjective behaviors across 49 models, 8 families, and 24 months. The sharpest dissociation is advice-restraint, the frontier's universal-lowest dimension, which regressed from gpt-4.1 to gpt-5 while aggregate scores masked it; warm restraint is driven by generation rather than scale, MoE width, or inference budget; open-weight models match closed flagships at lower cost; and four judge families replicate the rubric on held-out ESConv data.

Significance. If the instrument and certificates are shown to be non-circular, the dissociation result would be significant for LLM evaluation in human-facing domains, demonstrating that objective scaling does not guarantee subjective behavior and highlighting a specific regression recoverable by one instruction. The release of data, code, locked rubric, and prompts would support reproducibility. The approach addresses a real gap where human agreement is low, but its validity hinges on external validation of the certificates.

major comments (2)
  1. [Abstract / trust-by-construction paradigm] Abstract and trust-by-construction section: the claim that the three certificates earn belief independently of a human gold standard is load-bearing for the dissociation result, yet the description indicates the certificates are established within the same evolutionary loop and multiplicative fitness; if any certificate is defined by internal outputs or the held-out ESConv replication uses the derived rubric rather than an independent behavioral proxy, the measurement of advice-restraint (and the gpt-4.1→gpt-5 regression) risks circularity.
  2. [Results / advice-restraint dimension] Results on advice-restraint regression: the reported reversal at gpt-4.1 to gpt-5 while aggregate score improves is a central empirical claim, but without stability checks under altered fitness functions, different random seeds, or an external behavioral proxy (e.g., real user interaction logs), it is unclear whether the dimension remains stable or is an artifact of the self-evolving selection process.
minor comments (2)
  1. [Abstract] The abstract states 'four judge families replicate the rubric on held-out human ESConv conversations' but does not specify the exact replication metric or whether the judges were blinded to model identity.
  2. [Method] Notation for the multiplicative anti-gaming fitness function is not expanded in the provided abstract; a brief equation or pseudocode would clarify how the product is computed across dimensions.
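
The figure captions report judge agreement as Spearman ρ, so one plausible reading of the replication metric that minor comment 1 asks about is a rank correlation between judges' scores on the same conversations. The sketch below computes it in plain Python, using average ranks for ties; the judge names and scores are hypothetical, and the paper's exact procedure may differ.

```python
from itertools import combinations


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ValueError if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rho: Pearson correlation of the rank-transformed scores."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores from three judges on the same five conversations.
judge_scores = {
    "judge_a": [7, 5, 9, 6, 8],
    "judge_b": [6, 5, 9, 7, 8],
    "judge_c": [3, 8, 4, 7, 5],
}


def pairwise_spearman(scores):
    """Spearman rho for every unordered pair of judges."""
    return {
        (a, b): spearman(scores[a], scores[b])
        for a, b in combinations(sorted(scores), 2)
    }


if __name__ == "__main__":
    for pair, rho in pairwise_spearman(judge_scores).items():
        print(pair, round(rho, 3))
```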

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on potential circularity in the trust-by-construction certificates and the stability of the advice-restraint regression. We respond to each major comment below.

Point-by-point responses
  1. Referee: [Abstract / trust-by-construction paradigm] Abstract and trust-by-construction section: the claim that the three certificates earn belief independently of a human gold standard is load-bearing for the dissociation result, yet the description indicates the certificates are established within the same evolutionary loop and multiplicative fitness; if any certificate is defined by internal outputs or the held-out ESConv replication uses the derived rubric rather than an independent behavioral proxy, the measurement of advice-restraint (and the gpt-4.1→gpt-5 regression) risks circularity.

    Authors: The certificates are defined to operate outside the evolutionary loop itself. Certificate 1 verifies the multiplicative anti-gaming property of the fitness function by direct inspection of its functional form. Certificate 2 verifies self-halting via the convergence criterion applied after evolution completes. Certificate 3 applies the locked rubric (frozen after evolution) to entirely held-out ESConv conversations using four independent judge families; the ESConv data were never seen during dimension authoring or fitness evaluation. We will revise the trust-by-construction section to include an explicit independence diagram and a table mapping each certificate to its separation from the loop. revision: yes

  2. Referee: [Results / advice-restraint dimension] Results on advice-restraint regression: the reported reversal at gpt-4.1 to gpt-5 while aggregate score improves is a central empirical claim, but without stability checks under altered fitness functions, different random seeds, or an external behavioral proxy (e.g., real user interaction logs), it is unclear whether the dimension remains stable or is an artifact of the self-evolving selection process.

    Authors: We agree that additional robustness checks are warranted. The revised manuscript will report (i) re-runs of the full evolutionary process under an additive fitness variant and (ii) three independent random seeds, confirming that the gpt-4.1 to gpt-5 advice-restraint reversal persists. The existing replication across four judge families on held-out ESConv already supplies an external behavioral proxy; real user interaction logs are not available to us and would require a separate data-collection effort outside the scope of this work. revision: yes
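
The seed-robustness check described in response 2 could be sketched as follows, using Cliff's δ (the effect size named in the figure captions) to ask whether the newer model scores lower in every run. The per-seed scores are made up for illustration, and `reversal_persists` is a hypothetical helper, not something taken from the authors' code.

```python
def cliffs_delta(a, b):
    """Cliff's delta: P(a > b) - P(a < b) over all cross pairs, in [-1, 1]."""
    greater = sum(1 for x in a for y in b if x > y)
    less = sum(1 for x in a for y in b if x < y)
    return (greater - less) / (len(a) * len(b))


# Hypothetical advice-restraint scores per seed: (older model, newer model).
runs_by_seed = {
    0: ([6.5, 7.0, 6.0, 7.5], [5.5, 6.0, 5.0, 6.5]),
    1: ([6.0, 7.0, 6.5, 7.0], [5.0, 6.5, 5.5, 6.0]),
    2: ([7.0, 6.5, 6.0, 7.5], [6.0, 5.5, 5.0, 7.0]),
}


def reversal_persists(runs):
    """True if the newer model scores lower (delta < 0) under every seed."""
    return all(cliffs_delta(newer, older) < 0 for older, newer in runs.values())


if __name__ == "__main__":
    for seed, (older, newer) in runs_by_seed.items():
        print("seed", seed, "delta", cliffs_delta(newer, older))
    print("reversal persists:", reversal_persists(runs_by_seed))
```

A sign check alone is a weak test; a bootstrap confidence interval on δ per seed, as the captions report for other comparisons, would say more.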

Circularity Check

0 steps flagged

No circularity in derivation chain

Full rationale

The paper constructs a self-evolving instrument under an explicit multiplicative anti-gaming fitness and presents a trust-by-construction paradigm justified by three certificates whose definitions and stopping rule are stated as independent of human labels. The dissociation finding is reported as an empirical outcome across 49 models rather than a quantity derived by algebraic identity from the fitness function or certificates. No equation or step reduces a claimed prediction or validity certificate to a fitted input or self-citation by construction; the low inter-rater rho is used only to motivate skipping a gold standard, not to define the certificates themselves. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on abstract only; the self-evolving instrument and trust-by-construction certificates introduce new mechanisms whose internal parameters and validation steps are not detailed enough to enumerate free parameters or invented entities.

axioms (1)
  • Domain assumption: inter-rater agreement for subjective judgments is low and structured by annotator identity (rho ~ 0.45)
    Invoked to justify abandoning human gold standards in favor of the new instrument.



Reference graph

Works this paper leans on

140 extracted references · 51 canonical work pages · 26 internal anchors

  1. [1]

    Ryan Prescott Adams and David J. C. MacKay. Bayesian online changepoint detection.arXiv preprint arXiv:0710.3742, 2007

  2. [2]

    MentalChat16K: A benchmark dataset for conversational mental health assistance

    Anonymous. MentalChat16K: A benchmark dataset for conversational mental health assistance. arXiv preprint arXiv:2503.13509, 2025

  3. [3]

    PsychiatryBench: A multi-task benchmark for LLMs in psychiatry.arXiv preprint arXiv:2509.09711, 2025

    Anonymous. PsychiatryBench: A multi-task benchmark for LLMs in psychiatry.arXiv preprint arXiv:2509.09711, 2025

  4. [4]

    Refusal in Language Models Is Mediated by a Single Direction

    Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction.arXiv preprint arXiv:2406.11717, 2024

  5. [5]

    Taylor, Mark D´ıaz, Christopher M

    Lora Aroyo, Alex S. Taylor, Mark D´ıaz, Christopher M. Homan, Alicia Parrish, Greg Serapio-Garc´ıa, Vinodkumar Prabhakaran, and Ding Wang. DICES dataset: Diversity in conversational AI evaluation for safety.arXiv preprint arXiv:2306.11247, 2023

  6. [6]

    Truth is a lie: Crowd truth and the seven myths of human annotation

    Lora Aroyo and Chris Welty. Truth is a lie: Crowd truth and the seven myths of human annotation. InAI Magazine, volume 36, pages 15–24, 2015

  7. [7]

    CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmarking of Large Language Models in Mental Health Question Answering

    CounselBench Authors. CounselBench: A large-scale expert evaluation and adversarial benchmark of large language models in mental health counseling.arXiv preprint arXiv:2506.08584, 2025

  8. [8]

    Assessing the quality of large language models for mental health support: A multi-attribute evaluation.arXiv preprint arXiv:2601.18630, 2026

    Akram Badawi, Md Tahmid Rahman Laskar, Hossein Rahimi, et al. Assessing the quality of large language models for mental health support: A multi-attribute evaluation.arXiv preprint arXiv:2601.18630, 2026

  9. [9]

    MT-Bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues

    Ge Bai, Jie Liu, Xingyuan Bu, Yancheng He, Jiaheng Liu, Zhanhui Zhou, Zhuoran Lin, Wenbo Su, Tiezheng Ge, Bo Zheng, and Wanli Ouyang. MT-Bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues. InACL, 2024

  10. [10]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, et al. Train- ing a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862, 2022

  11. [11]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, et al. Constitutional AI: Harmlessness from AI feedback.arXiv preprint arXiv:2212.08073, 2022

  12. [12]

    Benchmarking foundation models with language-model-as-an-examiner.arXiv preprint arXiv:2306.04181, 2024

    Yushi Bai et al. Benchmarking foundation models with language-model-as-an-examiner.arXiv preprint arXiv:2306.04181, 2024

  13. [13]

    We need to consider disagreement in evaluation

    Valerio Basile, Michael Fell, Tommaso Fornaciari, Dirk Hovy, Silviu Paun, Barbara Plank, Massimo Poesio, and Alexandra Uma. We need to consider disagreement in evaluation. InBPPF Workshop, ACL, 2021

  14. [14]

    LLMs instead of human judges? a large scale empirical study across 20 NLP evalua- tion tasks

    Anna Bavaresco, Raffaella Bernardi, Leonardo Bertolazzi, Desmond Elliott, Raquel Fern ´andez, Al- bert Gatt, et al. LLMs instead of human judges? a large scale empirical study across 20 NLP evalua- tion tasks. InACL, 2025

  15. [15]

    A systematic review of repro- ducibility research in natural language processing

    Anya Belz, Shubham Agarwal, Anastasia Shimorina, and Ehud Reiter. A systematic review of repro- ducibility research in natural language processing. InEACL, 2021. 22

  16. [16]

    Some latent trait models and their use in inferring an examinee’s ability

    Allan Birnbaum. Some latent trait models and their use in inferring an examinee’s ability. InStatis- tical Theories of Mental Test Scores, pages 397–479. Addison-Wesley, 1968

  17. [17]

    Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons.Biometrika, 39(3/4):324–345, 1952

  18. [18]

    Brennan.Generalizability Theory

    Robert L. Brennan.Generalizability Theory. Springer, New York, 2001

  19. [19]

    Some experimental results in the correlation of mental abilities.British Journal of Psychology, 3(3):296–322, 1910

    William Brown. Some experimental results in the correlation of mental abilities.British Journal of Psychology, 3(3):296–322, 1910

  20. [20]

    Burleson

    Brant R. Burleson. Emotional support skills. InHandbook of Communication and Social Interaction Skills. Lawrence Erlbaum Associates, 2003

  21. [22]

    Broken neural scaling laws.ICLR 2023 (also arXiv:2210.14891), 2023

    Ethan Caballero, Kshitij Gupta, Irina Rish, and David Krueger. Broken neural scaling laws.ICLR 2023 (also arXiv:2210.14891), 2023

  22. [23]

    Toward a perspectivist turn in ground truthing for predictive computing

    Federico Cabitza, Andrea Campagner, and Valerio Basile. Toward a perspectivist turn in ground truthing for predictive computing. InAAAI, 2023

  23. [24]

    Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

    Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, J´er´emy Scheurer, Javier Rando, et al. Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv preprint arXiv:2307.15217, 2023

  24. [25]

    ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

    Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. ChatEval: Towards better LLM-based evaluators through multi-agent debate.arXiv preprint arXiv:2308.07201, 2023

  25. [26]

    Humans or LLMs as the judge? a study on judgement biases

    Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, and Benyou Wang. Humans or LLMs as the judge? a study on judgement biases. InEMNLP, 2024

  26. [27]

    Can large language models be an alternative to human evalua- tions? InACL, 2023

    Cheng-Han Chiang and Hung-yi Lee. Can large language models be an alternative to human evalua- tions? InACL, 2023

  27. [28]

    Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

    Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot Arena: An open platform for evaluating LLMs by human preference.arXiv preprint arXiv:2403.04132, 2024

  28. [29]

    Dominance statistics: Ordinal analyses to answer ordinal questions.Psychological Bulletin, 114(3):494–509, 1993

    Norman Cliff. Dominance statistics: Ordinal analyses to answer ordinal questions.Psychological Bulletin, 114(3):494–509, 1993

  29. [30]

    Lawrence Erlbaum Associates, 1996

    Norman Cliff.Ordinal Methods for Behavioral Data Analysis. Lawrence Erlbaum Associates, 1996

  30. [31]

    Callahan, Orin Hargraves, Fos- ter Goss, Nancy Ide, Aur ´elie N´ev´eol, Cyril Grouin, and Lawrence E

    Kevin Bretonnel Cohen, Jingbo Xia, Pierre Zweigenbaum, Tiffany J. Callahan, Orin Hargraves, Fos- ter Goss, Nancy Ide, Aur ´elie N´ev´eol, Cyril Grouin, and Lawrence E. Hunter. Three dimensions of reproducibility in natural language processing. InLREC, 2018

  31. [32]

    TICK- ing all the boxes: Generated checklists improve LLM evaluation and generation.arXiv preprint arXiv:2410.03608, 2024

    Jonathan Cook, Tim Rockt ¨aschel, Jakob Foerster, Dennis Aumiller, and Alex Wang. TICK- ing all the boxes: Generated checklists improve LLM evaluation and generation.arXiv preprint arXiv:2410.03608, 2024. 23

  32. [33]

    Cronbach and Paul E

    Lee J. Cronbach and Paul E. Meehl. Construct validity in psychological tests.Psychological Bulletin, 52(4):281–302, 1955

  33. [34]

    Dealing with disagreements: Looking beyond the majority vote in subjective annotations

    Aida Mostafazadeh Davani, Mark D ´ıaz, and Vinodkumar Prabhakaran. Dealing with disagreements: Looking beyond the majority vote in subjective annotations. InNAACL, 2022

  34. [35]

    Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

    Carson Denison, Monte MacDiarmid, Fazl Barez, David Duvenaud, Shauna Kravec, Samuel Marks, et al. Sycophancy to subterfuge: Investigating reward-tampering in large language models.arXiv preprint arXiv:2406.10162, 2024

  35. [36]

    Jesse Dodge, Suchin Gururangan, Dallas Card, Roy Schwartz, and Noah A. Smith. Show your work: Improved reporting of experimental results. In EMNLP-IJCNLP, pages 2185–2194, 2019

  36. [37]

    Zhengxiao Du, Aohan Zeng, Yuxiao Dong, and Jie Tang. Understanding emergent abilities of language models from the loss perspective. In NeurIPS, 2024

  37. [38]

    Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. Length-controlled AlpacaEval: A simple way to debias automatic evaluators. In COLM, 2024

  38. [39]

    Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. AlpacaFarm: A simulation framework for methods that learn from human feedback. In NeurIPS, 2023

  39. [40]

    Robert L. Ebel and David A. Frisbie. Essentials of Educational Measurement. Prentice-Hall, 5th edition, 1991

  40. [41]

    Bradley Efron and Robert J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall/CRC, 1993

  41. [42]

    Eve Fleisig, Rediet Abebe, and Dan Klein. When the majority is wrong: Modeling annotator disagreement for subjective tasks. In EMNLP, 2023

  42. [43]

    Simona Frenda, Gavin Abercrombie, Valerio Basile, Alessandro Pedrani, Raffaella Panizzon, Alessandra Teresa Cignarella, Cristina Frieder, and Davide Bernardi. Perspectivist approaches to natural language processing: A survey. Language Resources and Evaluation, 59(2), 2025

  43. [44]

    Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, et al. Language models scale reliably with over-training and on downstream tasks. In NeurIPS, 2024

  44. [45]

    Deep Ganguli, Danny Hernandez, Liane Lovitt, Nova DasSarma, et al. Predictability and surprise in large generative models. In FAccT, 2022

  45. [47]

    Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. arXiv preprint arXiv:2210.10760, 2022

  46. [48]

    Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. Communications of the ACM, 64(12):86–92, 2021

  47. [49]

    Andrew Gelman and Eric Loken. The statistical crisis in science: Data-dependent analysis—a “garden of forking paths”. American Scientist, 102(6):460–465, 2014

  48. [50]

    Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, et al. A survey on LLM-as-a-judge. arXiv preprint arXiv:2411.15594, 2024

  49. [51]

    Helia Hashemi, Jason Eisner, Corby Rosset, Benjamin Van Durme, and Chris Kedzie. LLM-Rubric: A multidimensional, calibrated approach to automated evaluation of natural language texts. arXiv preprint arXiv:2501.00274, 2025

  50. [52]

    Clara E. Hill. Manual for the Hill counselor verbal response category system (revised). Unpublished manuscript, University of Maryland, 1985

  51. [53]

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, et al. Training compute-optimal large language models. In NeurIPS, 2022

  52. [54]

    Paul W. Holland and Dorothy T. Thayer. Differential item performance and the Mantel-Haenszel procedure. In Howard Wainer and Henry I. Braun, editors, Test Validity, pages 129–145. Lawrence Erlbaum, 1988

  53. [55]

    Shengding Hu, Xin Liu, Xu Han, Xinrong Zhang, Chaoqun He, et al. Predicting emergent abilities with infinite resolution evaluation. In ICLR, 2024

  54. [56]

    Mrinank Iyer, Karan Aggarwal, Sanmi Koyejo, et al. Heart: A unified benchmark for humans and LLMs in emotional support dialogue. arXiv preprint arXiv:2601.19922, 2026

  55. [57]

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. In ICLR, 2025

  56. [58]

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

  57. [59]

    Rebecca Killick, Paul Fearnhead, and Idris A. Eckley. Optimal detection of changepoints with a linear computational cost. Journal of the American Statistical Association, 107(500):1590–1598, 2012

  58. [60]

    Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, et al. Prometheus: Inducing fine-grained evaluation capability in language models. In ICLR, 2024

  59. [61]

    Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, and Minjoon Seo. Prometheus 2: An open source language model specialized in evaluating other language models. In EMNLP, 2024

  60. [62]

    Jan Hendrik Kirchner, Yining Chen, Harri Edwards, Jan Leike, Nat McAleese, and Yuri Burda. Prover-verifier games improve legibility of LLM outputs. arXiv preprint arXiv:2407.13692, 2024

  61. [63]

    Ryan Koo, Minhwa Lee, Vipul Raheja, Jong Inn Park, Zae Myung Kim, and Dongyeop Kang. Benchmarking cognitive biases in large language models as evaluators. arXiv preprint arXiv:2309.17012, 2024

  62. [64]

    Victoria Krakovna, Jonathan Uesato, Vladimir Mikulik, Matthew Rahtz, Tom Everitt, Ramana Kumar, Zac Kenton, Jan Leike, and Shane Legg. Specification gaming: The flip side of AI ingenuity. DeepMind Blog, 2020

  63. [65]

    Klaus Krippendorff. Computing Krippendorff’s alpha-reliability. Annenberg School for Communication, University of Pennsylvania, Departmental Papers, 2011

  64. [66]

    Wai-Chung Kwan, Xingshan Zeng, Yuxin Jiang, Yufei Wang, Liangyou Li, Lifeng Shang, Xin Jiang, Qun Liu, and Kam-Fai Wong. MT-Eval: A multi-turn capabilities evaluation benchmark for large language models. arXiv preprint arXiv:2401.16745, 2024

  65. [67]

    John P. Lalor, Pedro Rodriguez, João Sedoc, and José Hernández-Orallo. Item response theory for natural language processing. In EACL Tutorial Abstracts, 2024

  66. [68]

    John P. Lalor, Hao Wu, and Hong Yu. Building an evaluation scale using item response theory. In EMNLP, pages 648–657, 2016

  67. [69]

    John P. Lalor, Hao Wu, and Hong Yu. Learning latent parameters without human response patterns: Item response theory with artificial crowds. In EMNLP-IJCNLP, 2019

  68. [70]

    Xinyu Li et al. Autobencher: Automated benchmark generation. arXiv preprint arXiv:2407.08351, 2024

  69. [71]

    Yen-Ting Lin and Yun-Nung Chen. LLM-Eval: Unified multi-dimensional automatic evaluation for open-domain conversations with large language models. In NLP4ConvAI Workshop, 2023

  70. [72]

    Hao Liu et al. Arenabencher: Item-evolution benchmarking via multi-model competition. arXiv preprint arXiv:2510.08569, 2025

  71. [73]

    Siyang Liu, Chujie Zheng, Orianna Demasi, Sahand Sabour, Yu Li, Zhou Yu, Yong Jiang, and Minlie Huang. Towards emotional support dialog systems. In ACL, 2021

  72. [74]

    Yang Liu et al. G-Eval: NLG evaluation using GPT-4 with better human alignment. In EMNLP,

  73. [75]

    Yuxuan Liu, Tianchi Yang, Shaohan Huang, Zihan Zhang, Haizhen Huang, Furu Wei, et al. Calibrating LLM-based evaluator. In LREC-COLING, 2024

  74. [76]

    Frederic M. Lord and Melvin R. Novick. Statistical Theories of Mental Test Scores. Addison-Wesley, 1968

  75. [77]

    Inbal Magar and Roy Schwartz. Data contamination: From memorization to exploitation. In ACL, 2022

  76. [78]

    David Manheim and Scott Garrabrant. Categorizing variants of Goodhart’s law. arXiv preprint arXiv:1803.04585, 2018

  77. [79]

    Nathan Mantel and William Haenszel. Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22(4):719–748, 1959

  78. [80]

    Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, et al. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249, 2024

  79. [81]

    Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. In FAT*, pages 220–229, 2019

  80. [82]

    Marcus R. Munafò, Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, et al. A manifesto for reproducible science. Nature Human Behaviour, 1(1):0021, 2017
