Does Capability Transfer to Subjective Behavior -- and Would Our Instruments Tell Us? A Self-Evolving, Trust-by-Construction Evaluation Paradigm

Junchen Wan; Lei Wang; Pengjie Ding; Yao Liu; Yuming (Rapheal) Huang

arxiv: 2605.27914 · v2 · pith:HB5AJJODnew · submitted 2026-05-27 · 💻 cs.CL · cs.AI

Yuming (Rapheal) Huang, Yao Liu, Pengjie Ding, Lei Wang, Junchen Wan

Pith reviewed 2026-06-29 12:51 UTC · model grok-4.3

keywords LLM evaluation · subjective behavior · capability transfer · self-evolving benchmark · advice restraint · scaling dissociation · trust-by-construction · anti-gaming fitness

The pith

Capability that scales on objective benchmarks does not transfer to subjective behaviors in large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether scaling on verifiable tasks like math and code carries over to subjective, human-facing uses such as companionship and emotional support. It builds a self-evolving instrument that generates its own behavioral dimensions under an anti-gaming fitness function and stops when gains cease. The instrument operates under a trust-by-construction approach that establishes three certificates without any human gold standard. Applied across 49 models from 8 families over 24 months, it shows that subjective behaviors form a separate regime: objective scaling does not predict them. The clearest dissociation appears in advice-restraint, which ranks lowest at the frontier and regressed between GPT-4.1 and GPT-5 even as aggregate scores rose.

Core claim

Capability transfer is dissociable. Across 49 models, 8 families, and 24 months, subjective behaviors are where objective-benchmark scaling fails to carry over: the sharpest case, advice-restraint (knowing when not to give advice), is the frontier's universal-lowest dimension, and at gpt-4.1->gpt-5 it ran backwards while the aggregate score hid it -- a regression one instruction recovers. Warm restraint is moved by model generation, not by raw scale, MoE width, inference budget, or reasoning mode; the open-weight Pareto frontier matches closed flagships at ~10-80x lower per-call cost; and four judge families replicate the rubric on held-out human ESConv conversations.

What carries the argument

A self-evolving instrument that selects and authors its own behavioral dimensions under multiplicative anti-gaming fitness, paired with a trust-by-construction paradigm that earns validity through three certificates established without a human gold standard.
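
The abstract names a multiplicative anti-gaming fitness and a self-halting loop but does not give their form. As a purely illustrative sketch (the component names, the thresholds, and the `evolve` helper are assumptions, not the authors' implementation), the code below shows why a product can penalize a candidate that games one component where a mean would not, and one simple way a loop can halt once gains cease.

```python
from math import prod


def multiplicative_fitness(components):
    """Product of component scores in [0, 1]; one weak component sinks the total."""
    if not components:
        raise ValueError("need at least one component")
    for name, value in components.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return prod(components.values())


def additive_fitness(components):
    """Plain mean, shown only for contrast with the product."""
    return sum(components.values()) / len(components)


def evolve(propose, score, initial, min_gain=1e-3, patience=2, max_rounds=50):
    """Keep the best candidate; halt after `patience` rounds without a real gain."""
    best, best_fit = initial, score(initial)
    stale = 0
    rounds = 0
    while stale < patience and rounds < max_rounds:
        candidate = propose(best)
        fit = score(candidate)
        if fit > best_fit + min_gain:
            best, best_fit, stale = candidate, fit, 0
        else:
            stale += 1
        rounds += 1
    return best, best_fit, rounds


if __name__ == "__main__":
    # Hypothetical component scores, invented for illustration.
    gamed = {"discrimination": 1.0, "judge_agreement": 1.0, "length_independence": 0.2}
    honest = {"discrimination": 0.7, "judge_agreement": 0.7, "length_independence": 0.7}
    print(additive_fitness(gamed) > additive_fitness(honest))              # mean prefers the gamed rubric
    print(multiplicative_fitness(gamed) < multiplicative_fitness(honest))  # product does not
```

Under these assumed numbers the mean ranks the gamed candidate higher while the product ranks it lower, which is the intuition usually given for multiplicative aggregation; whether the paper's fitness behaves this way depends on details not available here.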

If this is right

Advice-restraint remains the lowest-scoring subjective dimension across the entire frontier.
Aggregate capability scores can conceal regressions in specific subjective behaviors that a single targeted instruction can reverse.
Warm restraint depends on the particular model generation rather than increases in scale, width, or inference budget.
Open-weight models reach the same subjective performance level as closed flagships at substantially lower per-call cost.
Multiple independent judge families reproduce the same rubric scores on conversations outside the instrument's training data.
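
The second implication above, that an aggregate can rise while one dimension falls, is simple arithmetic. A minimal sketch follows; the scores are invented for illustration and are not values from the paper, and the function name `masked_regressions` is hypothetical.

```python
from statistics import mean


def masked_regressions(old, new, tolerance=0.0):
    """Return dimensions that dropped even though the mean across dimensions rose."""
    if set(old) != set(new):
        raise ValueError("both score sets must cover the same dimensions")
    if mean(new.values()) <= mean(old.values()):
        return []  # the aggregate did not improve, so nothing is being masked
    return sorted(d for d in old if new[d] - old[d] < -tolerance)


if __name__ == "__main__":
    # Made-up per-dimension scores for two model generations.
    older = {"advice_restraint": 6.0, "emotional_calibration": 7.0, "safety_calibration": 8.0}
    newer = {"advice_restraint": 5.0, "emotional_calibration": 8.5, "safety_calibration": 9.0}
    print(mean(older.values()), mean(newer.values()))  # aggregate rises from 7.0 to 7.5
    print(masked_regressions(older, newer))            # ['advice_restraint']
```

Reporting per-dimension deltas next to the aggregate is the obvious guard against this effect.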

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers may need separate scaling laws or training objectives for subjective behaviors rather than relying on objective benchmark gains alone.
The observed dissociation raises the possibility that safety and alignment techniques affect subjective restraint more than raw capability measures.
The same instrument could be applied to other human-facing domains such as medical advice or educational tutoring to test whether dissociation appears there as well.
If the three certificates hold, future evaluations could shift from human correlation to certificate verification for subjective regimes.

Load-bearing premise

The self-evolving instrument under multiplicative anti-gaming fitness and the trust-by-construction paradigm can validly measure subjective behaviors without a human gold standard, despite human raters showing low agreement.

What would settle it

A new model series in which advice-restraint scores rise monotonically with the same scaling factors that improve objective benchmarks, or in which the instrument's output diverges from high-agreement human ratings on the same held-out conversations.

Figures

Figures reproduced from arXiv: 2605.27914 by Junchen Wan, Lei Wang, Pengjie Ding, Yao Liu, Yuming (Rapheal) Huang.

**Figure 1.** Autonomous evaluation pipeline. (1) Self-evolved rubric: the iterative discrimination-maximization procedure stabilized to a 9-dimension set across rounds; the dimensions themselves were not pre-stipulated (pre-registration applies to the H1–H10 hypotheses and 11 forward predictions, not to the rubric dimensions); (2) multi-turn conversation collection across 30 scenarios per sub-domain × 7 sub-domains × 3…

**Figure 2.** Cross-family per-dimension scoreboard (Bloom-Benchmarks-style). Six dimensions (columns) × 34 model tiers (rows, grouped into 8 families by color). Each cell: light-gray bar to mean, family-colored dot at mean across N=30 scenarios, ±1 SD error tick, numerical mean to the right of the dot. Per-column header gives the dimension name and a one-sentence description of what it measures. Family colors are held …

**Figure 3.** OpenAI generation arc on emotional accompaniment. The gpt-4o sideways step (…

**Figure 4.** Per-family emergence depth. Caveat: families have substantially different tier-ladder ranges (Qwen3.5: 6 tiers spanning 100× total params; Gemini-2.5 and Claude-4-5: 3 tiers each; GPT-5.4: 4 tiers). Differences in emergence-count are informative about where each family currently exposes capability differences via its public tier ladder, not about underlying family capability. Opus-4.7 advice restraint regr…

**Figure 5.** Per-family adjacent-tier Cliff’s δ at cognitive vs. affective group granularity (decomposition follows Badawi 2026 [8]). Saturated bar = cognitive group; paler bar = affective group; family colors match …
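
For readers unfamiliar with the effect size named in this caption, Cliff's δ is the share of cross-group pairs in which the first group scores higher minus the share in which it scores lower. A minimal sketch of the standard definition follows; how the paper handles ties or builds its confidence intervals is not shown in this excerpt, and the sample scores in the comments are invented.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs; ties count as zero."""
    if not xs or not ys:
        raise ValueError("both groups must be non-empty")
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


if __name__ == "__main__":
    # Invented scores for two adjacent tiers; not data from the paper.
    print(cliffs_delta([1, 2, 3], [4, 5, 6]))  # -1.0: every x is below every y
    print(cliffs_delta([5, 6, 7], [4, 5, 6]))  # 5/9: mostly higher, with two tied pairs
```

Values near ±1 mean the two groups barely overlap; values near 0 mean neither group tends to outscore the other.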

**Figure 6.** Open-frontier models numerically top closed frontier on all 9 dimensions; gaps within 0.2–0.6 points. Frontier-tier mean score across the self-evolved 9-dim rubric (Slice 2, N=30 scenarios); red border marks the per-dim winner. GLM-5 holds 7 of 9 per-dim wins; Kimi-K2.5 and MiniMax-M2.5 take the other two. Closed-frontier (gpt-5.4-pro, claude-opus-4-5) finish fourth and fifth overall, within 0.4 of GLM-5 …

**Figure 7.** Judge × judge Spearman ρ matrix on rubric-following across N=297 stratified conversations. Five judges: canonical claude-sonnet-4-6 (Anthropic) plus four cross-family judges. The Qwen3.5-397B–GLM-5 pair shows the highest non-canonical agreement (ρ=0.642); the gpt-5.4–claude-sonnet-4-6 pair the lowest (ρ=0.342).

**Figure 8.** Cross-timeline OpenAI judges × canonical claude-sonnet-4-6 (September 2025). ρ tracks judge capability, not release date per se: gpt-4.1 (April 2025, 5mo back) at ρ=0.62 sits within the 2025–2026 contemporary cohort band (shaded). The gradient is smooth and monotonic, with no discrete cohort jump. Table fragment following the caption (truncated in the source), columns canonical / Qwen-397B / DeepSeek-V3.2 / GLM-5: canonical-claude-sonnet-4-6 1.000 / 0.808 / 0.749 / 0.850; Qwen3.5-397B-A17B 0.80…

**Figure 9.** Judge-human Spearman agreement bucketed by judge-score quartile. Per-bucket N below each bar; error bars are 95% bootstrap CIs (2000 iters); horizontal dashed line marks aggregate ρ ≈ 0.40. Top-bucket within-bucket ρ ≈ 0.62 is approximately 2× the aggregate; bottom-bucket CI crosses zero. Source: ESConv N=64 paired (conversation, supporter-self-rating) cells.
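
The agreement statistics in these captions (Spearman ρ with 95% bootstrap intervals over 2000 iterations) can be sketched in plain Python. This is a generic percentile bootstrap written for illustration; the paper's exact resampling scheme is not given in this excerpt, and all function names below are hypothetical.

```python
import random


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return float("nan")  # correlation is undefined for a constant series
    return cov / (var_a * var_b) ** 0.5


def spearman(a, b):
    """Spearman rho is the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(a), average_ranks(b))


def bootstrap_ci(a, b, iters=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for Spearman rho over paired resamples."""
    rng = random.Random(seed)
    n = len(a)
    stats = []
    for _ in range(iters):
        idx = [rng.randrange(n) for _ in range(n)]
        rho = spearman([a[i] for i in idx], [b[i] for i in idx])
        if rho == rho:  # drop NaN from degenerate resamples
            stats.append(rho)
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lower, upper
```

An interval that spans zero, as the caption reports for the bottom bucket, means the sign of the agreement is not resolved at that sample size.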

**Figure 10.** Pre-registered hypothesis verdicts. 10 pre-registered hypotheses H1–H10 with verdicts coded by color: Falsified (coral, 6 of 10), Partial (amber, 1), Supported (teal, 2), Deferred (gray, 1). Preregistration converts mis-located predictions into evidence about where our prior model of emergence was systematically wrong. Per-H prose and numerical detail in Appendix E.

**Figure 11.** Rubric scores per (model organism, dimension). Stars mark the theoretically-expected top organism per dimension. Match rate: 7/7 top-1 (gold) plus 13/16 discriminating non-gold predictions confirmed (81%). N=10 scenarios per organism, canonical claude-sonnet-4-6 judge.

**Figure 12.** Cross-judge vs. within-judge variance as a function of mean score. The downward-U in 5-judge cross-judge std (peak 4.23 at mean ≈ 6.25; floors 1.11 and 0.68) tracks where genuine construct ambiguity exists; the flat within-judge K=2 noise floor (≈ 0.50) holds across the whole range. Together they show the multi-judge ensemble is measuring construct ambiguity, not shared judge bias.

**Figure 13.** Cost-quality Pareto across N=49 tested target models in 8 families; frontier spans DeepSeek, GLM, Kimi, and MiniMax. Each dot is one model; per-call generation cost (log scale, x-axis) from cost log.jsonl averaged over all logged calls; mean rubric score (y-axis) is the per-judgment mean across 9 rubric dimensions, aggregated over all slices in which the model appeared as a target. The bold red line is th…
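
A cost-quality Pareto frontier of the kind this caption describes keeps every model that no other model beats on both cost and score. The sketch below is a generic construction, not the authors' plotting code; the model names and numbers are placeholders.

```python
def pareto_frontier(models):
    """models: iterable of (name, cost, score). Returns the names of models for
    which no cheaper-or-equal model scores at least as high, ordered by cost."""
    # Sort by ascending cost; at equal cost, consider the higher score first.
    ordered = sorted(models, key=lambda m: (m[1], -m[2]))
    frontier = []
    best_score = float("-inf")
    for name, _cost, score in ordered:
        if score > best_score:  # strictly better than everything cheaper
            frontier.append(name)
            best_score = score
    return frontier


if __name__ == "__main__":
    # Placeholder models: (name, per-call cost in USD, mean rubric score).
    candidates = [
        ("model-a", 0.001, 7.0),
        ("model-b", 0.010, 8.5),
        ("model-c", 0.050, 8.0),  # dominated: model-b is cheaper and scores higher
        ("model-d", 0.500, 9.0),
    ]
    print(pareto_frontier(candidates))  # ['model-a', 'model-b', 'model-d']
```

Exact duplicates of a frontier point are dropped by this version, which is a presentation choice rather than part of the definition.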

**Figure 14.** DeepSeek generation arc: V3 → V3.2 → V4-Flash chat-mode improvement (7.64 → 8.45 → 8.65); R1 reasoning-fork tracks below the contemporary chat-mode peer. Per-dim breakdown of the Opus-4.7 aggregate regression. The Opus-4.7 marginal aggregate regression of −0.12 is concentrated rather than diffuse: advice restraint drops −0.629 and trait contradiction severity drops −0.486 from Opus-4.6, while 5 of 9 other…

**Figure 15.** GLM (Zhipu) generation arc: GLM-4-9B/32B (Apr’25, mean …

**Figure 16.** Cross-family reasoning-track timeline. Reasoning models (red triangles) plotted against contem…

**Figure 17.** Qwen open-weight generation arc (Sep’24 → Apr’26): Qwen2.5-72B-Instruct (5.41) → Qwen3-32B (6.70) → Qwen3.5-397B-A17B (8.40) → Qwen3.6-27B/35B-A3B (8.41, 8.09). Largest cumulative open-weight arc in our roster (+2.99 over 17 months); the Qwen3.5→Qwen3.6 step plateaus or mildly regresses. … δ=−0.62, CI [−0.74, −0.49]; GPT-5.4 nano→pro δ=−0.56, CI [−0.69, −0.43]). Benjamini-Hochberg FDR at q=0.05 on cross-ge…
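
The text fragment attached to this caption mentions Benjamini-Hochberg FDR control at q=0.05. For reference, a minimal sketch of the standard step-up procedure follows; the p-values are invented, and which comparisons the paper actually corrected is cut off in the source.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return one boolean per p-value: True where the hypothesis is rejected
    under the Benjamini-Hochberg step-up procedure at FDR level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= k * q / m.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * q / m:
            k_max = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            rejected[i] = True
    return rejected


if __name__ == "__main__":
    # Invented p-values for five comparisons.
    print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6]))
    # [True, True, False, False, False]
```

Note that the step-up rule rejects everything ranked at or below the largest passing rank, even if an intermediate p-value misses its own threshold.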

**Figure 18.** Anthropic Claude generation arc (May’25 → May’26): Sonnet-4 (8.20) → Sonnet-4.5 (8.68) → Sonnet-4.6 (9.33) monotone (+1.13); Opus-4 (8.41) → 4.1 (8.69) → 4.5 (9.10) → 4.6 (9.18) monotone (+0.77); Opus-4.7 (9.06) regresses marginally on aggregate. Claude-3.5 deprecated on Anthropic direct API (unavailable). Highest absolute floor of any family in our roster (≥ 8.20 throughout).

**Figure 19.** “Thinking Process:” leak in Qwen3.5 assistant turns across …

**Figure 20.** Same OpenAI conversations scored by 4 rubric versions through the iterative evolution loop.

**Figure 21.** Reports per-dim Spearman ρ between each non-canonical judge and canonical claude-sonnet-4-6, sorted by the minimum ρ across judges. Reading: a dim with a high minimum is reliable across all 5 judges in the absolute-score sense; a dim with a low minimum is reliable only as a rank-ordering instrument. The polarity-broken trait contradiction severity (last row) fails on all judges and is the reason it is dro…

**Figure 22.** Within-judge K=2 reliability per dim. Left panel: mean run-to-run std (lower = quieter judge). Right panel: perfect-agreement rate (higher = more reproducible). Two dims (safety calibration, emotional calibration) achieve >90% perfect agreement, partly because they near-ceiling-saturate; three dims (advice restraint, memory recall appropriate, persona stability target) carry 3–5× the run-to-run noise and…

**Figure 23.** Per-dim reliability indices on the 5-judge ensemble.

**Figure 24.** Per-judge mean-score difference (canonical …

**Figure 25.** Cross-timeline OpenAI judges × canonical claude-sonnet-4-6 (September 2025). ρ tracks judge capability, not release date per se: gpt-4.1 (April 2025, 5mo back) at ρ=0.62 sits within the 2025–2026 contemporary cohort band (shaded). The gradient is smooth and monotonic, with no discrete cohort jump. (Same figure also appears as …

Original abstract

Benchmarking is mature where answers are verifiable -- math, code, reasoning -- but the fastest-growing uses of LLMs are subjective and human-facing: companionship, emotional support, counseling. There the default validity test, correlating a metric to human judgment, has no stable anchor: inter-rater agreement is low, structured by annotator identity, barely reproducible, and length-biased. So we cannot answer the question that matters: does capability that scales on objective benchmarks transfer to subjective behavior, and would our instruments even tell us if it did not? We build an instrument for this regime and report what it reveals at the frontier. We contribute, first, a self-evolving instrument that selects and then authors its own behavioral dimensions under a multiplicative anti-gaming fitness, self-halting when it stops improving; second, a trust-by-construction paradigm that earns belief through three certificates established without a human gold standard, where human raters saturate (rho ~ 0.45); and third, the finding it makes visible -- capability transfer is dissociable. Across 49 models, 8 families, and 24 months, subjective behaviors are where objective-benchmark scaling fails to carry over: the sharpest case, advice-restraint (knowing when not to give advice), is the frontier's universal-lowest dimension, and at gpt-4.1->gpt-5 it ran backwards while the aggregate score hid it -- a regression one instruction recovers. Warm restraint is moved by model generation, not by raw scale, MoE width, inference budget, or reasoning mode; the open-weight Pareto frontier matches closed flagships at ~10-80x lower per-call cost; and four judge families replicate the rubric on held-out human ESConv conversations. Data, code, the locked rubric, and judge prompts will be released upon publication.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper builds a self-evolving rubric for subjective LLM behaviors that avoids human gold standards and reports scaling failures like advice-restraint regression, but the internal certificates leave the measurements unanchored.

The main takeaway is that this work tries to measure things like restraint in advice-giving across models without relying on human raters, whose agreement is low. It generates its own dimensions through an evolutionary process with a multiplicative fitness function meant to block gaming, then claims three certificates establish trust on their own. Across 49 models it finds subjective traits do not follow objective scaling, with advice-restraint as the weakest area and a clear drop from GPT-4.1 to GPT-5 that a single instruction reverses.

What is new is the combination of self-evolving rubric generation, the anti-gaming fitness, and the explicit no-gold-standard setup. Releasing the locked rubric, code, and judge prompts is a concrete step that lets others inspect or rerun the process. The held-out ESConv replication across judge families adds some check on stability.

The soft spot is the lack of an external anchor. The certificates are generated inside the same loop that creates the dimensions, and the justification for skipping humans rests on their low agreement rather than a positive demonstration that the new method tracks actual behavior. Without tests under changed fitness functions, different random seeds, or an independent behavioral proxy, the reported regressions and Pareto results could be tied to the instrument's own construction. The abstract does not show those stability checks.

This is aimed at people building evaluations for counseling-style or emotional-support uses of LLMs. It deserves peer review because the target problem is real and the method differs from existing benchmarks, even though the central validity claim needs more external grounding to hold up.

Referee Report

2 major / 2 minor

Summary. The paper introduces a self-evolving instrument that authors its own behavioral dimensions under multiplicative anti-gaming fitness and self-halts when improvement stops; a trust-by-construction evaluation paradigm that earns validity through three certificates without a human gold standard (citing low inter-rater rho ~0.45); and reports that objective-benchmark scaling fails to transfer to subjective behaviors across 49 models, 8 families, and 24 months. The sharpest dissociation is advice-restraint, the frontier's universal-lowest dimension, which regressed from gpt-4.1 to gpt-5 while aggregate scores masked it; warm restraint is driven by generation rather than scale, MoE width, or inference budget; open-weight models match closed flagships at lower cost; and four judge families replicate the rubric on held-out ESConv data.

Significance. If the instrument and certificates are shown to be non-circular, the dissociation result would be significant for LLM evaluation in human-facing domains, demonstrating that objective scaling does not guarantee subjective behavior and highlighting a specific regression recoverable by one instruction. The release of data, code, locked rubric, and prompts would support reproducibility. The approach addresses a real gap where human agreement is low, but its validity hinges on external validation of the certificates.

major comments (2)

[Abstract / trust-by-construction paradigm] Abstract and trust-by-construction section: the claim that the three certificates earn belief independently of a human gold standard is load-bearing for the dissociation result, yet the description indicates the certificates are established within the same evolutionary loop and multiplicative fitness; if any certificate is defined by internal outputs or the held-out ESConv replication uses the derived rubric rather than an independent behavioral proxy, the measurement of advice-restraint (and the gpt-4.1→gpt-5 regression) risks circularity.
[Results / advice-restraint dimension] Results on advice-restraint regression: the reported reversal at gpt-4.1 to gpt-5 while aggregate score improves is a central empirical claim, but without stability checks under altered fitness functions, different random seeds, or an external behavioral proxy (e.g., real user interaction logs), it is unclear whether the dimension remains stable or is an artifact of the self-evolving selection process.

minor comments (2)

[Abstract] The abstract states 'four judge families replicate the rubric on held-out human ESConv conversations' but does not specify the exact replication metric or whether the judges were blinded to model identity.
[Method] Notation for the multiplicative anti-gaming fitness function is not expanded in the provided abstract; a brief equation or pseudocode would clarify how the product is computed across dimensions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on potential circularity in the trust-by-construction certificates and the stability of the advice-restraint regression. We respond to each major comment below.

Referee: [Abstract / trust-by-construction paradigm] Abstract and trust-by-construction section: the claim that the three certificates earn belief independently of a human gold standard is load-bearing for the dissociation result, yet the description indicates the certificates are established within the same evolutionary loop and multiplicative fitness; if any certificate is defined by internal outputs or the held-out ESConv replication uses the derived rubric rather than an independent behavioral proxy, the measurement of advice-restraint (and the gpt-4.1→gpt-5 regression) risks circularity.

Authors: The certificates are defined to operate outside the evolutionary loop itself. Certificate 1 verifies the multiplicative anti-gaming property of the fitness function by direct inspection of its functional form. Certificate 2 verifies self-halting via the convergence criterion applied after evolution completes. Certificate 3 applies the locked rubric (frozen after evolution) to entirely held-out ESConv conversations using four independent judge families; the ESConv data were never seen during dimension authoring or fitness evaluation. We will revise the trust-by-construction section to include an explicit independence diagram and a table mapping each certificate to its separation from the loop. revision: yes
Referee: [Results / advice-restraint dimension] Results on advice-restraint regression: the reported reversal at gpt-4.1 to gpt-5 while aggregate score improves is a central empirical claim, but without stability checks under altered fitness functions, different random seeds, or an external behavioral proxy (e.g., real user interaction logs), it is unclear whether the dimension remains stable or is an artifact of the self-evolving selection process.

Authors: We agree that additional robustness checks are warranted. The revised manuscript will report (i) re-runs of the full evolutionary process under an additive fitness variant and (ii) three independent random seeds, confirming that the gpt-4.1 to gpt-5 advice-restraint reversal persists. The existing replication across four judge families on held-out ESConv already supplies an external behavioral proxy; real user interaction logs are not available to us and would require a separate data-collection effort outside the scope of this work. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

The paper constructs a self-evolving instrument under an explicit multiplicative anti-gaming fitness and presents a trust-by-construction paradigm justified by three certificates whose definitions and stopping rule are stated as independent of human labels. The dissociation finding is reported as an empirical outcome across 49 models rather than a quantity derived by algebraic identity from the fitness function or certificates. No equation or step reduces a claimed prediction or validity certificate to a fitted input or self-citation by construction; the low inter-rater rho is used only to motivate skipping a gold standard, not to define the certificates themselves. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on abstract only; the self-evolving instrument and trust-by-construction certificates introduce new mechanisms whose internal parameters and validation steps are not detailed enough to enumerate free parameters or invented entities.

axioms (1)

Domain assumption: inter-rater agreement for subjective judgments is low and structured by annotator identity (rho ~ 0.45).
Invoked to justify abandoning human gold standards in favor of the new instrument.

Reference graph

Works this paper leans on

140 extracted references · 51 canonical work pages · 26 internal anchors

[1] Ryan Prescott Adams and David J. C. MacKay. Bayesian online changepoint detection. arXiv preprint arXiv:0710.3742, 2007.
[2] Anonymous. MentalChat16K: A benchmark dataset for conversational mental health assistance. arXiv preprint arXiv:2503.13509, 2025.
[3] Anonymous. PsychiatryBench: A multi-task benchmark for LLMs in psychiatry. arXiv preprint arXiv:2509.09711, 2025.
[4] Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction. arXiv preprint arXiv:2406.11717, 2024.
[5] Lora Aroyo, Alex S. Taylor, Mark Díaz, Christopher M. Homan, Alicia Parrish, Greg Serapio-García, Vinodkumar Prabhakaran, and Ding Wang. DICES dataset: Diversity in conversational AI evaluation for safety. arXiv preprint arXiv:2306.11247, 2023.
[6] Lora Aroyo and Chris Welty. Truth is a lie: Crowd truth and the seven myths of human annotation. In AI Magazine, volume 36, pages 15–24, 2015.
[7] CounselBench Authors. CounselBench: A large-scale expert evaluation and adversarial benchmark of large language models in mental health counseling. arXiv preprint arXiv:2506.08584, 2025.
[8] Akram Badawi, Md Tahmid Rahman Laskar, Hossein Rahimi, et al. Assessing the quality of large language models for mental health support: A multi-attribute evaluation. arXiv preprint arXiv:2601.18630, 2026.
[9] Ge Bai, Jie Liu, Xingyuan Bu, Yancheng He, Jiaheng Liu, Zhanhui Zhou, Zhuoran Lin, Wenbo Su, Tiezheng Ge, Bo Zheng, and Wanli Ouyang. MT-Bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues. In ACL, 2024.
[10] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
[11] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022.
[12] Yushi Bai et al. Benchmarking foundation models with language-model-as-an-examiner. arXiv preprint arXiv:2306.04181, 2024.
[13] Valerio Basile, Michael Fell, Tommaso Fornaciari, Dirk Hovy, Silviu Paun, Barbara Plank, Massimo Poesio, and Alexandra Uma. We need to consider disagreement in evaluation. In BPPF Workshop, ACL, 2021.
[14] Anna Bavaresco, Raffaella Bernardi, Leonardo Bertolazzi, Desmond Elliott, Raquel Fernández, Albert Gatt, et al. LLMs instead of human judges? a large scale empirical study across 20 NLP evaluation tasks. In ACL, 2025.
[15] Anya Belz, Shubham Agarwal, Anastasia Shimorina, and Ehud Reiter. A systematic review of reproducibility research in natural language processing. In EACL, 2021.
[16] Allan Birnbaum. Some latent trait models and their use in inferring an examinee’s ability. In Statistical Theories of Mental Test Scores, pages 397–479. Addison-Wesley, 1968.
[17] Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
[18] Robert L. Brennan. Generalizability Theory. Springer, New York, 2001.
[19] William Brown. Some experimental results in the correlation of mental abilities. British Journal of Psychology, 3(3):296–322, 1910.
[20] Brant R. Burleson. Emotional support skills. In Handbook of Communication and Social Interaction Skills. Lawrence Erlbaum Associates, 2003.
[22] Ethan Caballero, Kshitij Gupta, Irina Rish, and David Krueger. Broken neural scaling laws. ICLR 2023 (also arXiv:2210.14891), 2023.
[23] Federico Cabitza, Andrea Campagner, and Valerio Basile. Toward a perspectivist turn in ground truthing for predictive computing. In AAAI, 2023.
[24] Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, et al. Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv preprint arXiv:2307.15217, 2023.
[25] Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. ChatEval: Towards better LLM-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201, 2023.
[26] Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, and Benyou Wang. Humans or LLMs as the judge? a study on judgement biases. In EMNLP, 2024.
[27] Cheng-Han Chiang and Hung-yi Lee. Can large language models be an alternative to human evaluations? In ACL, 2023.
[28] Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot Arena: An open platform for evaluating LLMs by human preference. arXiv preprint arXiv:2403.04132, 2024.
[29] Norman Cliff. Dominance statistics: Ordinal analyses to answer ordinal questions. Psychological Bulletin, 114(3):494–509, 1993.
[30] Norman Cliff. Ordinal Methods for Behavioral Data Analysis. Lawrence Erlbaum Associates, 1996.
[31] Kevin Bretonnel Cohen, Jingbo Xia, Pierre Zweigenbaum, Tiffany J. Callahan, Orin Hargraves, Foster Goss, Nancy Ide, Aurélie Névéol, Cyril Grouin, and Lawrence E. Hunter. Three dimensions of reproducibility in natural language processing. In LREC, 2018.
[32] Jonathan Cook, Tim Rocktäschel, Jakob Foerster, Dennis Aumiller, and Alex Wang. TICKing all the boxes: Generated checklists improve LLM evaluation and generation. arXiv preprint arXiv:2410.03608, 2024.
[33] Lee J. Cronbach and Paul E. Meehl. Construct validity in psychological tests. Psychological Bulletin, 52(4):281–302, 1955.
[34] Aida Mostafazadeh Davani, Mark Díaz, and Vinodkumar Prabhakaran. Dealing with disagreements: Looking beyond the majority vote in subjective annotations. In NAACL, 2022.

2022
[35]

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

Carson Denison, Monte MacDiarmid, Fazl Barez, David Duvenaud, Shauna Kravec, Samuel Marks, et al. Sycophancy to subterfuge: Investigating reward-tampering in large language models.arXiv preprint arXiv:2406.10162, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[36]

Jesse Dodge, Suchin Gururangan, Dallas Card, Roy Schwartz, and Noah A. Smith. Show your work: Improved reporting of experimental results. InEMNLP-IJCNLP, pages 2185–2194, 2019

2019
[37]

Understanding emergent abilities of lan- guage models from the loss perspective

Zhengxiao Du, Aohan Zeng, Yuxiao Dong, and Jie Tang. Understanding emergent abilities of lan- guage models from the loss perspective. InNeurIPS, 2024

2024
[38]

Hashimoto

Yann Dubois, Bal ´azs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. Length-controlled Al- pacaEval: A simple way to debias automatic evaluators. InCOLM, 2024

2024
[39]

Hashimoto

Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. AlpacaFarm: A simulation framework for methods that learn from human feedback. InNeurIPS, 2023

2023
[40]

Ebel and David A

Robert L. Ebel and David A. Frisbie.Essentials of Educational Measurement. Prentice-Hall, 5th edition, 1991

1991
[41]

Tibshirani.An Introduction to the Bootstrap

Bradley Efron and Robert J. Tibshirani.An Introduction to the Bootstrap. Chapman & Hall/CRC, 1993

1993
[42]

When the majority is wrong: Modeling annotator dis- agreement for subjective tasks

Eve Fleisig, Rediet Abebe, and Dan Klein. When the majority is wrong: Modeling annotator dis- agreement for subjective tasks. InEMNLP, 2023

2023
[43]

Perspectivist approaches to natural language processing: A survey.Language Resources and Evaluation, 59(2), 2025

Simona Frenda, Gavin Abercrombie, Valerio Basile, Alessandro Pedrani, Raffaella Panizzon, Alessandra Teresa Cignarella, Cristina Frieder, and Davide Bernardi. Perspectivist approaches to natural language processing: A survey.Language Resources and Evaluation, 59(2), 2025

2025
[44]

Language models scale reliably with over-training and on downstream tasks

Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, et al. Language models scale reliably with over-training and on downstream tasks. InNeurIPS, 2024

2024
[45]

Predictability and surprise in large generative models

Deep Ganguli, Danny Hernandez, Liane Lovitt, Nova DasSarma, et al. Predictability and surprise in large generative models. InFAccT, 2022

2022
[47]

Scaling Laws for Reward Model Overoptimization

Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization.arXiv preprint arXiv:2210.10760, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[48]

Datasheets for datasets.Communications of the ACM, 64(12):86– 92, 2021

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daum´e III, and Kate Crawford. Datasheets for datasets.Communications of the ACM, 64(12):86– 92, 2021

2021
[49]

gar- den of forking paths

Andrew Gelman and Eric Loken. The statistical crisis in science: Data-dependent analysis—a “gar- den of forking paths”.American Scientist, 102(6):460–465, 2014. 24

2014
[50]

A Survey on LLM-as-a-Judge

Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, et al. A survey on LLM-as-a-judge.arXiv preprint arXiv:2411.15594, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[51]

LLM-Rubric: A multidimensional, calibrated approach to automated evaluation of natural language texts.arXiv preprint arXiv:2501.00274, 2025

Helia Hashemi, Jason Eisner, Corby Rosset, Benjamin Van Durme, and Chris Kedzie. LLM-Rubric: A multidimensional, calibrated approach to automated evaluation of natural language texts.arXiv preprint arXiv:2501.00274, 2025

work page arXiv 2025
[52]

Clara E. Hill. Manual for the hill counselor verbal response category system (revised).Unpublished manuscript, University of Maryland, 1985

1985
[53]

Train- ing compute-optimal large language models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, et al. Train- ing compute-optimal large language models. InNeurIPS, 2022

2022
[54]

Holland and Dorothy T

Paul W. Holland and Dorothy T. Thayer. Differential item performance and the Mantel-Haenszel procedure. In Howard Wainer and Henry I. Braun, editors,Test Validity, pages 129–145. Lawrence Erlbaum, 1988

1988
[55]

Predicting emergent abilities with infinite resolution evaluation

Shengding Hu, Xin Liu, Xu Han, Xinrong Zhang, Chaoqun He, et al. Predicting emergent abilities with infinite resolution evaluation. InICLR, 2024

2024
[56]

Heart: A unified benchmark for humans and llms in emotional support dialogue

Mrinank Iyer, Karan Aggarwal, Sanmi Koyejo, et al. Heart: A unified benchmark for humans and llms in emotional support dialogue. InarXiv preprint arXiv:2601.19922, 2026

work page arXiv 2026
[57]

LiveCodeBench: Holistic and contamination free eval- uation of large language models for code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free eval- uation of large language models for code. InICLR, 2025

2025
[58]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001
[59]

Rebecca Killick, Paul Fearnhead, and Idris A. Eckley. Optimal detection of changepoints with a linear computational cost.Journal of the American Statistical Association, 107(500):1590–1598, 2012

2012
[60]

Prometheus: Inducing fine-grained evaluation capability in language models

Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, et al. Prometheus: Inducing fine-grained evaluation capability in language models. InICLR, 2024

2024
[61]

Prometheus 2: An open source language model specialized in evaluating other language models

Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, and Minjoon Seo. Prometheus 2: An open source language model specialized in evaluating other language models. InEMNLP, 2024

2024
[62]

Prover-verifier games improve legibility of LLM outputs.arXiv preprint arXiv:2407.13692, 2024

Jan Hendrik Kirchner, Yining Chen, Harri Edwards, Jan Leike, Nat McAleese, and Yuri Burda. Prover-verifier games improve legibility of LLM outputs.arXiv preprint arXiv:2407.13692, 2024

work page arXiv 2024
[63]

Bench- marking cognitive biases in large language models as evaluators.arXiv preprint arXiv:2309.17012, 2024

Ryan Koo, Minhwa Lee, Vipul Raheja, Jong Inn Park, Zae Myung Kim, and Dongyeop Kang. Bench- marking cognitive biases in large language models as evaluators.arXiv preprint arXiv:2309.17012, 2024

work page arXiv 2024
[64]

Specification gaming: The flip side of AI ingenuity

Victoria Krakovna, Jonathan Uesato, Vladimir Mikulik, Matthew Rahtz, Tom Everitt, Ramana Ku- mar, Zac Kenton, Jan Leike, and Shane Legg. Specification gaming: The flip side of AI ingenuity. DeepMind Blog, 2020

2020
[65]

Computing Krippendorff’s alpha-reliability.Annenberg School for Communi- cation, University of Pennsylvania, Departmental Papers, 2011

Klaus Krippendorff. Computing Krippendorff’s alpha-reliability.Annenberg School for Communi- cation, University of Pennsylvania, Departmental Papers, 2011. 25

2011
[66]

MT-Eval: A multi-turn capabilities evaluation benchmark for large language models.arXiv preprint arXiv:2401.16745, 2024

Wai-Chung Kwan, Xingshan Zeng, Yuxin Jiang, Yufei Wang, Liangyou Li, Lifeng Shang, Xin Jiang, Qun Liu, and Kam-Fai Wong. MT-Eval: A multi-turn capabilities evaluation benchmark for large language models.arXiv preprint arXiv:2401.16745, 2024

work page arXiv 2024
[67]

Lalor, Pedro Rodriguez, Jo ˜ao Sedoc, and Jos ´e Hern´andez-Orallo

John P. Lalor, Pedro Rodriguez, Jo ˜ao Sedoc, and Jos ´e Hern´andez-Orallo. Item response theory for natural language processing. InEACL Tutorial Abstracts, 2024

2024
[68]

Lalor, Hao Wu, and Hong Yu

John P. Lalor, Hao Wu, and Hong Yu. Building an evaluation scale using item response theory. In EMNLP, pages 648–657, 2016

2016
[69]

Lalor, Hao Wu, and Hong Yu

John P. Lalor, Hao Wu, and Hong Yu. Learning latent parameters without human response patterns: Item response theory with artificial crowds. InEMNLP-IJCNLP, 2019

2019
[70]

Autobencher: Automated benchmark generation.arXiv preprint arXiv:2407.08351, 2024

Xinyu Li et al. Autobencher: Automated benchmark generation.arXiv preprint arXiv:2407.08351, 2024

work page arXiv 2024
[71]

LLM-Eval: Unified multi-dimensional automatic evaluation for open-domain conversations with large language models

Yen-Ting Lin and Yun-Nung Chen. LLM-Eval: Unified multi-dimensional automatic evaluation for open-domain conversations with large language models. InNLP4ConvAI Workshop, 2023

2023
[72]

Arenabencher: Item-evolution benchmarking via multi-model competition.arXiv preprint arXiv:2510.08569, 2025

Hao Liu et al. Arenabencher: Item-evolution benchmarking via multi-model competition.arXiv preprint arXiv:2510.08569, 2025

work page arXiv 2025
[73]

Towards emotional support dialog systems

Siyang Liu, Chujie Zheng, Orianna Demasi, Sahand Sabour, Yu Li, Zhou Yu, Yong Jiang, and Minlie Huang. Towards emotional support dialog systems. InACL, 2021

2021
[74]

G-Eval: NLG evaluation using GPT-4 with better human alignment

Yang Liu et al. G-Eval: NLG evaluation using GPT-4 with better human alignment. InEMNLP,
[75]

Calibrat- ing LLM-based evaluator

Yuxuan Liu, Tianchi Yang, Shaohan Huang, Zihan Zhang, Haizhen Huang, Furu Wei, et al. Calibrat- ing LLM-based evaluator. InLREC-COLING, 2024

2024
[76]

Lord and Melvin R

Frederic M. Lord and Melvin R. Novick.Statistical Theories of Mental Test Scores. Addison-Wesley, 1968

1968
[77]

Data contamination: From memorization to exploitation

Inbal Magar and Roy Schwartz. Data contamination: From memorization to exploitation. InACL, 2022

2022
[78]

Categorizing Variants of Goodhart's Law

David Manheim and Scott Garrabrant. Categorizing variants of Goodhart’s law.arXiv preprint arXiv:1803.04585, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[79]

Statistical aspects of the analysis of data from retrospective studies of disease.Journal of the National Cancer Institute, 22(4):719–748, 1959

Nathan Mantel and William Haenszel. Statistical aspects of the analysis of data from retrospective studies of disease.Journal of the National Cancer Institute, 22(4):719–748, 1959

1959
[80]

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, et al. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal.arXiv preprint arXiv:2402.04249, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[81]

Model cards for model reporting

Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. InFAT*, pages 220–229, 2019

2019
[82]

Munaf `o, Brian A

Marcus R. Munaf `o, Brian A. Nosek, Dorothy V . M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, et al. A manifesto for reproducible science.Nature Human Behaviour, 1(1):0021, 2017. 26

2017

[1] [1]

Ryan Prescott Adams and David J. C. MacKay. Bayesian online changepoint detection.arXiv preprint arXiv:0710.3742, 2007

work page internal anchor Pith review Pith/arXiv arXiv 2007

[2] [2]

MentalChat16K: A benchmark dataset for conversational mental health assistance

Anonymous. MentalChat16K: A benchmark dataset for conversational mental health assistance. arXiv preprint arXiv:2503.13509, 2025

work page arXiv 2025

[3] [3]

PsychiatryBench: A multi-task benchmark for LLMs in psychiatry.arXiv preprint arXiv:2509.09711, 2025

Anonymous. PsychiatryBench: A multi-task benchmark for LLMs in psychiatry.arXiv preprint arXiv:2509.09711, 2025

work page arXiv 2025

[4] [4]

Refusal in Language Models Is Mediated by a Single Direction

Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction.arXiv preprint arXiv:2406.11717, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] [5]

Taylor, Mark D´ıaz, Christopher M

Lora Aroyo, Alex S. Taylor, Mark D´ıaz, Christopher M. Homan, Alicia Parrish, Greg Serapio-Garc´ıa, Vinodkumar Prabhakaran, and Ding Wang. DICES dataset: Diversity in conversational AI evaluation for safety.arXiv preprint arXiv:2306.11247, 2023

work page arXiv 2023

[6] [6]

Truth is a lie: Crowd truth and the seven myths of human annotation

Lora Aroyo and Chris Welty. Truth is a lie: Crowd truth and the seven myths of human annotation. InAI Magazine, volume 36, pages 15–24, 2015

2015

[7] [7]

CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmarking of Large Language Models in Mental Health Question Answering

CounselBench Authors. CounselBench: A large-scale expert evaluation and adversarial benchmark of large language models in mental health counseling.arXiv preprint arXiv:2506.08584, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[8] [8]

Assessing the quality of large language models for mental health support: A multi-attribute evaluation.arXiv preprint arXiv:2601.18630, 2026

Akram Badawi, Md Tahmid Rahman Laskar, Hossein Rahimi, et al. Assessing the quality of large language models for mental health support: A multi-attribute evaluation.arXiv preprint arXiv:2601.18630, 2026

work page arXiv 2026

[9] [9]

MT-Bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues

Ge Bai, Jie Liu, Xingyuan Bu, Yancheng He, Jiaheng Liu, Zhanhui Zhou, Zhuoran Lin, Wenbo Su, Tiezheng Ge, Bo Zheng, and Wanli Ouyang. MT-Bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues. InACL, 2024

2024

[10] [10]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, et al. Train- ing a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[11] [11]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, et al. Constitutional AI: Harmlessness from AI feedback.arXiv preprint arXiv:2212.08073, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[12] [12]

Benchmarking foundation models with language-model-as-an-examiner.arXiv preprint arXiv:2306.04181, 2024

Yushi Bai et al. Benchmarking foundation models with language-model-as-an-examiner.arXiv preprint arXiv:2306.04181, 2024

work page arXiv 2024

[13] [13]

We need to consider disagreement in evaluation

Valerio Basile, Michael Fell, Tommaso Fornaciari, Dirk Hovy, Silviu Paun, Barbara Plank, Massimo Poesio, and Alexandra Uma. We need to consider disagreement in evaluation. InBPPF Workshop, ACL, 2021

2021

[14] [14]

LLMs instead of human judges? a large scale empirical study across 20 NLP evalua- tion tasks

Anna Bavaresco, Raffaella Bernardi, Leonardo Bertolazzi, Desmond Elliott, Raquel Fern ´andez, Al- bert Gatt, et al. LLMs instead of human judges? a large scale empirical study across 20 NLP evalua- tion tasks. InACL, 2025

2025

[15] [15]

A systematic review of repro- ducibility research in natural language processing

Anya Belz, Shubham Agarwal, Anastasia Shimorina, and Ehud Reiter. A systematic review of repro- ducibility research in natural language processing. InEACL, 2021. 22

2021

[16] [16]

Some latent trait models and their use in inferring an examinee’s ability

Allan Birnbaum. Some latent trait models and their use in inferring an examinee’s ability. InStatis- tical Theories of Mental Test Scores, pages 397–479. Addison-Wesley, 1968

1968

[17] [17]

Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons.Biometrika, 39(3/4):324–345, 1952

1952

[18] [18]

Brennan.Generalizability Theory

Robert L. Brennan.Generalizability Theory. Springer, New York, 2001

2001

[19] [19]

Some experimental results in the correlation of mental abilities.British Journal of Psychology, 3(3):296–322, 1910

William Brown. Some experimental results in the correlation of mental abilities.British Journal of Psychology, 3(3):296–322, 1910

1910

[20] [20]

Burleson

Brant R. Burleson. Emotional support skills. InHandbook of Communication and Social Interaction Skills. Lawrence Erlbaum Associates, 2003

2003

[21] [22]

Broken neural scaling laws.ICLR 2023 (also arXiv:2210.14891), 2023

Ethan Caballero, Kshitij Gupta, Irina Rish, and David Krueger. Broken neural scaling laws.ICLR 2023 (also arXiv:2210.14891), 2023

work page arXiv 2023

[22] [23]

Toward a perspectivist turn in ground truthing for predictive computing

Federico Cabitza, Andrea Campagner, and Valerio Basile. Toward a perspectivist turn in ground truthing for predictive computing. InAAAI, 2023

2023

[23] [24]

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, J´er´emy Scheurer, Javier Rando, et al. Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv preprint arXiv:2307.15217, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[24] [25]

ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. ChatEval: Towards better LLM-based evaluators through multi-agent debate.arXiv preprint arXiv:2308.07201, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[25] [26]

Humans or LLMs as the judge? a study on judgement biases

Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, and Benyou Wang. Humans or LLMs as the judge? a study on judgement biases. InEMNLP, 2024

2024

[26] [27]

Can large language models be an alternative to human evalua- tions? InACL, 2023

Cheng-Han Chiang and Hung-yi Lee. Can large language models be an alternative to human evalua- tions? InACL, 2023

2023

[27] [28]

Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot Arena: An open platform for evaluating LLMs by human preference.arXiv preprint arXiv:2403.04132, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[28] [29]

Dominance statistics: Ordinal analyses to answer ordinal questions.Psychological Bulletin, 114(3):494–509, 1993

Norman Cliff. Dominance statistics: Ordinal analyses to answer ordinal questions.Psychological Bulletin, 114(3):494–509, 1993

1993

[29] [30]

Lawrence Erlbaum Associates, 1996

Norman Cliff.Ordinal Methods for Behavioral Data Analysis. Lawrence Erlbaum Associates, 1996

1996

[30] [31]

Callahan, Orin Hargraves, Fos- ter Goss, Nancy Ide, Aur ´elie N´ev´eol, Cyril Grouin, and Lawrence E

Kevin Bretonnel Cohen, Jingbo Xia, Pierre Zweigenbaum, Tiffany J. Callahan, Orin Hargraves, Fos- ter Goss, Nancy Ide, Aur ´elie N´ev´eol, Cyril Grouin, and Lawrence E. Hunter. Three dimensions of reproducibility in natural language processing. InLREC, 2018

2018

[31] [32]

TICK- ing all the boxes: Generated checklists improve LLM evaluation and generation.arXiv preprint arXiv:2410.03608, 2024

Jonathan Cook, Tim Rockt ¨aschel, Jakob Foerster, Dennis Aumiller, and Alex Wang. TICK- ing all the boxes: Generated checklists improve LLM evaluation and generation.arXiv preprint arXiv:2410.03608, 2024. 23

work page arXiv 2024

[32] [33]

Cronbach and Paul E

Lee J. Cronbach and Paul E. Meehl. Construct validity in psychological tests.Psychological Bulletin, 52(4):281–302, 1955

1955

[33] [34]

Dealing with disagreements: Looking beyond the majority vote in subjective annotations

Aida Mostafazadeh Davani, Mark D ´ıaz, and Vinodkumar Prabhakaran. Dealing with disagreements: Looking beyond the majority vote in subjective annotations. InNAACL, 2022

2022

[34] [35]

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

Carson Denison, Monte MacDiarmid, Fazl Barez, David Duvenaud, Shauna Kravec, Samuel Marks, et al. Sycophancy to subterfuge: Investigating reward-tampering in large language models.arXiv preprint arXiv:2406.10162, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[35] [36]

Jesse Dodge, Suchin Gururangan, Dallas Card, Roy Schwartz, and Noah A. Smith. Show your work: Improved reporting of experimental results. InEMNLP-IJCNLP, pages 2185–2194, 2019

2019

[36] [37]

Understanding emergent abilities of lan- guage models from the loss perspective

Zhengxiao Du, Aohan Zeng, Yuxiao Dong, and Jie Tang. Understanding emergent abilities of lan- guage models from the loss perspective. InNeurIPS, 2024

2024

[37] [38]

Hashimoto

Yann Dubois, Bal ´azs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. Length-controlled Al- pacaEval: A simple way to debias automatic evaluators. InCOLM, 2024

2024

[38] [39]

Hashimoto

Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. AlpacaFarm: A simulation framework for methods that learn from human feedback. InNeurIPS, 2023

2023

[39] [40]

Ebel and David A

Robert L. Ebel and David A. Frisbie.Essentials of Educational Measurement. Prentice-Hall, 5th edition, 1991

1991

[40] [41]

Tibshirani.An Introduction to the Bootstrap

Bradley Efron and Robert J. Tibshirani.An Introduction to the Bootstrap. Chapman & Hall/CRC, 1993

1993

[41] [42]

When the majority is wrong: Modeling annotator dis- agreement for subjective tasks

Eve Fleisig, Rediet Abebe, and Dan Klein. When the majority is wrong: Modeling annotator dis- agreement for subjective tasks. InEMNLP, 2023

2023

[42] [43]

Perspectivist approaches to natural language processing: A survey.Language Resources and Evaluation, 59(2), 2025

Simona Frenda, Gavin Abercrombie, Valerio Basile, Alessandro Pedrani, Raffaella Panizzon, Alessandra Teresa Cignarella, Cristina Frieder, and Davide Bernardi. Perspectivist approaches to natural language processing: A survey.Language Resources and Evaluation, 59(2), 2025

2025

[43] [44]

Language models scale reliably with over-training and on downstream tasks

Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, et al. Language models scale reliably with over-training and on downstream tasks. InNeurIPS, 2024

2024

[44] [45]

Predictability and surprise in large generative models

Deep Ganguli, Danny Hernandez, Liane Lovitt, Nova DasSarma, et al. Predictability and surprise in large generative models. InFAccT, 2022

2022

[45] [47]

Scaling Laws for Reward Model Overoptimization

Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization.arXiv preprint arXiv:2210.10760, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[46] [48]

Datasheets for datasets.Communications of the ACM, 64(12):86– 92, 2021

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daum´e III, and Kate Crawford. Datasheets for datasets.Communications of the ACM, 64(12):86– 92, 2021

2021

[47] [49]

gar- den of forking paths

Andrew Gelman and Eric Loken. The statistical crisis in science: Data-dependent analysis—a “gar- den of forking paths”.American Scientist, 102(6):460–465, 2014. 24

2014

[48] [50]

A Survey on LLM-as-a-Judge

Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, et al. A survey on LLM-as-a-judge.arXiv preprint arXiv:2411.15594, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[49] [51]

LLM-Rubric: A multidimensional, calibrated approach to automated evaluation of natural language texts.arXiv preprint arXiv:2501.00274, 2025

Helia Hashemi, Jason Eisner, Corby Rosset, Benjamin Van Durme, and Chris Kedzie. LLM-Rubric: A multidimensional, calibrated approach to automated evaluation of natural language texts.arXiv preprint arXiv:2501.00274, 2025

work page arXiv 2025

[50] [52]

Clara E. Hill. Manual for the hill counselor verbal response category system (revised).Unpublished manuscript, University of Maryland, 1985

1985

[51] [53]

Train- ing compute-optimal large language models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, et al. Train- ing compute-optimal large language models. InNeurIPS, 2022

2022

[52] [54]

Holland and Dorothy T

Paul W. Holland and Dorothy T. Thayer. Differential item performance and the Mantel-Haenszel procedure. In Howard Wainer and Henry I. Braun, editors,Test Validity, pages 129–145. Lawrence Erlbaum, 1988

1988

[53] [55]

Predicting emergent abilities with infinite resolution evaluation

Shengding Hu, Xin Liu, Xu Han, Xinrong Zhang, Chaoqun He, et al. Predicting emergent abilities with infinite resolution evaluation. InICLR, 2024

2024

[54] [56]

Heart: A unified benchmark for humans and llms in emotional support dialogue

Mrinank Iyer, Karan Aggarwal, Sanmi Koyejo, et al. Heart: A unified benchmark for humans and llms in emotional support dialogue. InarXiv preprint arXiv:2601.19922, 2026

work page arXiv 2026

[55] [57]

LiveCodeBench: Holistic and contamination free eval- uation of large language models for code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free eval- uation of large language models for code. InICLR, 2025

2025

[56] [58]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001

[57] [59]

Rebecca Killick, Paul Fearnhead, and Idris A. Eckley. Optimal detection of changepoints with a linear computational cost.Journal of the American Statistical Association, 107(500):1590–1598, 2012

2012

[58] [60]

Prometheus: Inducing fine-grained evaluation capability in language models

Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, et al. Prometheus: Inducing fine-grained evaluation capability in language models. InICLR, 2024

2024

[59] [61]

Prometheus 2: An open source language model specialized in evaluating other language models

Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, and Minjoon Seo. Prometheus 2: An open source language model specialized in evaluating other language models. InEMNLP, 2024

2024

[60] [62]

Prover-verifier games improve legibility of LLM outputs.arXiv preprint arXiv:2407.13692, 2024

Jan Hendrik Kirchner, Yining Chen, Harri Edwards, Jan Leike, Nat McAleese, and Yuri Burda. Prover-verifier games improve legibility of LLM outputs.arXiv preprint arXiv:2407.13692, 2024

work page arXiv 2024

[61] [63]

Bench- marking cognitive biases in large language models as evaluators.arXiv preprint arXiv:2309.17012, 2024

Ryan Koo, Minhwa Lee, Vipul Raheja, Jong Inn Park, Zae Myung Kim, and Dongyeop Kang. Bench- marking cognitive biases in large language models as evaluators.arXiv preprint arXiv:2309.17012, 2024

work page arXiv 2024

[62] [64]

Specification gaming: The flip side of AI ingenuity

Victoria Krakovna, Jonathan Uesato, Vladimir Mikulik, Matthew Rahtz, Tom Everitt, Ramana Ku- mar, Zac Kenton, Jan Leike, and Shane Legg. Specification gaming: The flip side of AI ingenuity. DeepMind Blog, 2020

2020

[63] [65]

Computing Krippendorff’s alpha-reliability.Annenberg School for Communi- cation, University of Pennsylvania, Departmental Papers, 2011

Klaus Krippendorff. Computing Krippendorff’s alpha-reliability.Annenberg School for Communi- cation, University of Pennsylvania, Departmental Papers, 2011. 25

2011

[64] [66]

MT-Eval: A multi-turn capabilities evaluation benchmark for large language models.arXiv preprint arXiv:2401.16745, 2024

Wai-Chung Kwan, Xingshan Zeng, Yuxin Jiang, Yufei Wang, Liangyou Li, Lifeng Shang, Xin Jiang, Qun Liu, and Kam-Fai Wong. MT-Eval: A multi-turn capabilities evaluation benchmark for large language models.arXiv preprint arXiv:2401.16745, 2024

work page arXiv 2024

[65] [67]

Lalor, Pedro Rodriguez, Jo ˜ao Sedoc, and Jos ´e Hern´andez-Orallo

John P. Lalor, Pedro Rodriguez, Jo ˜ao Sedoc, and Jos ´e Hern´andez-Orallo. Item response theory for natural language processing. InEACL Tutorial Abstracts, 2024

2024

[66] [68]

John P. Lalor, Hao Wu, and Hong Yu. Building an evaluation scale using item response theory. In EMNLP, pages 648–657, 2016

[67] [69]

John P. Lalor, Hao Wu, and Hong Yu. Learning latent parameters without human response patterns: Item response theory with artificial crowds. In EMNLP-IJCNLP, 2019

[68] [70]

Xinyu Li et al. Autobencher: Automated benchmark generation. arXiv preprint arXiv:2407.08351, 2024

[69] [71]

Yen-Ting Lin and Yun-Nung Chen. LLM-Eval: Unified multi-dimensional automatic evaluation for open-domain conversations with large language models. In NLP4ConvAI Workshop, 2023

[70] [72]

Hao Liu et al. Arenabencher: Item-evolution benchmarking via multi-model competition. arXiv preprint arXiv:2510.08569, 2025

[71] [73]

Siyang Liu, Chujie Zheng, Orianna Demasi, Sahand Sabour, Yu Li, Zhou Yu, Yong Jiang, and Minlie Huang. Towards emotional support dialog systems. In ACL, 2021

[72] [74]

Yang Liu et al. G-Eval: NLG evaluation using GPT-4 with better human alignment. In EMNLP,

[73] [75]

Yuxuan Liu, Tianchi Yang, Shaohan Huang, Zihan Zhang, Haizhen Huang, Furu Wei, et al. Calibrating LLM-based evaluator. In LREC-COLING, 2024

[74] [76]

Frederic M. Lord and Melvin R. Novick. Statistical Theories of Mental Test Scores. Addison-Wesley, 1968

[75] [77]

Inbal Magar and Roy Schwartz. Data contamination: From memorization to exploitation. In ACL, 2022

[76] [78]

David Manheim and Scott Garrabrant. Categorizing variants of Goodhart’s law. arXiv preprint arXiv:1803.04585, 2018

[77] [79]

Nathan Mantel and William Haenszel. Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22(4):719–748, 1959

[78] [80]

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, et al. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249, 2024

[79] [81]

Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. In FAT*, pages 220–229, 2019

[80] [82]

Marcus R. Munafò, Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, et al. A manifesto for reproducible science. Nature Human Behaviour, 1(1):0021, 2017