Pith · machine review for the scientific record

arxiv: 2605.04972 · v1 · submitted 2026-05-06 · 💻 cs.CL

Recognition: unknown

Why Expert Alignment Is Hard: Evidence from Subjective Evaluation

Chung-Chi Chen, Lung-Hao Lee, Tatsuya Ishigaki, Tzu-Mi Lin, Wataru Hirota

Pith reviewed 2026-05-08 16:07 UTC · model grok-4.3

classification 💻 cs.CL
keywords expert alignment · subjective evaluation · language model alignment · tacit criteria · evaluation dimensions · human-AI alignment · judgment variability

The pith

Subjective expert judgment varies widely across individuals, resists full capture by explicit rules, and shifts over time and dimension, making reliable alignment with language models difficult.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how different forms of expert input affect language model alignment on subjective evaluation tasks. It runs expert ratings with and without explicit criteria, followed by questionnaires and controlled editing experiments, and tracks performance across multiple dimensions and experts. Four patterns emerge consistently: alignment difficulty differs markedly by expert, explicit criteria and reasoning often add little, small edits give temporary but unstable gains, and dimensions tied directly to content align better than those involving external knowledge or values. These results point to the conclusion that the core obstacle is not only model limits but the heterogeneous, partly tacit, dimension-dependent, and temporally unstable character of subjective judgment itself.

Core claim

The authors establish that expert alignment in subjective tasks is limited by the inherent character of expert judgment: it differs substantially across experts, is not fully expressed by verbalized criteria, responds unstably to small numbers of editing examples, and is easier on dimensions grounded in proposal content than on those requiring external knowledge or value judgments. Together these patterns indicate that subjective evaluation is heterogeneous, partly tacit, dimension-dependent, and temporally unstable.

What carries the argument

Controlled comparisons of model alignment performance with and without explicit expert criteria, across varying numbers and identities of editing examples, and across distinct evaluation dimensions, supplemented by follow-up questionnaires that probe the tacit elements of judgment.
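
To make the comparison structure concrete, here is a minimal Python sketch of how such a grid of conditions could be scored. It is illustrative only: the paper does not publish code, and the exact-match accuracy metric, function names, and condition labels below are assumptions, not the authors' implementation.

    def alignment_accuracy(model_ratings, expert_ratings):
        """Fraction of items where the model's rating matches the expert's.

        Exact match is one plausible agreement metric; the paper may use
        a different one.
        """
        assert len(model_ratings) == len(expert_ratings)
        matches = sum(m == e for m, e in zip(model_ratings, expert_ratings))
        return matches / len(expert_ratings)

    def score_grid(runs):
        """Score every (expert, condition, dimension) cell of the study.

        `runs` maps (expert_id, condition, dimension) to a pair of
        rating lists (model_ratings, expert_ratings). Condition labels
        are hypothetical stand-ins for the comparisons described above,
        e.g. "no_criteria", "with_criteria", "edits_1", "edits_5".
        """
        return {cell: alignment_accuracy(model, expert)
                for cell, (model, expert) in runs.items()}

Reading the grid along the condition axis for a fixed expert reproduces the criteria and editing contrasts; reading along the expert axis for a fixed condition exposes the heterogeneity the paper reports.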

If this is right

  • Alignment methods must accommodate large individual differences in expert evaluation styles rather than assuming a uniform target.
  • Verbalizing criteria and reasoning captures only part of expert judgment and cannot be relied on to close the alignment gap.
  • Gains from providing editing examples remain small and unstable, so repeated or carefully chosen examples are needed to maintain improvements.
  • Dimensions based directly on proposal content will align more readily than those involving external knowledge or value-based assessment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Alignment systems could incorporate explicit modeling of expert disagreement and temporal drift instead of treating judgments as fixed targets (a minimal sketch of this idea follows the list).
  • Similar heterogeneity may appear in other subjective domains such as ethical review or creative assessment, suggesting the need for uncertainty-aware alignment techniques.
  • Benchmarks for subjective tasks should include repeated evaluations over time and multiple experts to avoid overestimating alignment success.
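
As promised in the first bullet: one concrete way to model disagreement and drift is to replace a single fixed rating with a weighted summary over experts and time. Everything here is an editorial illustration rather than anything the paper proposes; the exponential down-weighting and the half-life value are invented for the example.

    import math

    def soft_target(ratings, half_life_days=30.0):
        """Summarize disagreeing, drifting expert ratings as (mean, spread).

        `ratings` is a list of (rating, days_ago) pairs pooled across
        experts for one item. Older ratings are down-weighted
        exponentially as a crude stand-in for temporal drift; the
        half-life is an arbitrary illustrative choice.
        """
        weights = [math.exp(-math.log(2) * d / half_life_days) for _, d in ratings]
        values = [r for r, _ in ratings]
        total = sum(weights)
        mu = sum(w * v for w, v in zip(weights, values)) / total
        var = sum(w * (v - mu) ** 2 for w, v in zip(weights, values)) / total
        return mu, math.sqrt(var)

A large spread flags items where treating any single expert's rating as the alignment target would overstate how well-defined that target is.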

Load-bearing premise

The patterns found with the specific experts, tasks, questionnaires, and editing experiments reflect general properties of subjective judgment rather than being limited to this particular study setup.

What would settle it

A replication with a different and larger group of experts on new subjective tasks that produced stable high alignment across all conditions, regardless of criteria or edits, would falsify the claim that subjectivity itself drives the difficulty.

Figures

Figures reproduced from arXiv: 2605.04972 by Chung-Chi Chen, Lung-Hao Lee, Tatsuya Ishigaki, Tzu-Mi Lin, Wataru Hirota.

Figure 1. Average alignment accuracy as a function of … (caption truncated in source) · view at source ↗
Figure 2. Mean accuracy gain from a single edit ex… (caption truncated in source) · view at source ↗
original abstract

Aligning large language models with expert judgment is especially difficult in subjective evaluation tasks, where experts may disagree, rely on tacit criteria, and change their judgments over time. In this paper, we study expert alignment as a way to understand this difficulty. Using expert evaluations and follow-up questionnaires, we examine how different forms of expert information affect alignment and what this reveals about subjective judgment. Our findings show four consistent patterns. First, alignment difficulty varies substantially across experts, suggesting that expert evaluation styles differ widely in their distance from a model's prior behavior. Second, explicit criteria and reasoning do not always improve alignment, indicating that expert judgment is not fully captured by verbalized rules. Third, editing is sensitive to both the number and the identity of examples, with small numbers of edits providing useful but unstable gains. Fourth, alignment difficulty differs across evaluation dimensions: dimensions grounded more directly in proposal content are easier to align, while dimensions requiring external knowledge or value-based judgment remain harder. Taken together, these results suggest that expert alignment is difficult not only because of model limitations, but also because subjective evaluation is inherently heterogeneous, partly tacit, dimension-dependent, and temporally unstable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that expert alignment with LLMs in subjective evaluation tasks is difficult due to inherent properties of subjective judgment. Through expert evaluations, follow-up questionnaires, and editing experiments, it identifies four consistent patterns: substantial variation in alignment difficulty across experts, limited or inconsistent gains from providing explicit criteria and reasoning, sensitivity of editing-based alignment to the number and identity of examples, and greater alignment difficulty on dimensions requiring external knowledge or value judgments compared to those grounded in proposal content. The authors conclude that subjective evaluation is inherently heterogeneous, partly tacit, dimension-dependent, and temporally unstable.

Significance. If the patterns are shown to be robust and not artifacts of the specific study setup, this work would make a meaningful contribution to AI alignment research in NLP by shifting emphasis from model-centric fixes to the fundamental characteristics of human subjective judgment. The empirical focus on real expert interactions and dimension-specific analysis provides concrete observations that could inform more effective alignment protocols for subjective tasks.

major comments (2)
  1. [Abstract] The abstract reports 'four consistent patterns' and concludes that subjective evaluation is 'inherently' heterogeneous, partly tacit, dimension-dependent, and temporally unstable, but supplies no information on sample size, expert sampling method, statistical analysis, controls, exclusion criteria, or how temporal instability was measured. This is load-bearing for the central claim, as the generalization from observed patterns to inherent properties of judgment requires evidence that the results are not limited to the particular experts, tasks, or questionnaires used.
  2. [Findings and Discussion] The claim that the patterns reflect general properties rather than study-specific factors rests on the absence of reported details separating individual differences from broader effects or demonstrating replication across domains. Without these, the attribution of alignment difficulty to inherent properties of subjective evaluation remains an extrapolation whose validity cannot be assessed from the reported evidence.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by a brief statement of study scale (e.g., number of experts or evaluations performed) to contextualize the strength of the 'consistent patterns' reported.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the careful and constructive review. We agree that the abstract requires additional methodological context to support the central claims and that the discussion should more explicitly address the scope and limitations of generalizing from this study. We will make revisions to both sections as detailed below.

point-by-point responses
  1. Referee: [Abstract] The abstract reports 'four consistent patterns' and concludes that subjective evaluation is 'inherently' heterogeneous, partly tacit, dimension-dependent, and temporally unstable, but supplies no information on sample size, expert sampling method, statistical analysis, controls, exclusion criteria, or how temporal instability was measured. This is load-bearing for the central claim, as the generalization from observed patterns to inherent properties of judgment requires evidence that the results are not limited to the particular experts, tasks, or questionnaires used.

    Authors: We accept that the abstract, in its current concise form, omits key study parameters. In the revised manuscript we will expand the abstract to report the sample size (12 experts), the sampling approach (purposive selection of domain specialists with at least five years of proposal-review experience), the method for assessing temporal instability (repeated questionnaires separated by two weeks), and a brief statement that no formal statistical controls or exclusion criteria beyond standard data-quality checks were applied. We will also replace the word 'inherently' with 'appears to exhibit' to signal that the four patterns are empirical observations from this study rather than proven universal features of subjective judgment. [A sketch of one such stability metric follows these responses.] revision: yes

  2. Referee: [Findings and Discussion] The claim that the patterns reflect general properties rather than study-specific factors rests on the absence of reported details separating individual differences from broader effects or demonstrating replication across domains. Without these, the attribution of alignment difficulty to inherent properties of subjective evaluation remains an extrapolation whose validity cannot be assessed from the reported evidence.

    Authors: We agree that the manuscript does not perform formal variance decomposition to isolate individual differences from domain-level effects, nor does it replicate the protocol in additional domains. The current evidence consists of descriptive consistency across the 12 experts and four evaluation dimensions within a single task (research-proposal assessment). In the revised discussion we will add an explicit limitations subsection that (a) reports the absence of cross-domain replication, (b) notes that individual-expert variability was examined only descriptively, and (c) frames the four patterns as suggestive rather than definitive evidence for broader properties of subjective judgment. We will also reference relevant literature on expert decision-making to contextualize why the observed heterogeneity and tacitness are theoretically plausible, while clearly stating that stronger claims require future multi-domain studies. [A toy version of such a variance decomposition also follows these responses.] revision: partial
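
The stability metric referenced in response 1 could take a form like the following. This is a sketch of one plausible measure under the repeated-questionnaire design described there, not the authors' analysis (and the rebuttal itself is simulated).

    def test_retest(round1, round2):
        """Compare one expert's ratings across two questionnaire rounds.

        `round1` and `round2` are equal-length lists of numeric ratings
        for the same items, e.g. collected two weeks apart. Returns the
        exact agreement rate and the mean absolute shift; low agreement
        and high shift indicate temporal instability.
        """
        assert len(round1) == len(round2)
        n = len(round1)
        agree = sum(a == b for a, b in zip(round1, round2)) / n
        drift = sum(abs(a - b) for a, b in zip(round1, round2)) / n
        return agree, drift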
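
Likewise, the variance decomposition referenced in response 2 could, in its simplest one-way form, look like this. The data layout and the use of per-item alignment errors are assumptions; the paper reports no such computation.

    from statistics import mean

    def decompose_variance(errors_by_expert):
        """Split alignment-error variance into between- and within-expert parts.

        `errors_by_expert` maps an expert id to a list of per-item
        alignment errors. The two components sum to the total (population)
        variance; a large between-expert share supports heterogeneity
        across experts over task-level noise.
        """
        expert_means = {e: mean(v) for e, v in errors_by_expert.items()}
        grand = mean(x for v in errors_by_expert.values() for x in v)
        n = sum(len(v) for v in errors_by_expert.values())
        between = sum(len(v) * (expert_means[e] - grand) ** 2
                      for e, v in errors_by_expert.items()) / n
        within = sum(sum((x - expert_means[e]) ** 2 for x in v)
                     for e, v in errors_by_expert.items()) / n
        return between, within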

standing simulated objections (1)
  • Replication of the observed patterns across multiple distinct domains or task types, which was outside the scope of the present single-domain study.

Circularity Check

0 steps flagged

No circularity: empirical reporting of experimental patterns

full rationale

This is an empirical study reporting observations from expert evaluations, questionnaires, and editing experiments on model alignment. No mathematical derivations, parameter fittings, or predictions are present that could reduce to inputs by construction. The four patterns are stated as direct findings from the data, and the interpretive summary that subjective evaluation is 'inherently' heterogeneous etc. is an extrapolation from those observations rather than a self-definitional or self-citation load-bearing step. The derivation chain consists solely of experimental results and does not invoke any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on empirical observations from expert studies rather than new theoretical entities or fitted parameters.

axioms (1)
  • domain assumption: Expert judgments in subjective tasks can be meaningfully quantified and compared to model outputs to measure alignment.
    The study design and findings depend on the ability to measure and compare expert and model evaluations.

pith-pipeline@v0.9.0 · 5512 in / 1225 out tokens · 71426 ms · 2026-05-08T16:07:35.848433+00:00 · methodology

