pith. machine review for the scientific record.

arxiv: 2604.05593 · v1 · submitted 2026-04-07 · 💻 cs.AI · cs.CL

Recognition: no theorem link

Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge

Abdallah El Ali, Di Wu, Isao Echizen, Saku Sugawara, Sijing Qin, Xin Sun

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:34 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords: label bias · LLM-as-a-Judge · trust assessment · heuristic cue · source disclosure · AI evaluation · human-AI comparison · attention patterns

The pith

Both humans and LLMs assign higher trust to identical information when labeled human-authored than when labeled AI-generated.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether disclosed source labels shape trust judgments in evaluation settings. It presents the same content under swapped labels and measures responses from people and from LLMs used as judges. Both groups show higher trust for the human label. Eye-tracking and internal model analysis reveal that labels function as quick heuristic cues, with attention and uncertainty patterns aligning between humans and models. The finding questions the fairness of label-exposed LLM evaluations and suggests that aligning models to human judgments may carry over the same shortcut reliance.

Core claim

Using a counterfactual setup that holds all text constant and varies only the source label, the work shows that trust ratings rise when content is marked as human-authored and fall when marked as AI-generated. This pattern appears in both human participants and LLM judges. Model attention concentrates more on the label region than the content region, with stronger label focus under human labels, while decision logits indicate greater uncertainty under AI labels. These internal patterns match the human eye-tracking data, indicating that the source label acts as a shared heuristic cue.
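
These internal measurements can be approximated with standard open-model tooling. The sketch below is an illustrative reconstruction, not the paper's pipeline: the model name, prompt wording, and token-index spans for the label and content regions are all placeholder assumptions. It computes the per-token attention density LogRatio between the two regions at the judgment position, plus the entropy of the next-token distribution there.

```python
# Illustrative sketch (not the paper's code): label-vs-content attention
# density and logit entropy at the judgment step for a single prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder judge model, not one of the paper's judges
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

prompt = ("Source: Human-written.\n"
          "Answer: Aspirin can raise bleeding risk in some patients.\n"
          "Rate your trust in this answer from 1 to 7:")
inputs = tok(prompt, return_tensors="pt")

# Hypothetical token-index spans for the label and content regions (AoAs);
# in practice these would be located from tokenizer offsets.
label_span = list(range(1, 6))
content_span = list(range(8, 20))

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# Attention paid by the final (judgment) position, averaged over layers and heads.
att = torch.stack(out.attentions)                  # (layers, batch, heads, seq, seq)
judge_att = att[:, 0, :, -1, :].mean(dim=(0, 1))   # (seq,)

# Density = mean attention per token in each region; LogRatio > 0 is label-dominant.
log_ratio = torch.log(judge_att[label_span].mean() / judge_att[content_span].mean())

# Decision uncertainty: entropy of the next-token distribution at the judgment step.
probs = torch.softmax(out.logits[0, -1], dim=-1)
entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum()

print(f"attention LogRatio = {log_ratio.item():.3f}, logit entropy = {entropy.item():.3f}")
```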

What carries the argument

The counterfactual design that isolates the source label by presenting identical content under human versus AI authorship disclosures.
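
To make that design concrete, the snippet below sketches how counterfactual stimulus pairs could be assembled: the question-answer content is held fixed and only the disclosure sentence varies. The disclosure wording and field names are illustrative assumptions, not the paper's materials.

```python
# Illustrative counterfactual pairing: identical content, only the disclosed label varies.
from itertools import product

DISCLOSURES = {
    "Human": "This answer was written by a human expert.",  # assumed wording
    "AI": "This answer was generated by an AI system.",     # assumed wording
}

def make_stimuli(items):
    """items: dicts with 'question' and 'answer'; content is held constant across labels."""
    stimuli = []
    for item, (label, disclosure) in product(items, DISCLOSURES.items()):
        stimuli.append({
            "label": label,
            "question": item["question"],
            "answer": item["answer"],  # identical text under both label conditions
            "prompt": (f"{disclosure}\n\nQ: {item['question']}\nA: {item['answer']}\n"
                       "How much do you trust this answer (1-7)?"),
        })
    return stimuli

pairs = make_stimuli([{
    "question": "Is daily aspirin safe for everyone?",
    "answer": "No; it can increase bleeding risk, so a clinician should be consulted.",
}])
```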

If this is right

  • LLM-as-a-Judge systems may systematically undervalue AI-generated outputs when source labels are visible.
  • Alignment procedures that train on human preferences risk embedding label-based heuristics into model behavior.
  • Evaluation validity suffers when labels are disclosed, because judgments track the label more than the content.
  • Attention and uncertainty metrics in LLMs can serve as detectable signals of this heuristic reliance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same label effect could appear in other judgment tasks such as quality scoring or fact-checking beyond trust.
  • Blinding source labels during both human and model evaluation might eliminate the bias and produce more content-focused assessments (a sketch of one such blinding pass follows this list).
  • If training data contain labeled examples, models may learn to overweight labels even when labels are not explicitly provided at inference time.
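
On the blinding point above, one low-cost probe is to strip or neutralize the disclosure before the judge sees the prompt. The helper below is a hypothetical illustration of that idea; the regex and the "[TAG]" placebo marker follow the spirit of the paper's Study 3 placebo condition, but the exact patterns are assumptions.

```python
# Hypothetical label-blinding pass applied to judge prompts before evaluation.
import re

# Assumed disclosure wording; a real implementation would match the study's templates.
DISCLOSURE_PATTERN = re.compile(
    r"This answer was (written by a human expert|generated by an AI system)\.",
    re.IGNORECASE,
)

def blind_labels(prompt: str, mode: str = "remove") -> str:
    """mode='remove' deletes the disclosure; mode='placebo' swaps in a neutral marker."""
    if mode == "remove":
        return DISCLOSURE_PATTERN.sub("", prompt).strip()
    if mode == "placebo":
        return DISCLOSURE_PATTERN.sub("Source: [TAG].", prompt)
    raise ValueError(f"unknown mode: {mode}")
```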

Load-bearing premise

The experiment successfully keeps every factor except the source label identical across conditions, so no other cue influences the trust difference.

What would settle it

A replication in which the source label is removed or replaced with a neutral marker and the trust gap between the former human and AI conditions disappears for both humans and LLMs.
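
One way to score such a replication is a paired comparison of trust ratings across Human, AI, and neutral-label conditions, in the spirit of the paper's Study 3 and its nonparametric tests with FDR correction. The sketch below is a hedged outline only: the rating arrays are invented placeholders, not results, and the condition names are assumptions.

```python
# Hedged analysis sketch: pairwise comparison of trust ratings under three label conditions.
import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.multitest import multipletests

# Hypothetical per-item paired ratings; real data would come from the replication.
trust = {
    "Human":   np.array([6, 5, 6, 7, 5, 6, 6, 5]),
    "AI":      np.array([4, 4, 5, 5, 4, 5, 4, 4]),
    "Placebo": np.array([5, 5, 5, 6, 5, 5, 5, 5]),
}

pairs = [("Human", "AI"), ("Human", "Placebo"), ("AI", "Placebo")]
pvals = [wilcoxon(trust[a], trust[b]).pvalue for a, b in pairs]

# Benjamini-Hochberg FDR correction, matching the correction cited in the figures.
reject, p_adj, _, _ = multipletests(pvals, method="fdr_bh")
for (a, b), p, r in zip(pairs, p_adj, reject):
    print(f"{a} vs {b}: adjusted p = {p:.3f} {'(significant)' if r else '(ns)'}")
```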

Figures

Figures reproduced from arXiv: 2604.05593 by Abdallah El Ali, Di Wu, Isao Echizen, Saku Sugawara, Sijing Qin, Xin Sun.

Figure 1
Figure 1: Three studies examine how source labels affect trust judgments by an LLM-as-a-judge and investigate the …
Figure 2
Figure 2: An example heatmap of gaze points on stimuli that are displayed on the lab monitor during the human …
Figure 3
Figure 3
Figure 4
Figure 4: (Left) LLM attention allocation between two AoAs (label vs. content) and (Right) LLM logit entropy, across two label conditions (Human vs. AI). (**p<.01, *p<.05)
Figure 5
Figure 5: Analyses by GEE test (Hardin and Hilbe, 2012) with FDR correction (Haynes, 2013) of human gaze patterns (fixation count and fixation duration) in two AoIs. (**p<.01, *p<.05, "ns" not significant)
Figure 7
Figure 7: Instruction shown to the participants during …
Figure 8
Figure 8: Human trust ratings in Study 1. Top: trust score distributions for the four (2 × 2) conditions crossing true answer source (Human vs. LLM) and disclosed label (Human vs. AI). Bottom: main effects collapsed across the other factor: trust scores by true source (regardless of label) and by disclosed label (regardless of true source). Violins show density; red dots indicate means; black lines indicate medians…
Figure 9
Figure 9: Trust scores of LLM-as-a-Judge under three label conditions from Study 3 (Sec. 3.3). Mean trust ratings (with error bars) produced by each model when the same health QA content is shown across three labels: Human, AI, and a non-semantic Placebo label "[TAG]". Horizontal brackets indicate significant pairwise differences (***p < .001, **p < .01, *p < .05)
Figure 10
Figure 10: Attention allocation "LogRatio" between Label AoA and Content AoA across three label conditions. Across all models and conditions, LogRatio is consistently above zero, indicating label-dominant attention allocation at the judgment step. The placebo condition often elicits the largest LogRatio, suggesting that an underspecified label can attract extra processing to the label region…
Figure 11
Figure 11: Logit entropy at the judgment step under three label conditions. Across models, the AI label generally yields higher entropy than …
Figure 12
Figure 12: LLM attention distribution density across …
read the original abstract

Large language models (LLMs) are increasingly used as automated evaluators (LLM-as-a-Judge). This work challenges its reliability by showing that trust judgments by LLMs are biased by disclosed source labels. Using a counterfactual design, we find that both humans and LLM judges assign higher trust to information labeled as human-authored than to the same content labeled as AI-generated. Eye-tracking data reveal that humans rely heavily on source labels as heuristic cues for judgments. We analyze LLM internal states during judgment. Across label conditions, models allocate denser attention to the label region than the content region, and this label dominance is stronger under Human labels than AI labels, consistent with the human gaze patterns. Besides, decision uncertainty measured by logits is higher under AI labels than Human labels. These results indicate that the source label is a salient heuristic cue for both humans and LLMs. It raises validity concerns for label-sensitive LLM-as-a-Judge evaluation, and we cautiously raise that aligning models with human preferences may propagate human heuristic reliance into models, motivating debiased evaluation and alignment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper claims that source labels (human-authored vs. AI-generated) bias trust judgments in both humans and LLMs used as judges. Using a counterfactual design that holds content constant while manipulating only the label, the authors report higher trust ratings for human-labeled content. Human data include explicit ratings and eye-tracking fixation metrics showing heavy label reliance; LLM data include judgment outputs, denser attention weights on label tokens (stronger for human labels), and higher logit-based uncertainty for AI labels. These patterns are interpreted as evidence of shared heuristic reliance on source labels, raising concerns for the validity of LLM-as-a-Judge evaluations and potential propagation of biases via alignment.

Significance. If the central empirical result holds under the reported controls, the work is significant for AI evaluation research because it identifies a concrete, measurable bias that affects both human and model judgments in the same direction. The convergence between behavioral (eye-tracking) and internal-state (attention and uncertainty) measures provides a rare cross-species link between cognitive heuristics and model mechanisms, directly supporting the call for debiased evaluation protocols.

minor comments (2)
  1. Abstract: the summary of results would be strengthened by including at least the total sample sizes for human participants and LLM trials, along with the primary statistical test outcomes or effect sizes that support the trust difference claim.
  2. The description of LLM attention analysis should clarify how label-region attention weights are normalized and aggregated across layers to ensure direct comparability with the human eye-tracking fixation metrics.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our work, the accurate summary of our findings, and the recommendation for minor revision. We are pleased that the cross-species convergence between human behavioral measures and LLM internal states was recognized as significant for AI evaluation research.

Circularity Check

0 steps flagged

Empirical study with no derivational chain or self-referential reductions

full rationale

This paper reports an empirical observational study using counterfactual designs to compare trust judgments by humans and LLMs under manipulated source labels. No mathematical derivations, equations, fitted parameters, or predictions are present that could reduce to inputs by construction. Claims rest on experimental measurements (ratings, eye-tracking, attention weights, logits) with reported controls for confounds, and no load-bearing self-citations or uniqueness theorems are invoked to justify core results. The analysis is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of the counterfactual experimental design and the interpretation of eye-tracking and model internals as direct evidence of heuristic reliance. No free parameters, new entities, or non-standard axioms are mentioned.

axioms (1)
  • standard math: Standard assumptions of experimental psychology and statistical inference hold for trust ratings, gaze data, and model logits.
    Invoked implicitly to interpret differences across label conditions as evidence of heuristic reliance.

pith-pipeline@v0.9.0 · 5500 in / 1253 out tokens · 69878 ms · 2026-05-10T19:34:29.741413+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

55 extracted references · 36 canonical work pages · 5 internal anchors

  1. [1]

    Tobii AB. 2024. Tobii Pro Lab. Computer software. http://www.tobii.com/

  2. [2]

    Benjamin R Bates, Sharon Romina, Rukhsana Ahmed, and Danielle Hopson. 2006. The effect of source credibility on consumers' perceptions of the quality of health information on the internet. Medical informatics and the Internet in medicine, 31(1):45--52

  3. [3]

    Oliver Brady, Paul Nulty, Lili Zhang, Tomás E Ward, and David P McGovern. 2025. Dual-process theory and decision-making in large language models. Nat. Rev. Psychol., 4(12):777--792

  4. [4]

    John T. Cacioppo, Louis G. Tassinary, and Gary G. Berntson. 2016. Strong Inference in Psychophysiological Science, page 3–15. Cambridge Handbooks in Psychology. Cambridge University Press

  5. [5]

    Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, and Benyou Wang. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.474 Humans or LLMs as the judge? A study on judgement bias. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8301--8327, Miami, Florida, USA. Association for Computational Linguistics

  6. [6]

    Vanessa Cheung, Maximilian Maier, and Falk Lieder. 2025. https://doi.org/10.1073/pnas.2412015122 Large language models show amplified cognitive biases in moral decision-making . Proceedings of the National Academy of Sciences, 122(25):e2412015122

  7. [7]

    Sunhao Dai, Yuqi Zhou, Liang Pang, Weihao Liu, Xiaolin Hu, Yong Liu, Xiao Zhang, Gang Wang, and Jun Xu. 2024. https://doi.org/10.1145/3637528.3671882 Neural retrievers are biased towards llm-generated content . In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD '24, page 526–537, New York, NY, USA. Association for...

  8. [8]

    Ricardo Dominguez-Olmedo, Moritz Hardt, and Celestine Mendler-Dünner. 2024. Questioning the survey responses of large language models. In Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS '24, Red Hook, NY, USA. Curran Associates Inc

  9. [9]

    Jessica Maria Echterhoff, Yao Liu, Abeer Alessa, Julian McAuley, and Zexue He. 2024. https://doi.org/10.18653/v1/2024.findings-emnlp.739 Cognitive bias in decision-making with LLMs. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 12640--12653, Miami, Florida, USA. Association for Computational Linguistics

  10. [10]

    Abdallah El Ali, Karthikeya Puttur Venkatraj, Sophie Morosoli, Laurens Naudts, Natali Helberger, and Pablo Cesar. 2024. https://doi.org/10.1145/3613905.3650750 Transparent ai disclosure obligations: Who, what, when, where, why, how . In Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, CHI EA '24, New York, NY, USA. Associati...

  11. [11]

    Franz Faul, Edgar Erdfelder, Albert-Georg Lang, and Axel Buchner. 2007. G*Power 3: a flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behav. Res. Methods, 39(2):175--191

  12. [12]

    Bertram Gawronski, Dillon M Luke, and Laura A Creighton. 2024. Dual-process theories. In The Oxford Handbook of Social Cognition, Second Edition, pages 319--353. Oxford University Press

  13. [13]

    Ellen R Girden. 1992. ANOVA: Repeated measures. 84. Sage

  14. [14]

    Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zhang, Yuanzhuo Wang, Wen Gao, Lionel Ni, and Jian Guo. 2025. https://arxiv.org/abs/2411.15594 A survey on llm-as-a-judge . Preprint, arXiv:2411.15594

  15. [15]

    Rajarshi Haldar and Julia Hockenmaier. 2025. https://doi.org/10.18653/v1/2025.findings-emnlp.1361 Rating roulette: Self-inconsistency in LLM-as-a-judge frameworks. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 24986--25004, Suzhou, China. Association for Computational Linguistics

  16. [16]

    James W Hardin and Joseph M Hilbe. 2012. Generalized estimating equations, 2nd edition. Chapman & Hall/CRC, Philadelphia, PA

  17. [17]

    Winston Haynes. 2013. https://doi.org/10.1007/978-1-4419-9863-7_1215 Benjamini--Hochberg Method, page 78. Springer New York, New York, NY

  18. [18]

    Maurice Jakesch, Megan French, Xiao Ma, Jeffrey T. Hancock, and Mor Naaman. 2019. https://doi.org/10.1145/3290605.3300469 Ai-mediated communication: How the perception that profile text was written by ai affects trustworthiness . In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI '19, page 1–13, New York, NY, USA. Associa...

  19. [19]

    Ruili Jiang, Kehai Chen, Xuefeng Bai, Zhixuan He, Juntao Li, Muyun Yang, Tiejun Zhao, Liqiang Nie, and Min Zhang. 2024. A survey on human preference learning for large language models. arXiv preprint arXiv:2406.11191

  20. [20]

    Frances C. Johnson, Jennifer E. Rowley, and Laura Sbaffi. 2015. https://api.semanticscholar.org/CorpusID:206454953 Modelling trust formation in health information contexts . Journal of Information Science, 41:415 -- 429

  21. [21]

    Marcel A Just and Patricia A Carpenter. 1980. A theory of reading: From eye fixations to comprehension. Psychol. Rev., 87(4):329--354

  22. [22]

    Tatsuki Kuribayashi, Yohei Oseki, Souhaib Ben Taieb, Kentaro Inui, and Timothy Baldwin. 2025. https://doi.org/10.1162/TACL.a.58 Large language models are human-like internally . Transactions of the Association for Computational Linguistics, 13:1743--1766

  23. [23]

    Walter Laurito, Benjamin Davis, Peli Grietzer, Tomáš Gavenčiak, Ada Böhm, and Jan Kulveit. 2025. https://doi.org/10.1073/pnas.2415697122 Ai–ai bias: Large language models favor communications generated by large language models . Proceedings of the National Academy of Sciences, 122(31):e2415697122

  24. [24]

    Dawei Li, Bohan Jiang, Liangjie Huang, Alimohammad Beigi, Chengshuai Zhao, Zhen Tan, Amrita Bhattacharjee, Yuxuan Jiang, Canyu Chen, Tianhao Wu, Kai Shu, Lu Cheng, and Huan Liu. 2025a. https://doi.org/10.18653/v1/2025.emnlp-main.138 From generation to judgment: Opportunities and challenges of LLM-as-a-judge. In Proceedings of the 2025 Conference on Em...

  25. [25]

    Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, and Yiqun Liu. 2024. Llms-as-judges: a comprehensive survey on llm-based evaluation methods. arXiv preprint arXiv:2412.05579

  26. [26]

    Qingquan Li, Shaoyu Dou, Kailai Shao, Chao Chen, and Haixiang Hu. 2025b. https://arxiv.org/abs/2506.22316 Evaluating scoring bias in llm-as-a-judge. Preprint, arXiv:2506.22316

  27. [27]

    Songze Li, Chuokun Xu, Jiaying Wang, Xueluan Gong, Chen Chen, Jirui Zhang, Jun Wang, Kwok-Yan Lam, and Shouling Ji. 2025c. https://arxiv.org/abs/2506.09443 Llms cannot reliably judge (yet?): A comprehensive assessment on the robustness of llm-as-a-judge. Preprint, arXiv:2506.09443

  28. [28]

    Q. Vera Liao and S. Shyam Sundar. 2022. https://doi.org/10.1145/3531146.3533182 Designing for responsible trust in ai systems: A communication perspective. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, FAccT '22, page 1257–1268, New York, NY, USA. Association for Computing Machinery

  29. [29]

    Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. https://arxiv.org/abs/2303.16634 G-eval: Nlg evaluation using gpt-4 with better human alignment . Preprint, arXiv:2303.16634

  30. [30]

    Joao Marecos, Duarte Tude Graça, Francisco Goiana-da Silva, Hutan Ashrafian, and Ara Darzi. 2024. https://doi.org/10.3390/journalmedia5020046 Source credibility labels and other nudging interventions in the context of online health misinformation: A systematic literature review . Journalism and Media, 5(2):702--717

  31. [31]

    Arash Marioriyad, Mohammad Hossein Rohban, and Mahdieh Soleymani Baghshah. 2025. https://arxiv.org/abs/2509.26072 The silent judge: Unacknowledged shortcut bias in llm-as-a-judge . Preprint, arXiv:2509.26072

  32. [32]

    OpenAI. 2024. https://arxiv.org/abs/2303.08774 Gpt-4 technical report . Preprint, arXiv:2303.08774

  33. [33]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, and 1 others. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730--27744

  34. [34]

    Jonathan Peirce, Jeremy R Gray, Sol Simpson, Michael MacAskill, Richard Höchenberger, Hiroyuki Sogo, Erik Kastman, and Jonas Kristoffer Lindeløv. 2019. PsychoPy2: Experiments in behavior made easy. Behavior Research Methods, 51(1):195--203

  35. [35]

    Moritz Reis, Florian Reis, and Wilfried Kunde. 2024. Influence of believed AI involvement on the perception of digital medical advice. Nature Medicine

  36. [36]

    Bernard Rosner, Robert J Glynn, and Mei-Ling T Lee. 2006. The wilcoxon signed rank test for paired comparisons of clustered data. Biometrics, 62(1):185--192

  37. [37]

    Jennifer E. Rowley, Frances C. Johnson, and Laura Sbaffi. 2015. https://api.semanticscholar.org/CorpusID:21888204 Students’ trust judgements in online health information seeking . Health Informatics Journal, 21:316 -- 327

  38. [38]

    Nicolas Scharowski, Michaela Benk, Swen J. Kühne, Léane Wettstein, and Florian Brühlmann. 2023. https://doi.org/10.1145/3593013.3593994 Certification Labels for Trustworthy AI: Insights From an Empirical Mixed-Method Study. In 2023 ACM Conference on Fairness, Accountability, and Transparency, pages 248--260, Chicago IL USA. ACM

  39. [39]

    Kayla Schroeder and Zach Wood-Doughty. 2025. https://arxiv.org/abs/2412.12509 Can you trust llm judgments? reliability of llm-as-a-judge . Preprint, arXiv:2412.12509

  40. [40]

    Huanxin Sheng, Xinyi Liu, Hangfeng He, Jieyu Zhao, and Jian Kang. 2025. https://doi.org/10.18653/v1/2025.emnlp-main.569 Analyzing uncertainty of LLM-as-a-judge: Interval evaluations with conformal prediction. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 11297--11339, Suzhou, China. Association for Comp...

  41. [41]

    Lin Shi, Chiyu Ma, Wenhua Liang, Xingjian Diao, Weicheng Ma, and Soroush Vosoughi. 2025. https://arxiv.org/abs/2406.07791 Judging the judges: A systematic study of position bias in llm-as-a-judge . Preprint, arXiv:2406.07791

  42. [42]

    Evangelia Spiliopoulou, Riccardo Fogliato, Hanna Burnsky, Tamer Soliman, Jie Ma, Graham Horwood, and Miguel Ballesteros. 2025. https://arxiv.org/abs/2508.06709 Play favorites: A statistical method to measure self-bias in llm-as-a-judge . Preprint, arXiv:2508.06709

  43. [43]

    Lindia Tjuatja, Valerie Chen, Tongshuang Wu, Ameet Talwalkwar, and Graham Neubig. 2024. https://doi.org/10.1162/tacl_a_00685 Do LLMs exhibit human-like response biases? A case study in survey design. Transactions of the Association for Computational Linguistics, 12:1011--1026

  44. [44]

    Qian Wang, Zhanzhi Lou, Zhenheng Tang, Nuo Chen, Xuandong Zhao, Wenxuan Zhang, Dawn Song, and Bingsheng He. 2025a. https://arxiv.org/abs/2504.09946 Assessing judging bias in large reasoning models: An empirical study. Preprint, arXiv:2504.09946

  45. [45]

    Yidong Wang, Yunze Song, Tingyuan Zhu, Xuanwang Zhang, Zhuohao Yu, Hao Chen, Chiyu Song, Qiufeng Wang, Cunxiang Wang, Zhen Wu, Xinyu Dai, Yue Zhang, Wei Ye, and Shikun Zhang. 2025b. https://arxiv.org/abs/2509.21117 Trustjudge: Inconsistencies of llm-as-a-judge and how to alleviate them. Preprint, arXiv:2509.21117

  46. [46]

    Koki Wataoka, Tsubasa Takahashi, and Ryokan Ri. 2025. https://arxiv.org/abs/2410.21819 Self-preference bias in llm-as-a-judge . Preprint, arXiv:2410.21819

  47. [47]

    Sarah Wiegreffe and Yuval Pinter. 2019. https://doi.org/10.18653/v1/D19-1002 Attention is not not explanation . In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 11--20, Hong Kong, China. Association for Computational Linguistics

  48. [48]

    Chengxing Xie, Canyu Chen, Feiran Jia, Ziyu Ye, Shiyang Lai, Kai Shu, Jindong Gu, Adel Bibi, Ziniu Hu, David Jurgens, James Evans, Philip H.S. Torr, Bernard Ghanem, and Guohao Li. 2024. Can large language model agents simulate human trust behavior? In Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS '24, Red ...

  49. [49]

    Shweta Yadav, Deepak Gupta, and Dina Demner-Fushman. 2022. https://arxiv.org/abs/2206.06581 Chq-summ: A dataset for consumer healthcare question summarization . Preprint, arXiv:2206.06581

  50. [50]

    Haoyan Yang, Runxue Bao, Cao Xiao, Jun Ma, Parminder Bhatia, Shangqian Gao, and Taha Kass-Hout. 2025. https://arxiv.org/abs/2505.17100 Any large language model can be a reliable judge: Debiasing with a reasoning-based bias detector . Preprint, arXiv:2505.17100

  51. [51]

    Jiayi Ye, Yanbo Wang, Yue Huang, Dongping Chen, Qihui Zhang, Nuno Moniz, Tian Gao, Werner Geyer, Chao Huang, Pin-Yu Chen, Nitesh V Chawla, and Xiangliang Zhang. 2024. https://arxiv.org/abs/2410.02736 Justice or prejudice? quantifying biases in llm-as-a-judge . Preprint, arXiv:2410.02736

  52. [52]

    Yidan Yin, Nan Jia, and Cheryl J. Wakslak. 2024. https://doi.org/10.1073/pnas.2319112121 Ai can help people feel heard, but an ai label diminishes this impact . Proceedings of the National Academy of Sciences, 121(14):e2319112121

  53. [53]

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. https://arxiv.org/abs/2306.05685 Judging llm-as-a-judge with mt-bench and chatbot arena . Preprint, arXiv:2306.05685


    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...