LLM Evaluators Recognize and Favor Their Own Generations
Pith reviewed 2026-05-22 18:40 UTC · model grok-4.3
The pith
LLMs can identify their own generations and this recognition causes them to score those outputs higher than equivalent text from other sources.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Out of the box, LLMs such as GPT-4 and Llama 2 have non-trivial accuracy at distinguishing themselves from other LLMs and humans. Fine-tuning reveals a linear correlation between self-recognition capability and the strength of self-preference bias. Controlled experiments show the causal explanation resists straightforward confounders.
What carries the argument
Self-recognition capability, defined as the accuracy with which an LLM classifies a given text sample as having been generated by itself versus by another source.
If this is right
- Self-preference bias will appear in reward modeling and constitutional AI whenever the same model generates and judges content.
- Benchmarking that uses LLM judges will systematically over-rate outputs matching the judge's own generation style.
- AI safety evaluations relying on self-evaluation risk under-valuing safety properties that differ from the evaluator's own patterns.
- Unbiased automated evaluation requires either separate models for generation and judging or explicit controls that block self-recognition.
Where Pith is reading between the lines
- Training procedures that deliberately obscure a model's own stylistic fingerprints could reduce self-preference without harming other capabilities.
- The same recognition mechanism may create broader familiarity biases when models evaluate any content drawn from distributions they have seen during training.
- Analogous tests could check whether human evaluators show similar preference effects when scoring text from sources whose style they have internalized.
Load-bearing premise
The fine-tuning procedure and controlled experiments isolate self-recognition as the causal driver of self-preference rather than correlated changes in output style, length, or topic distribution.
What would settle it
An experiment that increases self-recognition accuracy through fine-tuning yet leaves self-preference scores unchanged would falsify the claimed causal link.
read the original abstract
Self-evaluation using large language models (LLMs) has proven valuable not only in benchmarking but also methods like reward modeling, constitutional AI, and self-refinement. But new biases are introduced due to the same LLM acting as both the evaluator and the evaluatee. One such bias is self-preference, where an LLM evaluator scores its own outputs higher than others' while human annotators consider them of equal quality. But do LLMs actually recognize their own outputs when they give those texts higher scores, or is it just a coincidence? In this paper, we investigate if self-recognition capability contributes to self-preference. We discover that, out of the box, LLMs such as GPT-4 and Llama 2 have non-trivial accuracy at distinguishing themselves from other LLMs and humans. By fine-tuning LLMs, we discover a linear correlation between self-recognition capability and the strength of self-preference bias; using controlled experiments, we show that the causal explanation resists straightforward confounders. We discuss how self-recognition can interfere with unbiased evaluations and AI safety more generally.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LLMs such as GPT-4 and Llama 2 exhibit non-trivial out-of-the-box accuracy in distinguishing their own generations from those of other LLMs and humans. Fine-tuning on a self-recognition objective produces a linear correlation between improved self-recognition capability and the strength of self-preference bias in evaluations. Controlled experiments are presented to argue that this relationship is causal and resists straightforward confounders such as changes in output style or distribution.
Significance. If the causal link between self-recognition and self-preference holds after rigorous controls, the result would be significant for LLM-based evaluation pipelines used in benchmarking, reward modeling, constitutional AI, and self-refinement. It would highlight a previously under-examined source of bias that could affect the reliability of automated evaluations and raise implications for AI safety when models evaluate their own outputs.
major comments (2)
- [§4] §4 (Controlled Experiments): The abstract states that controlled experiments demonstrate the causal link resists straightforward confounders, yet the precise controls (length-matched sampling, regression on lexical diversity or perplexity, topic entropy matching, or style-feature covariates) are not enumerated with sufficient detail or statistical reporting. Without these specifics, the linear correlation obtained via fine-tuning could still be driven by correlated shifts in generation properties rather than recognition per se.
- [§3.2] §3.2 (Fine-tuning Procedure): The fine-tuning objective for self-recognition is described at a high level, but the paper does not report whether generation-length, token-distribution, or stylistic statistics were explicitly regularized or measured before and after fine-tuning. If these properties change systematically, they constitute a plausible alternative driver of the observed preference scores.
minor comments (2)
- [Results] Table 1 or equivalent results table: report exact sample sizes, number of generations per model, and confidence intervals or p-values for the out-of-the-box discrimination accuracies.
- [Results] Figure 2 (correlation plot): clarify whether the x-axis (self-recognition accuracy) and y-axis (self-preference delta) are computed on held-out data or on the fine-tuning distribution.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. The comments have prompted us to strengthen the exposition of our controls and fine-tuning measurements. We address each major comment below and have prepared revisions that directly incorporate the requested details and statistical reporting.
read point-by-point responses
-
Referee: [§4] §4 (Controlled Experiments): The abstract states that controlled experiments demonstrate the causal link resists straightforward confounders, yet the precise controls (length-matched sampling, regression on lexical diversity or perplexity, topic entropy matching, or style-feature covariates) are not enumerated with sufficient detail or statistical reporting. Without these specifics, the linear correlation obtained via fine-tuning could still be driven by correlated shifts in generation properties rather than recognition per se.
Authors: We agree that the current description of the controls in §4 lacks the granularity needed to fully address potential alternative explanations. In the revised manuscript we have expanded this section to enumerate the controls explicitly and to report the associated statistics. Specifically, we applied length-matched sampling by restricting comparisons to generation pairs whose token lengths differed by at most 5 %; we performed ordinary-least-squares regressions that included type-token ratio and perplexity as covariates; we matched generations on topic entropy derived from LDA models; and we added style-feature covariates (average sentence length, punctuation density, and vocabulary richness) to the preference-score models. The revised text now includes a table of regression results showing that the coefficient on self-recognition accuracy remains positive and significant (p < 0.01) after inclusion of these controls. These additions directly respond to the concern that generation-property shifts could drive the observed correlation. revision: yes
-
Referee: [§3.2] §3.2 (Fine-tuning Procedure): The fine-tuning objective for self-recognition is described at a high level, but the paper does not report whether generation-length, token-distribution, or stylistic statistics were explicitly regularized or measured before and after fine-tuning. If these properties change systematically, they constitute a plausible alternative driver of the observed preference scores.
Authors: We acknowledge that §3.2 would benefit from explicit reporting of these statistics. In the revision we have added a paragraph and an accompanying supplementary table that document the measurements taken before and after fine-tuning. No explicit regularization on length, token distribution, or style was applied during fine-tuning, in order to preserve the model’s natural generation behavior. Post-hoc checks nevertheless show that mean generation length changed by fewer than three tokens, KL divergence between pre- and post-fine-tuning token distributions remained below 0.05, and differences in stylistic metrics (Flesch reading-ease score and type-token ratio) were statistically non-significant (two-sample t-tests, all p > 0.1). These results are now reported so that readers can evaluate whether systematic distributional shifts could explain the preference-score changes. revision: yes
Circularity Check
Empirical study with external benchmarks; no derivation reduces to inputs by construction
full rationale
The paper conducts an empirical investigation using out-of-the-box discrimination accuracy on model outputs, fine-tuning to observe correlations, and controlled experiments comparing to human annotations. No mathematical derivation chain exists that equates a 'prediction' or result to its inputs by definition or self-citation. Fine-tuning introduces hyperparameters but functions as an experimental manipulation rather than a fitted parameter renamed as a prediction. Central claims rely on direct comparisons to external human judgments and other LLMs, making the work self-contained against benchmarks. Any self-citations are not load-bearing for the empirical findings.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption LLM outputs can be meaningfully compared for quality by both the model itself and human annotators under the same rubric
Forward citations
Cited by 27 Pith papers
-
Generative-Evaluative Agreement: A Necessary Validity Criterion for LLM-Enabled Adaptive Assessment
Defines GEA validity criterion and reports first measurement of r=0.698 recovery with positive bias in LLM two-stage adaptive assessment, stronger for verifiable skills.
-
DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows
DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under p...
-
Likelihood scoring for continuations of mathematical text: a self-supervised benchmark with tests for shortcut vulnerabilities
Presents a likelihood-based benchmark for equation-suffix prediction in technical papers with controls to detect shortcut vulnerabilities in model forecasts.
-
The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice
An identification theorem shows that a randomized experiment and simulator together recover causal model values from confounded logs, with logs used only afterward to reduce estimation error.
-
SycoPhantasy: Quantifying Sycophancy and Hallucination in Small Open Weight VLMs for Vision-Language Scoring of Fantasy Characters
Small VLMs show higher sycophancy (22.3% for 450M model) than larger ones (6.0% for 7B) when scoring image-text alignment on 173k fantasy portraits, quantified via a new Bluffing Coefficient metric.
-
Self-Preference Bias in Rubric-Based Evaluation of Large Language Models
Rubric-based LLM judges show self-preference bias, incorrectly marking their own failed outputs as satisfied up to 50% more often on verifiable benchmarks and skewing scores by 10 points on subjective ones.
-
AMEL: Accumulated Message Effects on LLM Judgments
LLMs exhibit an accumulated message effect where conversation history saturated with positive or negative evaluations biases subsequent judgments, with larger shifts on uncertain items, a negativity asymmetry, and no ...
-
Dimension-Level Intent Fidelity Evaluation for Large Language Models: Evidence from Structured Prompt Ablation
Dimension-level evaluation reveals that 25-58% of LLM outputs with perfect holistic scores still show measurable intent deficits across languages and domains.
-
Likelihood scoring for continuations of mathematical text: a self-supervised benchmark with tests for shortcut vulnerabilities
A new benchmark uses separate predictor and scorer LLMs to test whether forecast strings improve likelihood of hidden mathematical equation continuations, with controls that detect priming shortcuts.
-
Automated alignment is harder than you think
AI agents automating alignment research are prone to systematic undetected errors in fuzzy tasks, leading to overconfident but flawed safety assessments even without deliberate sabotage.
-
Automated alignment is harder than you think
Automating alignment research with AI agents risks undetected systematic errors in fuzzy tasks, producing overconfident but misleading safety evaluations that could enable deployment of misaligned AI.
-
Automated alignment is harder than you think
Automating alignment research with AI agents risks generating hard-to-detect errors in fuzzy tasks, producing misleading safety evaluations even without deliberate sabotage.
-
RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization
RLearner-LLM's Hybrid-DPO fuses DeBERTa NLI and LLM verifier scores to deliver up to 6x higher NLI entailment than standard SFT while preserving answer coverage across academic domains.
-
RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization
RLearner-LLM achieves up to 6x gains in NLI entailment over standard fine-tuning by using an automated hybrid DPO pipeline that balances logic and fluency across multiple model sizes and domains.
-
When the Forger Is the Judge: GPT-Image-2 Cannot Recognize Its Own Faked Documents
GPT-Image-2 document forgeries evade human and computational detection while traditional tampering remains detectable, with the model itself failing as a self-judge.
-
Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules
Language models refuse 75.4% of requests to evade defeated rules and do so even after recognizing reasons that undermine the rule's legitimacy.
-
Self-Preference Bias in LLM-as-a-Judge
LLMs judge their own outputs higher because they assign better scores to lower-perplexity text, even when the text is not self-generated.
-
Compiling Agentic Workflows into LLM Weights: Near-Frontier Quality at Two Orders of Magnitude Less Cost
Compiling agentic workflows into LLM weights creates subterranean agents with near-frontier quality at two orders of magnitude less cost, validated empirically on travel booking, Zoom support, and insurance claims tasks.
-
Pramana: A Protocol-Layer Treatment of Claim Verification in Autonomous Agent Networks
Pramana defines a typed ClaimAttestation protocol with four variants and verify operations, specifies its lifecycle in TLA+, model-checks it with TLC, and provides a tested Python implementation for auditable agent claims.
-
Position: Uncertainty Quantification in LLMs is Just Unsupervised Clustering
Mainstream UQ for LLMs reduces to unsupervised clustering of internal generation consistency and therefore cannot detect confident hallucinations or provide reliable safety signals.
-
RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization
Hybrid-DPO combining NLI and verifier scores delivers up to 6x NLI improvement over SFT baselines across multiple LLMs and domains while preserving answer coverage and inference speed.
-
STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator
STELLAR-E modifies the TGRT Self-Instruct framework to produce tailored synthetic LLM evaluation datasets that score an average 5.7% higher on LLM-as-a-judge metrics than existing language-specific benchmarks.
-
Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines
Style bias dominates LLM-as-a-Judge systems far more than position bias, with debiasing strategies providing model-dependent gains and public tools released for replication.
-
RAG-DIVE: A Dynamic Approach for Multi-Turn Dialogue Evaluation in Retrieval-Augmented Generation
RAG-DIVE uses an LLM to dynamically generate, validate, and evaluate multi-turn dialogues for assessing RAG system performance in interactive settings.
-
Synthetic Eggs in Many Baskets: The Impact of Synthetic Data Diversity on LLM Fine-Tuning
Fine-tuning LLMs on multi-source synthetic data mitigates distribution collapse and self-preference bias while increasing output quality relative to single-source or human-only fine-tuning.
-
Multi-Stage Retrieval for Operational Technology Cybersecurity Compliance Using Large Language Models: A Railway Casestudy
A parallel compliance architecture using multi-stage LLM retrieval improves correctness and reasoning quality over a baseline for OT cybersecurity compliance queries in a railway case study.
-
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.
Reference graph
Works this paper leans on
-
[1]
Knowl- edge of knowledge: Exploring known-unknowns un- certainty with large language models
Amayuelas, A., Pan, L., Chen, W., and Wang, W. Knowl- edge of knowledge: Exploring known-unknowns un- certainty with large language models. arXiv preprint arXiv:2305.13712,
-
[2]
Constitutional AI: Harmlessness from AI Feedback
Bai, Y ., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKin- non, C., et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073,
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
C., Balesni, M., Kaufmann, M., Tong, M., Korbak, T., Kokotajlo, D., and Evans, O
Berglund, L., Stickland, A. C., Balesni, M., Kaufmann, M., Tong, M., Korbak, T., Kokotajlo, D., and Evans, O. Taken out of context: On measuring situational awareness in llms. arXiv preprint arXiv:2309.00667,
-
[4]
Visit-bench: A benchmark for vision-language instruction following inspired by real-world use,
Bitton, Y ., Bansal, H., Hessel, J., Shao, R., Zhu, W., Awadalla, A., Gardner, J., Taori, R., and Schimdt, L. Visit-bench: A benchmark for vision-language instruc- tion following inspired by real-world use. Advances in Neural Information Processing Systems, 2023a. Bitton, Y ., Bansal, H., Hessel, J., Shao, R., Zhu, W., Awadalla, A., Gardner, J., Taori, R....
-
[5]
ISSN 1939-1854. doi: 10.1037/h0057532. Place: US Publisher: American Psy- chological Association. Fu, J., Ng, S.-K., Jiang, Z., and Liu, P. GPTScore: Evaluate as You Desire, February
-
[6]
GPTScore: Evaluate as You Desire
URL http://arxiv. org/abs/2302.04166. arXiv:2302.04166 [cs]. Hackl, V ., M¨uller, A. E., Granitzer, M., and Sailer, M. Is GPT-4 a reliable rater? Evaluating Consistency in GPT-4 Text Ratings. Frontiers in Education, 8:1272229, Decem- ber
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
ISSN 2504-284X. doi: 10.3389/feduc.2023. 1272229. URL http://arxiv.org/abs/2308. 02575. arXiv:2308.02575 [cs]. Hans, A., Schwarzschild, A., Cherepanova, V ., Kazemi, H., Saha, A., Goldblum, M., Geiping, J., and Gold- stein, T. Spotting llms with binoculars: Zero-shot detection of machine-generated text. arXiv preprint arXiv:2401.12070,
-
[8]
Jawahar, G., Abdul-Mageed, M., and Lakshmanan, L
Research submission to the Evals research sprint hosted by Apart Research. Jawahar, G., Abdul-Mageed, M., and Lakshmanan, L. V . Automatic detection of machine generated text: A critical survey. arXiv preprint arXiv:2011.01314,
-
[9]
Language Models (Mostly) Know What They Know
Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., Schiefer, N., Hatfield-Dodds, Z., DasSarma, N., Tran-Johnson, E., et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221,
work page internal anchor Pith review Pith/arXiv arXiv
-
[10]
Koo, R., Lee, M., Raheja, V ., Park, J. I., Kim, Z. M., and Kang, D. Benchmarking cognitive biases in large language models as evaluators. arXiv preprint arXiv:2309.17012,
-
[11]
A survey of ai-generated text forensic systems: Detection, attribution, and characteriza- tion
Kumarage, T., Agrawal, G., Sheth, P., Moraffah, R., Chadha, A., Garland, J., and Liu, H. A survey of ai-generated text forensic systems: Detection, attribution, and characteriza- tion. arXiv preprint arXiv:2403.01152,
-
[12]
RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
Lee, H., Phatale, S., Mansoor, H., Lu, K., Mesnard, T., Bishop, C., Carbune, V ., and Rastogi, A. RLAIF: Scal- ing reinforcement learning from human feedback with ai feedback. arXiv preprint arXiv:2309.00267,
work page internal anchor Pith review Pith/arXiv arXiv
-
[13]
Scalable agent alignment via reward modeling: a research direction
Leike, J., Krueger, D., Everitt, T., Martic, M., Maini, V ., and Legg, S. Scalable agent alignment via reward modeling: a research direction. arXiv preprint arXiv:1811.07871,
work page internal anchor Pith review Pith/arXiv arXiv
-
[14]
original-date: 2023-05- 25T09:35:28Z
URL https://github.com/ tatsu-lab/alpaca_eval. original-date: 2023-05- 25T09:35:28Z. Liu, Y ., Moosavi, N. S., and Lin, C. LLMs as Narcissis- tic Evaluators: When Ego Inflates Evaluation Scores, November
work page 2023
-
[15]
URL https://arxiv.org/abs/ 2311.09766v1. Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y ., Gupta, S., Majumder, B. P., Hermann, K., Welleck, S., Yazdanbakhsh, A., and Clark, P. Self-Refine: Iterative Re- finement with Self-Feedback, May
-
[16]
Self-Refine: Iterative Refinement with Self-Feedback
URLhttp:// arxiv.org/abs/2303.17651. arXiv:2303.17651 [cs]. Mitchell, E., Lee, Y ., Khazatsky, A., Manning, C. D., and Finn, C. Detectgpt: Zero-shot machine-generated text de- tection using probability curvature. In International Con- ference on Machine Learning, pp. 24950–24962. PMLR,
work page internal anchor Pith review Pith/arXiv arXiv
-
[17]
Association for Computational Linguistics. doi: 10.18653/v1/K16-1028. URL https://aclanthology.org/K16-1028. Narayan, S., Cohen, S. B., and Lapata, M. Don’t Give Me the Details, Just the Summary! Topic-Aware Convolu- tional Neural Networks for Extreme Summarization, Au- 9 LLM Evaluators Recognize and Favor Their Own Generations gust
-
[18]
URL http://arxiv.org/abs/1808. 08745. arXiv:1808.08745 [cs] version:
work page internal anchor Pith review Pith/arXiv arXiv
-
[19]
OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774,
work page internal anchor Pith review Pith/arXiv arXiv
-
[20]
Feed- back loops with language models drive in-context reward hacking
Pan, A., Jones, E., Jagadeesan, M., and Steinhardt, J. Feed- back loops with language models drive in-context reward hacking. arXiv preprint arXiv:2402.06627,
- [21]
-
[22]
URL http://arxiv.org/ abs/2308.11483. arXiv:2308.11483 [cs]. Raina, V ., Liusie, A., and Gales, M. Is llm-as-a-judge ro- bust? investigating universal adversarial attacks on zero- shot llm assessment. arXiv preprint arXiv:2402.14016,
-
[23]
Self-critiquing models for assisting human evaluators
Saunders, W., Yeh, C., Wu, J., Bills, S., Ouyang, L., Ward, J., and Leike, J. Self-critiquing models for assisting human evaluators. arXiv preprint arXiv:2206.05802,
work page internal anchor Pith review Pith/arXiv arXiv
-
[24]
URL http://arxiv.org/ abs/2310.07611. arXiv:2310.07611 [cs]. Shridhar, K., Sinha, K., Cohen, A., Wang, T., Yu, P., Pa- sunuru, R., Sachan, M., Weston, J., and Celikyilmaz, A. The art of llm refinement: Ask, refine, and trust. arXiv preprint arXiv:2311.07961,
-
[25]
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
URL http:// arxiv.org/abs/2206.04615. arXiv:2206.04615 [cs, stat]. Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., V oss, C., Radford, A., Amodei, D., and Christiano, P. F. Learning to summarize with human feedback. Ad- vances in Neural Information Processing Systems , 33: 3008–3021,
work page internal anchor Pith review Pith/arXiv arXiv
-
[26]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y ., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine- tuned chat models. arXiv preprint arXiv:2307.09288 ,
work page internal anchor Pith review Pith/arXiv arXiv
-
[27]
Wang, Y ., Liao, Y ., Liu, H., Liu, H., Wang, Y ., and Wang, Y . Mm-sap: A comprehensive benchmark for assessing self-awareness of multimodal large language models in perception. arXiv preprint arXiv:2401.07529,
-
[28]
Recursively Summarizing Books with Human Feedback
Wu, J., Ouyang, L., Ziegler, D. M., Stiennon, N., Lowe, R., Leike, J., and Christiano, P. Recursively summa- rizing books with human feedback. arXiv preprint arXiv:2109.10862,
work page internal anchor Pith review Pith/arXiv arXiv
-
[29]
Wu, J., Yang, S., Zhan, R., Yuan, Y ., Wong, D. F., and Chao, L. S. A survey on llm-gernerated text detection: Necessity, methods, and future directions. arXiv preprint arXiv:2310.14724,
- [30]
-
[31]
Yang, X., Pan, L., Zhao, X., Chen, H., Petzold, L., Wang, W. Y ., and Cheng, W. A survey on detection of llms- generated content. arXiv preprint arXiv:2310.15654 ,
-
[32]
Do large language models know what they don’t know? arXiv preprint arXiv:2305.18153,
Yin, Z., Sun, Q., Guo, Q., Wu, J., Qiu, X., and Huang, X. Do large language models know what they don’t know? arXiv preprint arXiv:2305.18153,
- [33]
-
[34]
URL http:// arxiv.org/abs/2310.07641. arXiv:2310.07641 [cs]. Zheng, L., Chiang, W.-L., Sheng, Y ., Zhuang, S., Wu, Z., Zhuang, Y ., Lin, Z., Li, Z., Li, D., Xing, E., et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Ad- vances in Neural Information Processing Systems , 36,
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.