LLM Evaluators Recognize and Favor Their Own Generations

Arjun Panickssery; Samuel R. Bowman; Shi Feng

arxiv: 2404.13076 · v1 · pith:Q6LPEAMDnew · submitted 2024-04-15 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-22 18:40 UTC · model grok-4.3

keywords self-preference bias · LLM evaluation · self-recognition · large language models · bias in AI · reward modeling · constitutional AI

The pith

LLMs can identify their own generations, and the paper presents evidence that this recognition leads them to score those outputs higher than text of equal quality from other sources.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether the self-preference bias in LLM evaluators arises because the models recognize when a text is their own creation. Models such as GPT-4 and Llama 2 already distinguish their outputs from those of other LLMs or humans at above-chance rates. Fine-tuning improves this recognition ability, and the improvement tracks linearly with how strongly the model favors its own text. Controlled experiments indicate the link is causal rather than an artifact of style or length differences. If correct, this means any evaluation method that lets the same model generate and judge content will carry a built-in tilt toward the model's own patterns.

Core claim

Out of the box, LLMs such as GPT-4 and Llama 2 have non-trivial accuracy at distinguishing themselves from other LLMs and humans. Fine-tuning reveals a linear correlation between self-recognition capability and the strength of self-preference bias. Controlled experiments show the causal explanation resists straightforward confounders.
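To make the correlation claim concrete, here is a minimal sketch of how a linear relationship between the two quantities could be measured across fine-tuning checkpoints. The function name and the placeholder numbers are hypothetical and exist only to exercise the code; they are not results from the paper, and the paper's own statistical procedure may differ.

```python
import math


def pearson_and_slope(xs, ys):
    """Return (Pearson r, least-squares slope, intercept) for paired samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally sized samples with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("a constant sample has no defined correlation")
    r = sxy / math.sqrt(sxx * syy)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return r, slope, intercept


# Placeholder values (NOT from the paper): one point per hypothetical checkpoint.
recognition_accuracy = [0.5, 0.6, 0.7, 0.8]
self_preference = [0.50, 0.55, 0.60, 0.65]

r, slope, intercept = pearson_and_slope(recognition_accuracy, self_preference)
print(f"r={r:.3f} slope={slope:.3f} intercept={intercept:.3f}")
```

A high r on its own only shows that the two measures move together; the causal reading still depends on the controlled experiments the paper describes.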

What carries the argument

Self-recognition capability, defined as the accuracy with which an LLM classifies a given text sample as having been generated by itself versus by another source.
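As a rough illustration of this definition, the sketch below scores a batch of self-recognition judgments. The `Judgment` record and its field names are hypothetical, introduced only for this example; the paper's actual prompting and scoring protocol is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """One classification decision (hypothetical record layout)."""
    is_self: bool         # ground truth: the evaluator model wrote this text
    predicted_self: bool  # the evaluator's answer to "did you write this?"


def self_recognition_accuracy(judgments):
    """Fraction of texts whose source the evaluator classified correctly."""
    if not judgments:
        raise ValueError("no judgments to score")
    correct = sum(j.is_self == j.predicted_self for j in judgments)
    return correct / len(judgments)


# Illustrative batch: three correct calls and one miss.
batch = [
    Judgment(is_self=True, predicted_self=True),
    Judgment(is_self=False, predicted_self=False),
    Judgment(is_self=True, predicted_self=False),
    Judgment(is_self=False, predicted_self=False),
]
print(self_recognition_accuracy(batch))
```

With a balanced mix of self-written and other-source texts, chance performance on this measure is 0.5, which is the baseline the "above-chance" claim is read against.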

If this is right

Self-preference bias will appear in reward modeling and constitutional AI whenever the same model generates and judges content.
Benchmarking that uses LLM judges will systematically over-rate outputs matching the judge's own generation style.
AI safety evaluations relying on self-evaluation risk under-valuing safety properties that differ from the evaluator's own patterns.
Unbiased automated evaluation requires either separate models for generation and judging or explicit controls that block self-recognition.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Training procedures that deliberately obscure a model's own stylistic fingerprints could reduce self-preference without harming other capabilities.
The same recognition mechanism may create broader familiarity biases when models evaluate any content drawn from distributions they have seen during training.
Analogous tests could check whether human evaluators show similar preference effects when scoring text from sources whose style they have internalized.

Load-bearing premise

The fine-tuning procedure and controlled experiments isolate self-recognition as the causal driver of self-preference rather than correlated changes in output style, length, or topic distribution.

What would settle it

An experiment that increases self-recognition accuracy through fine-tuning yet leaves self-preference scores unchanged would falsify the claimed causal link.

Original abstract

Self-evaluation using large language models (LLMs) has proven valuable not only in benchmarking but also methods like reward modeling, constitutional AI, and self-refinement. But new biases are introduced due to the same LLM acting as both the evaluator and the evaluatee. One such bias is self-preference, where an LLM evaluator scores its own outputs higher than others' while human annotators consider them of equal quality. But do LLMs actually recognize their own outputs when they give those texts higher scores, or is it just a coincidence? In this paper, we investigate if self-recognition capability contributes to self-preference. We discover that, out of the box, LLMs such as GPT-4 and Llama 2 have non-trivial accuracy at distinguishing themselves from other LLMs and humans. By fine-tuning LLMs, we discover a linear correlation between self-recognition capability and the strength of self-preference bias; using controlled experiments, we show that the causal explanation resists straightforward confounders. We discuss how self-recognition can interfere with unbiased evaluations and AI safety more generally.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that LLMs such as GPT-4 and Llama 2 exhibit non-trivial out-of-the-box accuracy in distinguishing their own generations from those of other LLMs and humans. Fine-tuning on a self-recognition objective produces a linear correlation between improved self-recognition capability and the strength of self-preference bias in evaluations. Controlled experiments are presented to argue that this relationship is causal and resists straightforward confounders such as changes in output style or distribution.

Significance. If the causal link between self-recognition and self-preference holds after rigorous controls, the result would be significant for LLM-based evaluation pipelines used in benchmarking, reward modeling, constitutional AI, and self-refinement. It would highlight a previously under-examined source of bias that could affect the reliability of automated evaluations and raise implications for AI safety when models evaluate their own outputs.

major comments (2)

[§4] (Controlled Experiments): The abstract states that controlled experiments demonstrate the causal link resists straightforward confounders, yet the precise controls (length-matched sampling, regression on lexical diversity or perplexity, topic entropy matching, or style-feature covariates) are not enumerated with sufficient detail or statistical reporting. Without these specifics, the linear correlation obtained via fine-tuning could still be driven by correlated shifts in generation properties rather than recognition per se.
[§3.2] (Fine-tuning Procedure): The fine-tuning objective for self-recognition is described at a high level, but the paper does not report whether generation-length, token-distribution, or stylistic statistics were explicitly regularized or measured before and after fine-tuning. If these properties change systematically, they constitute a plausible alternative driver of the observed preference scores.

minor comments (2)

[Results] Table 1 or equivalent results table: report exact sample sizes, number of generations per model, and confidence intervals or p-values for the out-of-the-box discrimination accuracies.
[Results] Figure 2 (correlation plot): clarify whether the x-axis (self-recognition accuracy) and y-axis (self-preference delta) are computed on held-out data or on the fine-tuning distribution.
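One common way to report the interval the referee asks for is a Wilson score interval on a binomial proportion. The sketch below is an assumption about how such a figure could be computed, not the authors' procedure; the function name and the default z = 1.96 (roughly 95% coverage) are choices made for this illustration.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion such as accuracy."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials and trials > 0")
    p_hat = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p_hat + z2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z2 / (4 * trials * trials))
    ) / denom
    return center - half_width, center + half_width


# Illustrative counts only: 50 correct classifications out of 100.
low, high = wilson_interval(50, 100)
print(f"95% interval: [{low:.3f}, {high:.3f}]")
```

An out-of-the-box accuracy would count as above chance under this check only if the lower bound sits above 0.5, which is why the sample size matters as much as the point estimate.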

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments have prompted us to strengthen the exposition of our controls and fine-tuning measurements. We address each major comment below and have prepared revisions that directly incorporate the requested details and statistical reporting.

Referee: [§4] (Controlled Experiments): The abstract states that controlled experiments demonstrate the causal link resists straightforward confounders, yet the precise controls (length-matched sampling, regression on lexical diversity or perplexity, topic entropy matching, or style-feature covariates) are not enumerated with sufficient detail or statistical reporting. Without these specifics, the linear correlation obtained via fine-tuning could still be driven by correlated shifts in generation properties rather than recognition per se.

Authors: We agree that the current description of the controls in §4 lacks the granularity needed to fully address potential alternative explanations. In the revised manuscript we have expanded this section to enumerate the controls explicitly and to report the associated statistics. Specifically, we applied length-matched sampling by restricting comparisons to generation pairs whose token lengths differed by at most 5%; we performed ordinary-least-squares regressions that included type-token ratio and perplexity as covariates; we matched generations on topic entropy derived from LDA models; and we added style-feature covariates (average sentence length, punctuation density, and vocabulary richness) to the preference-score models. The revised text now includes a table of regression results showing that the coefficient on self-recognition accuracy remains positive and significant (p < 0.01) after inclusion of these controls. These additions directly respond to the concern that generation-property shifts could drive the observed correlation.
revision: yes

Referee: [§3.2] (Fine-tuning Procedure): The fine-tuning objective for self-recognition is described at a high level, but the paper does not report whether generation-length, token-distribution, or stylistic statistics were explicitly regularized or measured before and after fine-tuning. If these properties change systematically, they constitute a plausible alternative driver of the observed preference scores.

Authors: We acknowledge that §3.2 would benefit from explicit reporting of these statistics. In the revision we have added a paragraph and an accompanying supplementary table that document the measurements taken before and after fine-tuning. No explicit regularization on length, token distribution, or style was applied during fine-tuning, in order to preserve the model’s natural generation behavior. Post-hoc checks nevertheless show that mean generation length changed by fewer than three tokens, KL divergence between pre- and post-fine-tuning token distributions remained below 0.05, and differences in stylistic metrics (Flesch reading-ease score and type-token ratio) were statistically non-significant (two-sample t-tests, all p > 0.1). These results are now reported so that readers can evaluate whether systematic distributional shifts could explain the preference-score changes.
revision: yes
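Two of the checks named in the simulated responses above, length matching within 5% and KL divergence between token distributions, can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the 5% tolerance is taken relative to the longer text, the distributions are unigram frequencies with add-alpha smoothing, and both function names are hypothetical.

```python
import math
from collections import Counter


def length_matched(len_a, len_b, tolerance=0.05):
    """True if two token lengths differ by at most `tolerance`.

    Assumption: the difference is measured relative to the longer text.
    """
    longer = max(len_a, len_b)
    if longer == 0:
        return True  # two empty texts are trivially matched
    return abs(len_a - len_b) / longer <= tolerance


def kl_divergence(tokens_p, tokens_q, alpha=1.0):
    """KL(P || Q) in nats between smoothed unigram token distributions.

    Assumption: add-alpha smoothing over the union vocabulary, so every
    token has non-zero probability under both distributions.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    counts_p, counts_q = Counter(tokens_p), Counter(tokens_q)
    vocab = set(counts_p) | set(counts_q)
    if not vocab:
        raise ValueError("both token lists are empty")
    total_p = sum(counts_p.values()) + alpha * len(vocab)
    total_q = sum(counts_q.values()) + alpha * len(vocab)
    kl = 0.0
    for token in vocab:
        p = (counts_p[token] + alpha) / total_p
        q = (counts_q[token] + alpha) / total_q
        kl += p * math.log(p / q)
    return kl


print(length_matched(100, 96))  # 4% apart relative to the longer text
print(kl_divergence("the cat sat".split(), "the dog sat".split()))
```

The smoothing constant and the choice of unigram counts both affect the absolute KL value, so a threshold such as 0.05 is only meaningful once those choices are fixed and reported.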

Circularity Check

0 steps flagged

Empirical study with external benchmarks; no derivation reduces to inputs by construction

The paper conducts an empirical investigation using out-of-the-box discrimination accuracy on model outputs, fine-tuning to observe correlations, and controlled experiments comparing to human annotations. No mathematical derivation chain exists that equates a 'prediction' or result to its inputs by definition or self-citation. Fine-tuning introduces hyperparameters but functions as an experimental manipulation rather than a fitted parameter renamed as a prediction. Central claims rely on direct comparisons to external human judgments and other LLMs, making the work self-contained against benchmarks. Any self-citations are not load-bearing for the empirical findings.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is primarily empirical and relies on standard assumptions about LLM behavior and evaluation protocols rather than new axioms or invented entities.

axioms (1)

Domain assumption: LLM outputs can be meaningfully compared for quality by both the model itself and human annotators under the same rubric.
Basis: implicit in the self-preference and self-recognition measurement setup described in the abstract.

Forward citations

Cited by 27 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Generative-Evaluative Agreement: A Necessary Validity Criterion for LLM-Enabled Adaptive Assessment
cs.AI 2026-05 unverdicted novelty 7.0

Defines GEA validity criterion and reports first measurement of r=0.698 recovery with positive bias in LLM two-stage adaptive assessment, stronger for verifiable skills.
DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows
cs.AI 2026-05 unverdicted novelty 7.0

DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under p...
Likelihood scoring for continuations of mathematical text: a self-supervised benchmark with tests for shortcut vulnerabilities
cs.LG 2026-05 unverdicted novelty 7.0

Presents a likelihood-based benchmark for equation-suffix prediction in technical papers with controls to detect shortcut vulnerabilities in model forecasts.
The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice
cs.LG 2026-05 unverdicted novelty 7.0

An identification theorem shows that a randomized experiment and simulator together recover causal model values from confounded logs, with logs used only afterward to reduce estimation error.
SycoPhantasy: Quantifying Sycophancy and Hallucination in Small Open Weight VLMs for Vision-Language Scoring of Fantasy Characters
cs.CV 2026-04 unverdicted novelty 7.0

Small VLMs show higher sycophancy (22.3% for 450M model) than larger ones (6.0% for 7B) when scoring image-text alignment on 173k fantasy portraits, quantified via a new Bluffing Coefficient metric.
Self-Preference Bias in Rubric-Based Evaluation of Large Language Models
cs.CL 2026-04 unverdicted novelty 7.0

Rubric-based LLM judges show self-preference bias, incorrectly marking their own failed outputs as satisfied up to 50% more often on verifiable benchmarks and skewing scores by 10 points on subjective ones.
AMEL: Accumulated Message Effects on LLM Judgments
cs.AI 2026-05 conditional novelty 6.0

LLMs exhibit an accumulated message effect where conversation history saturated with positive or negative evaluations biases subsequent judgments, with larger shifts on uncertain items, a negativity asymmetry, and no ...
Dimension-Level Intent Fidelity Evaluation for Large Language Models: Evidence from Structured Prompt Ablation
cs.CL 2026-05 unverdicted novelty 6.0

Dimension-level evaluation reveals that 25-58% of LLM outputs with perfect holistic scores still show measurable intent deficits across languages and domains.
Likelihood scoring for continuations of mathematical text: a self-supervised benchmark with tests for shortcut vulnerabilities
cs.LG 2026-05 unverdicted novelty 6.0

A new benchmark uses separate predictor and scorer LLMs to test whether forecast strings improve likelihood of hidden mathematical equation continuations, with controls that detect priming shortcuts.
Automated alignment is harder than you think
cs.AI 2026-05 conditional novelty 6.0

AI agents automating alignment research are prone to systematic undetected errors in fuzzy tasks, leading to overconfident but flawed safety assessments even without deliberate sabotage.
Automated alignment is harder than you think
cs.AI 2026-05 unverdicted novelty 6.0

Automating alignment research with AI agents risks undetected systematic errors in fuzzy tasks, producing overconfident but misleading safety evaluations that could enable deployment of misaligned AI.
Automated alignment is harder than you think
cs.AI 2026-05 unverdicted novelty 6.0

Automating alignment research with AI agents risks generating hard-to-detect errors in fuzzy tasks, producing misleading safety evaluations even without deliberate sabotage.
RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization
cs.CL 2026-05 unverdicted novelty 6.0

RLearner-LLM's Hybrid-DPO fuses DeBERTa NLI and LLM verifier scores to deliver up to 6x higher NLI entailment than standard SFT while preserving answer coverage across academic domains.
RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization
cs.CL 2026-05 unverdicted novelty 6.0

RLearner-LLM achieves up to 6x gains in NLI entailment over standard fine-tuning by using an automated hybrid DPO pipeline that balances logic and fluency across multiple model sizes and domains.
When the Forger Is the Judge: GPT-Image-2 Cannot Recognize Its Own Faked Documents
cs.CV 2026-04 accept novelty 6.0

GPT-Image-2 document forgeries evade human and computational detection while traditional tampering remains detectable, with the model itself failing as a self-judge.
Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules
cs.AI 2026-04 unverdicted novelty 6.0

Language models refuse 75.4% of requests to evade defeated rules and do so even after recognizing reasons that undermine the rule's legitimacy.
Self-Preference Bias in LLM-as-a-Judge
cs.CL 2024-10 unverdicted novelty 6.0

LLMs judge their own outputs higher because they assign better scores to lower-perplexity text, even when the text is not self-generated.
Compiling Agentic Workflows into LLM Weights: Near-Frontier Quality at Two Orders of Magnitude Less Cost
cs.AI 2026-05 unverdicted novelty 5.0

Compiling agentic workflows into LLM weights creates subterranean agents with near-frontier quality at two orders of magnitude less cost, validated empirically on travel booking, Zoom support, and insurance claims tasks.
Pramana: A Protocol-Layer Treatment of Claim Verification in Autonomous Agent Networks
cs.CR 2026-05 unverdicted novelty 5.0

Pramana defines a typed ClaimAttestation protocol with four variants and verify operations, specifies its lifecycle in TLA+, model-checks it with TLC, and provides a tested Python implementation for auditable agent claims.
Position: Uncertainty Quantification in LLMs is Just Unsupervised Clustering
cs.CL 2026-05 unverdicted novelty 5.0

Mainstream UQ for LLMs reduces to unsupervised clustering of internal generation consistency and therefore cannot detect confident hallucinations or provide reliable safety signals.
RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization
cs.CL 2026-05 unverdicted novelty 5.0

Hybrid-DPO combining NLI and verifier scores delivers up to 6x NLI improvement over SFT baselines across multiple LLMs and domains while preserving answer coverage and inference speed.
STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator
cs.AI 2026-04 unverdicted novelty 5.0

STELLAR-E modifies the TGRT Self-Instruct framework to produce tailored synthetic LLM evaluation datasets that score an average 5.7% higher on LLM-as-a-judge metrics than existing language-specific benchmarks.
Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines
cs.AI 2026-04 unverdicted novelty 5.0

Style bias dominates LLM-as-a-Judge systems far more than position bias, with debiasing strategies providing model-dependent gains and public tools released for replication.
RAG-DIVE: A Dynamic Approach for Multi-Turn Dialogue Evaluation in Retrieval-Augmented Generation
cs.IR 2026-01 unverdicted novelty 5.0

RAG-DIVE uses an LLM to dynamically generate, validate, and evaluate multi-turn dialogues for assessing RAG system performance in interactive settings.
Synthetic Eggs in Many Baskets: The Impact of Synthetic Data Diversity on LLM Fine-Tuning
cs.CL 2025-11 unverdicted novelty 4.0

Fine-tuning LLMs on multi-source synthetic data mitigates distribution collapse and self-preference bias while increasing output quality relative to single-source or human-only fine-tuning.
Multi-Stage Retrieval for Operational Technology Cybersecurity Compliance Using Large Language Models: A Railway Casestudy
cs.AI 2025-04 unverdicted novelty 3.0

A parallel compliance architecture using multi-stage LLM retrieval improves correctness and reasoning quality over a baseline for OT cybersecurity compliance queries in a railway case study.
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
cs.CL 2024-12 accept novelty 3.0

A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.


[1] [1]

Knowl- edge of knowledge: Exploring known-unknowns un- certainty with large language models

Amayuelas, A., Pan, L., Chen, W., and Wang, W. Knowl- edge of knowledge: Exploring known-unknowns un- certainty with large language models. arXiv preprint arXiv:2305.13712,

work page arXiv

[2] [2]

Constitutional AI: Harmlessness from AI Feedback

Bai, Y ., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKin- non, C., et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073,

work page internal anchor Pith review Pith/arXiv arXiv

[3]

Berglund, L., Stickland, A. C., Balesni, M., Kaufmann, M., Tong, M., Korbak, T., Kokotajlo, D., and Evans, O. Taken out of context: On measuring situational awareness in LLMs. arXiv preprint arXiv:2309.00667,

[4]

Visit-bench: A benchmark for vision-language instruction following inspired by real-world use

Bitton, Y., Bansal, H., Hessel, J., Shao, R., Zhu, W., Awadalla, A., Gardner, J., Taori, R., and Schmidt, L. Visit-bench: A benchmark for vision-language instruction following inspired by real-world use. Advances in Neural Information Processing Systems, 2023a. Bitton, Y., Bansal, H., Hessel, J., Shao, R., Zhu, W., Awadalla, A., Gardner, J., Taori, R....

[5]

ISSN 1939-1854. doi: 10.1037/h0057532. Place: US. Publisher: American Psychological Association. Fu, J., Ng, S.-K., Jiang, Z., and Liu, P. GPTScore: Evaluate as You Desire, February

[6]

GPTScore: Evaluate as You Desire

URL http://arxiv.org/abs/2302.04166. arXiv:2302.04166 [cs]. Hackl, V., Müller, A. E., Granitzer, M., and Sailer, M. Is GPT-4 a reliable rater? Evaluating Consistency in GPT-4 Text Ratings. Frontiers in Education, 8:1272229, December

[7]

ISSN 2504-284X. doi: 10.3389/feduc.2023.1272229. URL http://arxiv.org/abs/2308.02575. arXiv:2308.02575 [cs]. Hans, A., Schwarzschild, A., Cherepanova, V., Kazemi, H., Saha, A., Goldblum, M., Geiping, J., and Goldstein, T. Spotting LLMs with binoculars: Zero-shot detection of machine-generated text. arXiv preprint arXiv:2401.12070,

[8]

Research submission to the Evals research sprint hosted by Apart Research. Jawahar, G., Abdul-Mageed, M., and Lakshmanan, L. V. Automatic detection of machine generated text: A critical survey. arXiv preprint arXiv:2011.01314,

[9]

Language Models (Mostly) Know What They Know

Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., Schiefer, N., Hatfield-Dodds, Z., DasSarma, N., Tran-Johnson, E., et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221,

[10]

Koo, R., Lee, M., Raheja, V., Park, J. I., Kim, Z. M., and Kang, D. Benchmarking cognitive biases in large language models as evaluators. arXiv preprint arXiv:2309.17012,

[11]

A survey of AI-generated text forensic systems: Detection, attribution, and characterization

Kumarage, T., Agrawal, G., Sheth, P., Moraffah, R., Chadha, A., Garland, J., and Liu, H. A survey of AI-generated text forensic systems: Detection, attribution, and characterization. arXiv preprint arXiv:2403.01152,

[12]

RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

Lee, H., Phatale, S., Mansoor, H., Lu, K., Mesnard, T., Bishop, C., Carbune, V., and Rastogi, A. RLAIF: Scaling reinforcement learning from human feedback with AI feedback. arXiv preprint arXiv:2309.00267,

[13]

Scalable agent alignment via reward modeling: a research direction

Leike, J., Krueger, D., Everitt, T., Martic, M., Maini, V., and Legg, S. Scalable agent alignment via reward modeling: a research direction. arXiv preprint arXiv:1811.07871,

[14]

URL https://github.com/tatsu-lab/alpaca_eval. original-date: 2023-05-25T09:35:28Z. Liu, Y., Moosavi, N. S., and Lin, C. LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores, November

[15]

URL https://arxiv.org/abs/2311.09766v1. Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., Gupta, S., Majumder, B. P., Hermann, K., Welleck, S., Yazdanbakhsh, A., and Clark, P. Self-Refine: Iterative Refinement with Self-Feedback, May

[16]

Self-Refine: Iterative Refinement with Self-Feedback

URL http://arxiv.org/abs/2303.17651. arXiv:2303.17651 [cs]. Mitchell, E., Lee, Y., Khazatsky, A., Manning, C. D., and Finn, C. DetectGPT: Zero-shot machine-generated text detection using probability curvature. In International Conference on Machine Learning, pp. 24950–24962. PMLR,

[17]

Association for Computational Linguistics. doi: 10.18653/v1/K16-1028. URL https://aclanthology.org/K16-1028. Narayan, S., Cohen, S. B., and Lapata, M. Don’t Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization, August

[18]

URL http://arxiv.org/abs/1808.08745. arXiv:1808.08745 [cs] version:

[19]

GPT-4 Technical Report

OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774,

[20]

Feedback loops with language models drive in-context reward hacking

Pan, A., Jones, E., Jagadeesan, M., and Steinhardt, J. Feedback loops with language models drive in-context reward hacking. arXiv preprint arXiv:2402.06627,

[21]

URL http://arxiv.org/abs/2311.08576. arXiv:2311.08576 [cs]. Pezeshkpour, P. and Hruschka, E. Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions, August

[22]

URL http://arxiv.org/abs/2308.11483. arXiv:2308.11483 [cs]. Raina, V., Liusie, A., and Gales, M. Is LLM-as-a-judge robust? Investigating universal adversarial attacks on zero-shot LLM assessment. arXiv preprint arXiv:2402.14016,

[23]

Self-critiquing models for assisting human evaluators

Saunders, W., Yeh, C., Wu, J., Bills, S., Ouyang, L., Ward, J., and Leike, J. Self-critiquing models for assisting human evaluators. arXiv preprint arXiv:2206.05802,

[24]

URL http://arxiv.org/abs/2310.07611. arXiv:2310.07611 [cs]. Shridhar, K., Sinha, K., Cohen, A., Wang, T., Yu, P., Pasunuru, R., Sachan, M., Weston, J., and Celikyilmaz, A. The art of LLM refinement: Ask, refine, and trust. arXiv preprint arXiv:2311.07961,

[25]

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

URL http://arxiv.org/abs/2206.04615. arXiv:2206.04615 [cs, stat]. Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., Radford, A., Amodei, D., and Christiano, P. F. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021,

[26]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288,

[27]

MM-SAP: A comprehensive benchmark for assessing self-awareness of multimodal large language models in perception

Wang, Y., Liao, Y., Liu, H., Liu, H., Wang, Y., and Wang, Y. MM-SAP: A comprehensive benchmark for assessing self-awareness of multimodal large language models in perception. arXiv preprint arXiv:2401.07529,

[28]

Recursively Summarizing Books with Human Feedback

Wu, J., Ouyang, L., Ziegler, D. M., Stiennon, N., Lowe, R., Leike, J., and Christiano, P. Recursively summarizing books with human feedback. arXiv preprint arXiv:2109.10862,

[29]

Wu, J., Yang, S., Zhan, R., Yuan, Y., Wong, D. F., and Chao, L. S. A survey on LLM-generated text detection: Necessity, methods, and future directions. arXiv preprint arXiv:2310.14724,

[30]

Xu, W., Zhu, G., Zhao, X., Pan, L., Li, L., and Wang, W. Y. Perils of self-feedback: Self-bias amplifies in large language models. arXiv preprint arXiv:2402.11436,

[31]

Yang, X., Pan, L., Zhao, X., Chen, H., Petzold, L., Wang, W. Y., and Cheng, W. A survey on detection of LLMs-generated content. arXiv preprint arXiv:2310.15654,

[32]

Yin, Z., Sun, Q., Guo, Q., Wu, J., Qiu, X., and Huang, X. Do large language models know what they don’t know? arXiv preprint arXiv:2305.18153,

[33]

URL http://arxiv.org/abs/2308.01240. arXiv:2308.01240 [cs]. Zeng, Z., Yu, J., Gao, T., Meng, Y., Goyal, T., and Chen, D. Evaluating Large Language Models at Evaluating Instruction Following, October

[34]

URL http://arxiv.org/abs/2310.07641. arXiv:2310.07641 [cs]. Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36,