Self-Preference Is Weak or Absent in Verifiable Instruction-Following Revision: A Four-Model Test Under Genuine Authorship

Pierrick Bougault; William Guey

arxiv: 2606.20093 · v1 · pith:NPZSKQGSnew · submitted 2026-06-18 · 💻 cs.CL


Pith reviewed 2026-06-26 17:24 UTC · model grok-4.3

classification 💻 cs.CL

keywords self-preference bias · instruction-following · text revision · large language models · IFEval · authorship · model evaluation · bias detection


The pith

Large language models do not resist verified corrections to their own drafts more than neutral models do.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper investigates whether models show self-preference by rejecting good edits to their own text. It sets up a test using the IFEval benchmark where a deterministic checker confirms that a draft violates an instruction and that an edit fixes it. Models then judge the edit either as the author who wrote the draft or as a fresh model seeing it for the first time. The result across four models is that rejection rates are nearly identical, with authors slightly less likely to reject. This challenges the assumption that self-preference bias applies to revision decisions when validity is objectively verifiable.

Core claim

When models are presented with a verified-good edit to a draft they authored, they reject it at a rate not detectably different from that of models that did not author the draft: the author-minus-fresh rejection gap is -5.1 percentage points, with a 95% confidence interval from -12.9 to +2.7 percentage points.
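As a quick arithmetic check on these figures, the sketch below recovers the interval's midpoint, half-width, and an implied standard error. It assumes the reported interval is a symmetric normal-approximation (Wald) interval with z = 1.96; the interval method actually used in the paper is not stated here, so the standard error is illustrative only.

```python
# Arithmetic check on the reported gap and interval (values in percentage points).
# Assumption: a symmetric 95% interval of the form gap +/- 1.96 * SE.
gap_pp = -5.1
ci_low_pp, ci_high_pp = -12.9, 2.7

midpoint_pp = (ci_low_pp + ci_high_pp) / 2      # should coincide with the gap
half_width_pp = (ci_high_pp - ci_low_pp) / 2    # margin of error
implied_se_pp = half_width_pp / 1.96            # SE under the Wald assumption
contains_zero = ci_low_pp <= 0.0 <= ci_high_pp  # True means no detectable gap

print(f"midpoint   = {midpoint_pp:.1f} pp")
print(f"half-width = {half_width_pp:.1f} pp")
print(f"implied SE = {implied_se_pp:.2f} pp")
print(f"interval contains zero: {contains_zero}")
```

The midpoint matches the reported gap and the half-width is 7.8 pp. The interval bound farthest from zero is 12.9 pp, which is where the "about 13 pp" detection limit comes from.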

What carries the argument

The genuine authorship comparison using an objective instruction-following verifier to define valid edits.
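The comparison can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: `Checker`, `Comparison`, `is_verified_good`, `rejection_gap_pp`, and the toy question-mark constraint are all hypothetical names invented here, and the sign convention (author rate minus fresh rate) is inferred from the summary's statement that authors reject slightly less often.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-in for a deterministic constraint checker: returns True
# when the text satisfies the targeted instruction. This is not the IFEval API.
Checker = Callable[[str], bool]


@dataclass
class Comparison:
    """One author-versus-fresh comparison on a single verified edit."""
    author_rejects: bool  # decision made as the in-context author of the draft
    fresh_rejects: bool   # decision made by a model seeing the draft neutrally


def is_verified_good(draft: str, edit: str, check: Checker) -> bool:
    """An edit qualifies only if the draft fails the check and the edit passes."""
    return (not check(draft)) and check(edit)


def rejection_gap_pp(comparisons: List[Comparison]) -> float:
    """Author rejection rate minus fresh rejection rate, in percentage points."""
    n = len(comparisons)
    author_rate = sum(c.author_rejects for c in comparisons) / n
    fresh_rate = sum(c.fresh_rejects for c in comparisons) / n
    return 100.0 * (author_rate - fresh_rate)


def ends_with_question(text: str) -> bool:
    """Toy constraint (illustrative): the response must end with a question mark."""
    return text.rstrip().endswith("?")


draft = "Here is a summary of the report."
edit = "Here is a summary of the report. Does this cover what you need?"
print(is_verified_good(draft, edit, ends_with_question))

# Made-up decisions for four comparisons, used only to exercise the function.
toy = [Comparison(False, False), Comparison(False, True),
       Comparison(True, True), Comparison(False, False)]
print(rejection_gap_pp(toy))
```

Under this convention a negative gap means the author arm rejects less often than the fresh arm.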

If this is right

AI writing assistants can incorporate self-revisions without special bias corrections for ownership.
Self-preference appears limited to subjective judgments rather than objective constraint satisfaction.
When authors do reject a verified-good fix, 97% of their stated reasons are flaw-catching rather than preference.
At this sample size, only effects larger than about 13 percentage points can be ruled out.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The absence of bias may allow more seamless multi-model collaboration on constrained writing tasks.
Extending the method to other domains with objective verifiers, such as code linting, could test generality.
If small biases exist below the detection threshold, they might accumulate in long revision chains.
Human authors might behave differently, suggesting a need for parallel human tests.

Load-bearing premise

The official IFEval checker accurately identifies corrections that genuinely improve adherence to the instruction without introducing other issues.

What would settle it

A larger study or one using a different objective verifier that finds authors reject significantly more verified fixes than fresh models.

Original abstract

Large language models (LLMs) increasingly review and revise text, including their own. A documented self-preference bias (models favoring their own generations when acting as judges) raises the question of whether models also resist valid corrections to their own writing. We test this in a setting where "valid" is decided not by another model but by a deterministic verifier: instruction-following revision on IFEval. A model writes a draft; the official IFEval checker confirms the draft violates a constraint and that a candidate edit fixes it; the model then accepts or rejects that edit either as the genuine in-context author or as a fresh model that sees the draft neutrally. Across four mid-tier model families and 85 author-versus-fresh comparisons, we find no detectable self-preference: authors reject verified-good fixes to their own drafts at essentially the same rate as fresh models judging the same drafts (gap -5.1 pp, 95% CI [-12.9, +2.7]). A self-skepticism hint from a smaller pilot did not replicate at scale. The one robust observation is qualitative: when authors do reject a verified-good fix, 97% of their stated reasons are flaw-catching rather than preference, that is, about the character of rejections, not an elevated rate. Effects smaller than ~13 pp cannot be excluded at this sample size.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Null result on self-preference in verifiable revision is credible on its own terms but the IFEval checker leaves open whether the edits are unambiguously good.


The paper's core result is that four mid-tier models reject verified-good fixes to their own drafts at roughly the same rate as fresh models do (gap of -5.1 pp, with a CI that crosses zero). The authors reach this by having a model write a draft that fails an IFEval constraint, confirming that a candidate edit passes the official checker, and then measuring acceptance when the model is the in-context author versus a neutral judge.

What is new is the protocol itself: genuine authorship plus a deterministic external verifier instead of another LLM as judge. That removes one source of circularity that earlier self-preference studies carried. The statistical reporting is also straightforward: pre-specified comparison, 85 paired trials, explicit CI.

The soft spot is exactly the one the stress-test note raises. The checker only certifies the targeted constraint; it does not check semantic fidelity, style, or side effects on other constraints. If a non-trivial share of the "verified-good" edits still contain defects that a reasonable reader would flag, then equal rejection rates simply show that both author and neutral models are doing ordinary quality control. The 97% flaw-catching reason rate is consistent with that possibility and does not rule it out. The sample size also means that effects smaller than about 13 pp remain undetectable.
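A toy example makes the concern concrete. The check below is a hypothetical single-constraint verifier written for illustration (a word limit); it is not the IFEval checker, and the draft and edit are invented. It shows how an edit can pass the targeted constraint while still being something a careful reader would reject.

```python
# Hypothetical single-constraint check: the response must contain at most
# `max_words` words. Written for illustration; it is not the IFEval checker.
def within_word_limit(text: str, max_words: int = 12) -> bool:
    return len(text.split()) <= max_words


draft = ("The quarterly report shows revenue grew by eight percent, "
         "driven mainly by new subscriptions in Europe.")
# This edit satisfies the length constraint by cutting the sentence mid-thought.
edit = "The quarterly report shows revenue grew by eight percent, driven mainly by"

print(within_word_limit(draft))  # the draft fails the targeted constraint
print(within_word_limit(edit))   # the edit passes, yet the text is now incomplete
```

A rejection of this edit, by an author or by a fresh model, would be ordinary quality control rather than evidence about authorship.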

The work is aimed at people studying LLM self-revision and judge bias. It is worth sending to referees because the design is cleaner than most prior attempts and the null finding is reported with appropriate caution, even if the interpretation of "valid correction" needs more scrutiny in revision.

Referee Report

2 major / 2 minor

Summary. The paper claims that self-preference bias is weak or absent when LLMs revise their own instruction-following drafts under verifiable conditions. Using the deterministic IFEval checker to confirm that a draft violates a constraint and that a candidate edit satisfies it, the study compares rejection rates of these verified-good edits when the model judges as the genuine author versus as a fresh model. Across four mid-tier model families and 85 paired author-versus-fresh comparisons, rejection rates show no detectable difference (gap -5.1 pp, 95% CI [-12.9, +2.7]). Rejections are overwhelmingly attributed to flaw-catching reasons (97%), and a self-skepticism signal from a pilot did not replicate.

Significance. If the result holds, the work supplies a controlled empirical measurement showing that self-preference does not appear in revision tasks where validity is defined by an external deterministic verifier rather than another model. Strengths include the use of genuine authorship conditions, pre-specified statistical comparison with confidence interval, and collection of qualitative reason codes for rejections. This contributes to understanding the boundary conditions of self-preference bias and has implications for LLM self-correction pipelines.

major comments (2)

[Abstract and the paragraph describing the verifier step] The central claim equates equal rejection rates with absence of self-preference only if the 85 verified-good edits are unambiguously improvements that should be accepted. The design uses the deterministic IFEval checker to confirm (a) the original draft violates a constraint and (b) the candidate edit satisfies it. This supplies an objective label, but the checker evaluates only the targeted constraint; it does not audit semantic fidelity, stylistic coherence, or side-effect violations of other constraints. If a non-negligible fraction of “verified-good” edits contain such defects, then rejections (by both authors and fresh models) can reflect ordinary quality control rather than authorship bias. The reported 97% “flaw-catching” reason rate is consistent with this possibility and does not rule it out. The statistical comparison th
[Results section on the statistical comparison] While the pre-specified analysis and reporting of the 95% CI are strengths, the manuscript should explicitly discuss whether any post-hoc exclusion rules or prompt variations across the four model families could affect the paired comparison, as these details bear on whether the CI fully accounts for design choices.

minor comments (2)

[Abstract] The abstract reference to a 'self-skepticism hint from a smaller pilot' would benefit from a one-sentence summary of the pilot result to orient readers before the non-replication statement.
[Methods] Ensure that exact prompt templates used for the author versus fresh conditions are provided in an appendix or supplementary material to support reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope of our claims. We respond to each major comment below.


Referee: [Abstract and the paragraph describing the verifier step] The central claim equates equal rejection rates with absence of self-preference only if the 85 verified-good edits are unambiguously improvements that should be accepted. The design uses the deterministic IFEval checker to confirm (a) the original draft violates a constraint and (b) the candidate edit satisfies it. This supplies an objective label, but the checker evaluates only the targeted constraint; it does not audit semantic fidelity, stylistic coherence, or side-effect violations of other constraints. If a non-negligible fraction of “verified-good” edits contain such defects, then rejections (by both authors and fresh models) can reflect ordinary quality control rather than authorship bias. The reported 97% “flaw-catching” reason rate is consistent with this possibili

Authors: We agree that the IFEval checker provides an objective but narrow verification limited to the targeted constraint and does not guarantee overall semantic fidelity or absence of other defects. This is a valid point about the scope of 'verified-good.' However, because the author and fresh-model conditions evaluate identical edits in a paired design, any unverified defects would affect rejection decisions equally in both arms; thus the absence of a rate difference still indicates no detectable authorship-based bias in this controlled setting. The 97% flaw-catching rate is consistent with quality-control behavior rather than preference. We will revise the abstract and verifier-step paragraph to explicitly qualify that verification is constraint-specific and to temper the interpretation of 'absence of self-preference' accordingly.
revision: yes
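The paired-design argument in this response can be checked with a small example. The sketch below uses invented data and a hypothetical helper, `paired_gap_pp`; it assumes, as the response does, that a defective edit is rejected by both arms alike. If authors and fresh models caught defects at different rates, the gap would move.

```python
from typing import List, Tuple

# Each pair is (author_rejects, fresh_rejects) for the same verified edit.
Pair = Tuple[bool, bool]


def paired_gap_pp(pairs: List[Pair]) -> float:
    """Author-minus-fresh rejection gap in percentage points.

    Only discordant pairs move the gap: pairs where both arms reject (or both
    accept) add equally to each rate and cancel in the difference.
    """
    author_only = sum(a and not f for a, f in pairs)
    fresh_only = sum(f and not a for a, f in pairs)
    return 100.0 * (author_only - fresh_only) / len(pairs)


# Illustrative data: 10 edits, none defective.
clean = [(False, False)] * 7 + [(True, False)] * 1 + [(False, True)] * 2

# The same 10 edits, but 3 of the jointly accepted ones now carry a defect
# that both arms catch, so those pairs flip to (reject, reject).
with_defects = ([(True, True)] * 3 + [(False, False)] * 4
                + [(True, False)] * 1 + [(False, True)] * 2)

print(paired_gap_pp(clean))
print(paired_gap_pp(with_defects))
```

Both calls return the same gap even though the overall rejection rate rises in each arm. This supports the narrow point that shared defects do not create a spurious gap; it does not show that the verified edits are defect-free.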
Referee: [Results section on the statistical comparison] While the pre-specified analysis and reporting of the 95% CI are strengths, the manuscript should explicitly discuss whether any post-hoc exclusion rules or prompt variations across the four model families could affect the paired comparison, as these details bear on whether the CI fully accounts for design choices.

Authors: We agree that explicit discussion of these design details would strengthen transparency. All four model families used the same core prompt template, with only minimal adaptations required for each model's chat format; no post-hoc exclusions were applied beyond the pre-specified criteria for including a paired comparison (i.e., cases where the edit was confirmed by the checker to fix the constraint). We will add a short paragraph in the Results section describing these choices and confirming that the paired CI calculation already incorporates the model-specific formatting variations.
revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical measurement against external deterministic verifier


The paper reports an experimental comparison of rejection rates for IFEval-verified edits, using the official deterministic checker to label drafts and candidate fixes. The central statistic (gap of -5.1 pp) is a direct empirical measurement across 85 author-versus-fresh comparisons; no equations, fitted parameters, or predictions are defined in terms of the target result. No self-citations are invoked to justify uniqueness or load-bearing premises, and the verifier is external and model-independent. The design is self-contained against the stated external benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the IFEval checker supplies an objective ground truth for valid edits; no free parameters are fitted to produce the gap statistic, and no new entities are postulated.

axioms (1)

domain assumption: The official IFEval checker correctly and objectively identifies whether a candidate edit fixes the violated constraint.
Invoked in the abstract when defining "verified-good fixes" that the model then accepts or rejects.



