Recognition: unknown
Iterative Finetuning is Mostly Idempotent
Pith reviewed 2026-05-09 19:01 UTC · model grok-4.3
The pith
Training models iteratively on their own outputs mostly causes seeded traits to decay or stabilize rather than amplify.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When a model is seeded with a persona or belief and then a series of successor models are each finetuned on data generated by their immediate predecessor, supervised finetuning and synthetic document finetuning produce trait decay or constancy, rendering the process idempotent after the first cycle. In direct preference optimization, trait amplification occurs reliably only under continual training that prefers the model's own outputs; the effect disappears when each cycle reinitializes from the original model. Any amplification that does appear tends to reduce output coherence.
What carries the argument
The closed iterative loop of persona-seeded data generation followed by finetuning of the next model, with trait strength tracked across cycles under SFT, SDF, and DPO.
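A minimal sketch of that loop, with the generation, finetuning, and scoring steps passed in as callables because the paper exposes no API; all names here are placeholders rather than the authors' code:

```python
from typing import Any, Callable


def run_cycles(seeded_model: Any,
               n_cycles: int,
               generate_data: Callable[[Any], Any],
               finetune: Callable[[Any, Any], Any],
               measure_trait: Callable[[Any], float],
               measure_coherence: Callable[[Any], float]) -> list[dict]:
    """Closed loop: each successor is finetuned on its predecessor's outputs."""
    model, history = seeded_model, []
    for cycle in range(1, n_cycles + 1):
        data = generate_data(model)    # predecessor generates the training corpus
        model = finetune(model, data)  # next model trained on that corpus (SFT, SDF, or DPO)
        history.append({
            "cycle": cycle,
            "trait": measure_trait(model),          # trait-elicitation score
            "coherence": measure_coherence(model),  # coherence metric
        })
    return history
```

Idempotence, in this framing, is the observation that the trait entries in `history` are essentially flat after the first cycle for SFT and SDF.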
If this is right
- Further cycles beyond the first produce negligible additional change to trait levels in SFT and SDF.
- DPO amplification requires unbroken continual training on self-generated preferences rather than repeated resets (the DPO objective is sketched after this list).
- Trait increases, when they occur, are usually accompanied by measurable losses in coherence.
- In non-RL finetuning, trait amplification is rare and highly sensitive to the exact quantity of self-generated data, making accidental amplification unlikely.
- Limiting the continual post-training stage reduces the opportunity for trait amplification to develop.
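For reference, the DPO objective the paper states it uses in all DPO experiments is the standard loss of Rafailov et al. (2023); y_i^+ and y_i^- are the preferred and rejected completions for prompt x_i, π_θ is the current policy, π_ref the reference policy, and β the usual strength hyperparameter:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\frac{1}{N} \sum_{i=1}^{N}
    \log \sigma\!\left( \beta \left[
      \log \frac{\pi_\theta(y_i^{+} \mid x_i)}{\pi_{\mathrm{ref}}(y_i^{+} \mid x_i)}
      - \log \frac{\pi_\theta(y_i^{-} \mid x_i)}{\pi_{\mathrm{ref}}(y_i^{-} \mid x_i)}
    \right] \right)
```

In the continual setup, π_θ keeps training across cycles on preferences that favor its own outputs, while the reinitialized setup restarts π_θ from the original model each cycle; the paper reports reliable amplification only in the former.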
Where Pith is reading between the lines
- Safety interventions focused on the post-training phase may be more effective than changes to base pretraining.
- The observed decay pattern could be tested on larger models or with traits drawn from real user interactions rather than artificial personas.
- The coherence penalty may act as an automatic safeguard that limits the practical spread of amplified behaviors.
- Combining iterative finetuning with other preference methods might alter the decay or amplification dynamics observed here.
Load-bearing premise
The seeded personas and the outputs they produce accurately capture genuine behavioral tendencies without artifacts introduced by the data-generation process or the scale of the models tested.
What would settle it
A clear rise in measured trait scores across three or more successive finetuning cycles in the SFT regime, accompanied by stable or improving coherence scores, would contradict the reported decay and idempotence.
Original abstract
If a model has some behavioral tendency, such as sycophancy or misalignment, and it is trained on its own outputs, will the tendency be amplified in the next generation of models? We study this question by training a series of models where each model is finetuned on data generated by its predecessor, and the initial model is seeded with some persona or belief. We test three settings: supervised finetuning (SFT) on instruct models, synthetic document finetuning (SDF) on base models, and direct preference optimization (DPO). In the SFT and SDF settings, traits mostly decay or remain constant so that further finetuning cycles do nothing. In rare cases when amplification occurs, it generally comes at the cost of coherence. In the DPO setting, trait amplification can reliably occur when a model is continually trained with a preference for its own outputs, but vanishes when models are reinitialized at each cycle. Overall, our results suggest that amplification most likely comes from continual post-training, and limiting this stage may be an effective defense. For non-RL finetuning, trait amplification is rare and very sensitive to data quantity, making it significantly less likely to occur accidentally. Finally, the amplification-coherence tradeoff serves as a natural deterrent against trait amplification.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper examines whether behavioral traits (e.g., sycophancy or misalignment) amplify across iterative finetuning cycles when each model is trained on outputs from its predecessor. Initial models are seeded with personas or beliefs via SFT. Three settings are tested: SFT on instruct models, synthetic document finetuning (SDF) on base models, and DPO. Results indicate that in SFT and SDF, traits mostly decay or remain constant (rendering further cycles idempotent), with rare amplification coming at the cost of coherence. In DPO, reliable amplification occurs under continual training on own outputs but disappears upon reinitialization each cycle. The authors conclude that amplification primarily stems from continual post-training and is rare/sensitive to data quantity in non-RL finetuning, suggesting limiting post-training as a defense.
Significance. If the empirical patterns hold, the work provides useful evidence that self-iterative finetuning does not generally propagate or amplify seeded traits in standard SFT/SDF regimes, which bears on AI safety discussions of model drift and misalignment risks. The contrast between continual vs. reinitialized DPO isolates the role of training continuity. Credit is due for the multi-setting experimental design that directly measures decay vs. amplification while tracking coherence tradeoffs, and for grounding claims in observed outputs rather than derivations.
major comments (2)
- [§3 and §4] §3 (Experimental Setup) and §4 (Results): The central claim that traits 'mostly decay or remain constant' in SFT/SDF (and that DPO amplification requires continual training) depends on the seeded personas/beliefs representing stable tendencies rather than narrow, prompt-induced transients. Without ablations on seeding data specificity, prompt context breadth, or trait persistence measurements on the initial model alone, observed decay may simply reflect reversion to the base distribution; this directly affects the idempotence interpretation and the recommendation to limit post-training.
- [§4] §4 (Results, all settings): The abstract and results describe 'consistent directional trends' and 'mostly' decay/amplification without reporting exact data volumes, model sizes, number of traits tested, or statistical tests (e.g., significance of decay rates or coherence metrics). This weakens assessment of the 'mostly' qualifier and generalizability beyond the tested conditions.
minor comments (2)
- Notation for settings (SFT, SDF, DPO) is clear but could include a summary table early in the paper for quick reference across experiments.
- Figure captions should explicitly state the number of runs or seeds used to generate error bars or trends.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the scope and robustness of our claims. We address each major point below and will incorporate revisions to strengthen the experimental interpretation and reporting.
Point-by-point responses
Referee: [§3 and §4] §3 (Experimental Setup) and §4 (Results): The central claim that traits 'mostly decay or remain constant' in SFT/SDF (and that DPO amplification requires continual training) depends on the seeded personas/beliefs representing stable tendencies rather than narrow, prompt-induced transients. Without ablations on seeding data specificity, prompt context breadth, or trait persistence measurements on the initial model alone, observed decay may simply reflect reversion to the base distribution; this directly affects the idempotence interpretation and the recommendation to limit post-training.
Authors: We agree that confirming trait stability on the initial seeded model is necessary to rule out simple reversion. The original experiments seeded models via targeted SFT on persona-specific data and verified the presence of traits on held-out evaluation prompts before iterative cycles began. To directly address the concern, we will add: (1) explicit persistence measurements on the initial model alone across multiple prompt contexts (narrow vs. broad), (2) ablations varying seeding data specificity (e.g., single-sentence vs. multi-paragraph persona descriptions), and (3) comparisons of decay rates under different prompt breadths. These additions will show that decay is not merely reversion but occurs even when traits are robustly established, supporting the idempotence claim while qualifying the post-training defense recommendation. revision: yes
Referee: [§4] §4 (Results, all settings): The abstract and results describe 'consistent directional trends' and 'mostly' decay/amplification without reporting exact data volumes, model sizes, number of traits tested, or statistical tests (e.g., significance of decay rates or coherence metrics). This weakens assessment of the 'mostly' qualifier and generalizability beyond the tested conditions.
Authors: We acknowledge the value of precise quantitative reporting. In the revision we will add: exact token volumes and example counts per finetuning cycle and setting, model sizes used (including parameter counts), the total number of traits and categories tested, and statistical tests (e.g., paired t-tests or Wilcoxon tests with p-values) for decay/amplification trends and coherence changes. These details will be presented in new tables or appendices, allowing readers to evaluate the strength of the 'mostly' qualifier and the conditions under which amplification remains rare. revision: yes
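As a hedged illustration of the proposed reporting (not the paper's data or code), a one-sided Wilcoxon signed-rank test on paired per-trait scores from two cycles could look like the following, with the score arrays as invented placeholders:

```python
# Illustrative only: test whether trait-elicitation scores decayed
# between cycle 1 and cycle 3 across a handful of traits.
from scipy.stats import wilcoxon

cycle1_scores = [0.62, 0.55, 0.71, 0.48, 0.66]  # placeholder values, one per trait
cycle3_scores = [0.41, 0.52, 0.49, 0.47, 0.50]  # same traits, two cycles later

# alternative="greater" tests H1: cycle-1 scores exceed cycle-3 scores (decay).
stat, p = wilcoxon(cycle1_scores, cycle3_scores, alternative="greater")
print(f"Wilcoxon statistic = {stat:.2f}, one-sided p = {p:.3f}")
```

A paired t-test (scipy.stats.ttest_rel) would be the parametric counterpart mentioned in the response.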
Circularity Check
No circularity: purely empirical observations from training cycles
Full rationale
The paper reports direct experimental results from seeding models with personas or beliefs and iteratively finetuning them on their own outputs across SFT, SDF, and DPO regimes. Claims about trait decay, constancy, or amplification are presented as outcomes of observed model generations and coherence metrics, not as derivations, predictions, or first-principles results. No equations, ansatzes, uniqueness theorems, or self-citations are used to force conclusions; the work contains no load-bearing steps that reduce to fitted inputs or self-referential definitions by construction. This is the standard case of an experimental paper whose central findings remain independent of any internal definitional loop.