
arxiv: 2605.29267 · v1 · pith:YQN6EQKEnew · submitted 2026-05-28 · 💻 cs.AI · cs.LG

When and How Human Curation Backfires: Preference Alignment under Multi-Model Self-Consuming Loop

Pith reviewed 2026-06-29 07:33 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords self-consuming training · model collapse · human curation · preference alignment · multi-model interactions · synthetic data · AI alignment · dynamical systems

The pith

In multi-model self-consuming loops, human curation on one model can dampen or invert alignment gains for the group and degrade long-term preferences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formalizes training where multiple models generate and consume each other's synthetic outputs instead of training in isolation. It shows that adding human curation to steer one model toward preferred behavior produces self-influence that helps that model but also cross-influence that can reduce or reverse the benefit for others. The dynamical system is analyzed for convergence points, revealing cases where the overall equilibrium ends up less aligned than without curation. This matters because real deployments increasingly mix outputs across model families, so isolated-model assumptions no longer guarantee safe steering.

Core claim

The authors characterize a system of interacting self-consuming models and prove that its fixed points depend on both self-influence and cross-influence matrices; human curation applied to one model alters the joint equilibrium in ways that can lower collective alignment even when each individual curation step is preference-improving.

What carries the argument

The interacting self-consuming dynamical system whose convergence is governed by self-influence and cross-influence terms between models.
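
One way to make the self-influence and cross-influence terms concrete is a two-model linear toy. The sketch below is an assumption made for illustration, not the paper's system: each model is reduced to a scalar alignment score, and the names `self_gain`, `cross_gain`, and `drive` are hypothetical. It iterates the loop and compares the result with the closed-form fixed point.

```python
def step(state, self_gain, cross_gain, drive):
    """One round of a toy two-model loop (illustrative, not the paper's update rule)."""
    x, y = state              # scalar alignment scores of models theta and phi
    a_x, a_y = self_gain      # self-influence: weight on a model's own previous score
    c_xy, c_yx = cross_gain   # cross-influence: c_xy is phi -> theta, c_yx is theta -> phi
    b_x, b_y = drive          # constant push from real data and curation
    return (a_x * x + c_xy * y + b_x, a_y * y + c_yx * x + b_y)


def fixed_point(self_gain, cross_gain, drive):
    """Solve s = M s + b for the 2x2 system using Cramer's rule."""
    a_x, a_y = self_gain
    c_xy, c_yx = cross_gain
    b_x, b_y = drive
    det = (1 - a_x) * (1 - a_y) - c_xy * c_yx
    if abs(det) < 1e-12:
        raise ValueError("I - M is singular; no unique fixed point")
    x = ((1 - a_y) * b_x + c_xy * b_y) / det
    y = ((1 - a_x) * b_y + c_yx * b_x) / det
    return (x, y)


def iterate(state, self_gain, cross_gain, drive, rounds=200):
    """Run the loop for a fixed number of rounds and return the final state."""
    for _ in range(rounds):
        state = step(state, self_gain, cross_gain, drive)
    return state


if __name__ == "__main__":
    gains, cross, drive = (0.5, 0.5), (0.2, -0.3), (0.3, 0.3)
    print("iterated:   ", iterate((0.0, 0.0), gains, cross, drive))
    print("closed form:", fixed_point(gains, cross, drive))
```

In this toy, iteration reaches the closed-form point only when the spectral radius of the 2x2 influence matrix is below 1; the paper's own convergence conditions are stated for its full framework and are not reproduced here.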

If this is right

  • Curation applied to one model propagates through cross-influence and can lower the stable alignment level of peer models.
  • The sign of the net effect on long-term alignment depends on the strength of cross-model coupling relative to self-influence.
  • Systems may converge to equilibria that are less aligned than the starting point despite repeated human interventions.
  • Isolated-model analyses of curation no longer bound the multi-model outcome.
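
The dependence on coupling strength can be seen in a deliberately simplified linear toy, which is an assumption made here for illustration and not the paper's system. Let $x$ and $y$ be scalar alignment scores of the curated model and its peer, with hypothetical self-influence weights $a_x, a_y \in [0, 1)$, cross-influence weights $c_{xy}, c_{yx}$, and a curation push $b_x$:

```latex
\begin{aligned}
x_{t+1} &= a_x x_t + c_{xy} y_t + b_x, \\
y_{t+1} &= a_y y_t + c_{yx} x_t + b_y, \\[4pt]
D &= (1 - a_x)(1 - a_y) - c_{xy} c_{yx}, \\[4pt]
\frac{\partial x^*}{\partial b_x} &= \frac{1 - a_y}{D}, \qquad
\frac{\partial y^*}{\partial b_x} = \frac{c_{yx}}{D}.
\end{aligned}
```

When the toy loop is stable, $D > 0$, so the peer's equilibrium score falls with stronger curation whenever $c_{yx} < 0$, and the curated model's own gain is smaller than the isolated value $1/(1 - a_x)$ whenever $c_{xy} c_{yx} < 0$. The paper's actual conditions are stated in terms of its own influence matrices and may differ from this sketch.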

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers sharing synthetic data across organizations may need joint curation policies rather than independent ones.
  • A practical test could alternate curation between two models in a shared data pool and track whether alignment oscillates or decays.
  • The same interaction structure could amplify other unintended signals such as toxicity or demographic skew if those signals cross models.

Load-bearing premise

The mathematical description of how models exchange and retrain on each other's outputs accurately reflects the dynamics of actual multi-model training pipelines.

What would settle it

Train two or more language models in a closed loop where each consumes a mixture of the others' outputs, apply human preference curation only to the first model, and measure whether its alignment score and the group average decline over successive iterations relative to an uncurated baseline.
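
The shape of such an experiment can be sketched with 1-D Gaussian stand-ins for the language models. Everything below is a hypothetical toy: the means, targets, mixture weights `cross` and `real`, the quadratic `reward`, and best-of-k `curate` are assumptions chosen for illustration, and the peer is assumed to consume the curated pool. It is not the paper's experimental setup.

```python
import random
from statistics import fmean


def reward(x, target):
    """Toy reward: negative squared distance to a preferred output."""
    return -(x - target) ** 2


def sample(mean, n, rng, std=1.0):
    """Draw n outputs from a 1-D Gaussian 'model' with the given mean."""
    return [rng.gauss(mean, std) for _ in range(n)]


def curate(samples, target, k):
    """Best-of-k curation: keep the highest-reward sample in each group of k."""
    return [
        max(samples[i:i + k], key=lambda x: reward(x, target))
        for i in range(0, len(samples) - k + 1, k)
    ]


def run_loop(rounds, curate_theta, cross=0.3, real=0.2, n=400, k=4, seed=0):
    """Closed loop of two toy models; only theta's outputs are ever curated."""
    rng = random.Random(seed)
    real_mean = {"theta": -1.0, "phi": 1.0}   # hypothetical real-data means
    target = {"theta": -2.0, "phi": 2.0}      # hypothetical preferred outputs
    mean = dict(real_mean)
    history = []
    for _ in range(rounds):
        synth = {m: sample(mean[m], n, rng) for m in ("theta", "phi")}
        if curate_theta:
            synth["theta"] = curate(synth["theta"], target["theta"], k)
        new_mean = {}
        for m, other in (("theta", "phi"), ("phi", "theta")):
            real_part = fmean(sample(real_mean[m], n, rng))
            # Refit the mean on a mixture: real data, the peer's outputs, own outputs.
            new_mean[m] = (
                real * real_part
                + cross * fmean(synth[other])
                + (1 - real - cross) * fmean(synth[m])
            )
        mean = new_mean
        # Record the reward evaluated at each model's current mean.
        history.append({m: reward(mean[m], target[m]) for m in mean})
    return history


if __name__ == "__main__":
    baseline = run_loop(30, curate_theta=False)
    curated = run_loop(30, curate_theta=True)
    for name, hist in (("uncurated", baseline), ("curated", curated)):
        print(name, {m: round(r, 3) for m, r in hist[-1].items()})
```

Comparing the final entries of the two histories gives the curated-versus-uncurated contrast the test calls for. A real study would replace the Gaussian stand-ins with trained models and a learned or human reward.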

Figures

Figures reproduced from arXiv: 2605.29267 by Xiukun Wei, Xueru Zhang, Yang Zhang.

Figure 1: (A1) and (A2) illustrate the framework of a single self-consuming model and how mixture data is generated in each round. Real data helps stabilize model updates (b) (Bertrand et al., 2024), and human curation helps improve the reward (a) (Ferbach et al., 2024). (B1) presents our multi-model interaction framework. Different models may be updated in different orders. (B2) details the one-round it…
Figure 2: When varying only the curation strength of model θ ($\lambda^{\theta}_{H}$), the θ model's update direction ($\nabla_\theta \ell_\theta$) is mapped into the reward alignment space via $S_p$ (a). The projected vector, together with $\nabla_\theta J_p$, characterizes the self-influence (b). In addition, $\nabla_\theta \ell_\theta$ is first mapped into the cross-model parameter space via $C_q$ (c), and then transformed by $S_q$ (d); combined with $\nabla_\phi J_q$, it determines the cross-influence …
Figure 3: Gaussian experiment results. (A): $\rho_p$, $\rho_q$ in Eq. (10)-(11) for various coupling scale $t$ with fixed $\lambda^{\theta}_{H} = 0.4$. (B): The theoretical explicit $\partial J_p(\theta^*) / \partial \lambda^{\theta}_{H}$, $\partial J_q(\phi^*) / \partial \lambda^{\theta}_{H}$ and RHS in Theorem 4.5 for $t \in [0.05, 1]$ with fixed $\lambda^{\theta}_{H} = 0.4$. (C)(D): The theoretical explicit $\partial J_p(\theta^*) / \partial \lambda^{\theta}_{H}$, $\partial J_q(\phi^*) / \partial \lambda^{\theta}_{H}$ and their empirical values from finite samples (of size n = 4/12/64) with 95% confidence intervals. (E): T…
Figure 4: Left: The model parameter update ratios $\Delta_{t,j}(\varphi)$, $\varphi \in \{\theta, \phi\}$, $j \in \{2, 5\}$ and model rewards $J_p(\theta_t)$, $J_q(\phi_t)$ for A4 with $\lambda^{\theta}_{H} = 0.6$ for models θ and ϕ at different iterations. Middle: Expected model rewards $J_p(\theta^*)$ and $J_q(\phi^*)$ under settings A1-A6 after convergence. Right: The proportions of cross-model synthetic data for models θ and ϕ in settings A1-A6.
human curation, we add 2,000 curated samples for mo…
Figure 5: Left: models θ and ϕ's rewards on their own evaluation datasets. Right: Rewards on cross-domain evaluation datasets.
$\rho_q$ is sufficiently large, $\partial J_q(\phi^*) / \partial \lambda^{\theta}_{H} > 0$, which explains the improvement in $J_q(\phi^*)$ observed in setting A4. Finally, when the two models are strongly coupled (A6), neither model produces high-reward samples after convergence, rendering curation ineffective; consequently, rewards remai…
Figure 6: Reward meaning values $J_p(\theta_t)$ and $J_q(\phi_t)$ on the fixed evaluation dataset at different iterations for settings A1-A6.
Additional analysis of model generated images. In…
Figure 7: Samples generated by model θ and ϕ with different curation and cross-model data proportions. (1) Top: Baseline model output results after each update for 5 iterations. (2) Middle 2 rows: The output results of model θ and ϕ after different rounds of iterative training using 50% curation data and 50% real data under setting A1 ($\lambda^{\phi}_{\theta} = 0$, $\lambda^{\theta}_{\phi} = 1$). (3) Bottom 2 rows: The output results of model θ and ϕ under…
Figure 8: Samples generated by model θ and ϕ with different curation and cross-model data proportions. (1) Top 2 rows: The output results of model θ and ϕ after different rounds of iterative training using 50% curation data and 50% real data under setting A5 ($\lambda^{\phi}_{\theta} = 0.2$, $\lambda^{\theta}_{\phi} = 0.3$). (2) Bottom 2 rows: The output results of model θ and ϕ under setting A5. Model θ uses 60% curation data and 40% real data for iterati…
Figure 9: Samples generated by model θ and ϕ with different curation and cross-model data proportions. (1) Top 2 rows: The output results of model θ and ϕ after different rounds of iterative training using 50% curation data and 50% real data under setting A6 ($\lambda^{\phi}_{\theta} = 0.7$, $\lambda^{\theta}_{\phi} = 0.8$). (2) Bottom 2 rows: The output results of model θ and ϕ under setting A6. Model θ uses 60% curation data and 40% real data for iterati…

Abstract

Foundation models are increasingly trained on synthetic data generated by prior model iterations rather than exclusively on real data. This self-consuming training paradigm can lead to model collapse, divergence, or bias amplification. Recent work (Ferbach et al., 2024) shows that incorporating human curation into the loop can steer a self-consuming model toward human-aligned behavior, but these analyses focus on a single, isolated model that solely consumes its own outputs. In practice, however, models often interact and train on input-output pairs produced by other models. This paper studies self-consuming training in the multi-model regime. We first formalize a framework for interacting self-consuming models and characterize when the resulting dynamical system converges to a stable point. We then examine how human curation of one model affects its own alignment (self-influence) and how such effects propagate to other models (cross-influence). Unlike isolated settings where human curation always enhances model alignment, we show that cross-model interactions can dampen or even invert this effect, ultimately degrading long-term alignment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper formalizes a dynamical system for multiple interacting self-consuming models, characterizes conditions for convergence to stable points, and analyzes self-influence versus cross-influence of human curation on alignment metrics. It claims that cross-model interactions can dampen or invert the alignment benefits of curation that are observed in isolated single-model settings.

Significance. If the interaction operators and alignment evolution equations accurately reflect real training pipelines, the result identifies an important interaction effect that could undermine long-term alignment in multi-model ecosystems. This extends prior single-model analyses and supplies a concrete mechanism (cross-influence) for a previously unexamined failure mode.

major comments (2)
  1. §4 (Convergence Characterization): The fixed-point stability analysis depends on the specific form chosen for the cross-model interaction operators. These operators do not incorporate standard pipeline elements such as data filtering, LoRA versus full-parameter updates, or non-stationary preference shifts; altering any of these changes the predicted inversion, making the central claim sensitive to modeling choices that are not justified against real training dynamics.
  2. §5 (Influence Propagation): The demonstration that cross-influence can invert self-curation benefits is derived from the chosen alignment-metric evolution equations. Without either empirical validation on actual multi-model runs or a sensitivity analysis to the omitted factors listed above, the inversion result remains an artifact of the abstraction rather than a demonstrated phenomenon.
minor comments (2)
  1. The abstract cites Ferbach et al. (2024) but the reference list should be verified for completeness and consistency with the in-text citation style.
  2. Notation for self-influence and cross-influence parameters should be introduced once and used uniformly to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We address each major comment below, clarifying the scope of our theoretical framework while acknowledging its limitations.

Point-by-point responses
  1. Referee: §4 (Convergence Characterization): The fixed-point stability analysis depends on the specific form chosen for the cross-model interaction operators. These operators do not incorporate standard pipeline elements such as data filtering, LoRA versus full-parameter updates, or non-stationary preference shifts; altering any of these changes the predicted inversion, making the central claim sensitive to modeling choices that are not justified against real training dynamics.

    Authors: We agree that the stability results are derived under a specific parametric form for the interaction operators, chosen to isolate the novel cross-model effects while remaining general enough to encompass both self- and cross-influences. The manuscript does not claim these operators exactly replicate every training pipeline detail; rather, it characterizes conditions under which cross-influence produces inversion even when self-influence is beneficial. We will revise §4 and the discussion to explicitly note that factors such as data filtering or LoRA could alter operator parameters and thus stability thresholds, and to frame the inversion result as a possibility under the modeled dynamics rather than a universal prediction. revision: partial

  2. Referee: §5 (Influence Propagation): The demonstration that cross-influence can invert self-curation benefits is derived from the chosen alignment-metric evolution equations. Without either empirical validation on actual multi-model runs or a sensitivity analysis to the omitted factors listed above, the inversion result remains an artifact of the abstraction rather than a demonstrated phenomenon.

    Authors: The inversion is shown mathematically from the evolution equations under the multi-model regime. As a theoretical paper, our contribution is the formal identification of cross-influence as a mechanism that can dampen or reverse single-model curation benefits; we do not present it as an empirically validated phenomenon in deployed systems. We will add a sensitivity analysis in §5 (varying operator coefficients and evolution parameters within plausible ranges) to demonstrate that the inversion persists across a neighborhood of the chosen equations, and we will expand the limitations paragraph to state that full empirical confirmation on real multi-model pipelines lies beyond the current scope. revision: partial
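
A sensitivity sweep of the kind discussed here can be sketched on a two-model linear toy. The code below is an illustration under stated assumptions, not the authors' planned analysis: the scalar update `x' = a_x*x + c_xy*y + b_x`, `y' = a_y*y + c_yx*x + b_y`, the parameter names, and the grid values are all hypothetical. It varies the θ-to-ϕ coupling `c_yx` and reports how the equilibrium of each model responds to the curation push `b_x`.

```python
import cmath


def spectral_radius(a_x, a_y, c_xy, c_yx):
    """Largest eigenvalue modulus of M = [[a_x, c_xy], [c_yx, a_y]]."""
    trace = a_x + a_y
    det = a_x * a_y - c_xy * c_yx
    root = cmath.sqrt(trace * trace - 4 * det)
    return max(abs((trace + root) / 2), abs((trace - root) / 2))


def sensitivities(a_x, a_y, c_xy, c_yx):
    """Derivatives of the fixed point (x*, y*) with respect to the curation push b_x."""
    d = (1 - a_x) * (1 - a_y) - c_xy * c_yx
    return (1 - a_y) / d, c_yx / d


def sweep(a_x=0.5, a_y=0.5, c_xy=0.2, grid=(-0.4, -0.2, 0.0, 0.2, 0.4)):
    """Vary the theta -> phi coupling and record stability and both sensitivities."""
    rows = []
    for c_yx in grid:
        rho = spectral_radius(a_x, a_y, c_xy, c_yx)
        if rho >= 1:
            rows.append((c_yx, rho, None, None))  # unstable: no equilibrium to differentiate
            continue
        self_s, cross_s = sensitivities(a_x, a_y, c_xy, c_yx)
        rows.append((c_yx, rho, self_s, cross_s))
    return rows


if __name__ == "__main__":
    for c_yx, rho, self_s, cross_s in sweep():
        print(f"c_yx={c_yx:+.1f}  rho={rho:.3f}  dx*/db_x={self_s}  dy*/db_x={cross_s}")
```

In this toy the cross sensitivity takes the sign of `c_yx`, so a sweep that crosses zero shows where the peer's equilibrium starts to fall as curation strengthens. Whether the same pattern holds for the paper's operators would need the authors' own analysis.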

Circularity Check

0 steps flagged

No significant circularity; framework and claims are self-contained

Full rationale

The paper defines a new dynamical system for multi-model self-consuming training, then derives convergence properties and the self/cross-influence effects of human curation directly from the interaction operators and alignment metric evolution in that system. No equations or results are shown to reduce to fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations. The cited single-model result (Ferbach et al. 2024) is treated as external background rather than the source of the multi-model inversion claim. The derivation chain therefore stands on the explicit assumptions of the formalized framework rather than collapsing to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; full manuscript required to audit.


Reference graph

Works this paper leans on

11 extracted references · 4 canonical work pages

  1. [1]

    Bianchi, F., Suzgun, M., Attanasio, G., Röttger, P., Jurafsky, D., Hashimoto, T., and Zou, J. Y.

    URL https://openreview.net/forum?id=JORAfH2xFd. Bianchi, F., Suzgun, M., Attanasio, G., Röttger, P., Jurafsky, D., Hashimoto, T., and Zou, J. Y. Safety-tuned llamas: Lessons from improving the safety of large language models that follow instructions. In International Conference on Learning Representations, volume 2024, pp. 34196–34216, 2024. Bradley, ...

  2. [2]

    Feng, Y., Dohmatob, E., Yang, P., Charton, F., and Kempe, J.

    URL https://openreview.net/forum?id=et5l9qPUhm. Feng, Y., Dohmatob, E., Yang, P., Charton, F., and Kempe, J. Beyond model collapse: Scaling up with synthesized data requires verification. In The Thirteenth International Conference on Learning Representations,

  3. [3]

    Ferbach, D., Bertrand, Q., Bose, A. J., and Gidel, G.

    URL https://openreview.net/forum?id=MQXrTMonT1. Ferbach, D., Bertrand, Q., Bose, A. J., and Gidel, G. Self-consuming generative models with curated data provably optimize human preferences. In Advances in Neural Information Processing Systems, volume 37, pp. 102531–102567. Curran Associates, Inc., 2024. Fu, S., Zhang, S., Wang, Y., Tian, X., and Tao, D...

  4. [4]

    Gao, W. and Li, M.

    URL https://openreview.net/forum?id=WttfQGwpES. Gao, W. and Li, M. Convergence dynamics and stabilization strategies of co-evolving generative models, 2025. URL https://arxiv.org/abs/2503.08117. Gerstgrasser, M., Schaeffer, R., Dey, A., Rafailov, R., Kor-...

  5. [5]

    Hardt, M., Recht, B., and Singer, Y.

    URL https://openreview.net/forum?id=5B2K4LRgmz. Hardt, M., Recht, B., and Singer, Y. Train faster, generalize better: Stability of stochastic gradient descent. In International conference on machine learning, pp. 1225–1234. PMLR, 2016. Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in neural information processin...

  6. [6]

    https://aclanthology.org/2023.findings-emnlp.350/

    URL https://aclanthology.org/2023.findings-emnlp.350/. Reuters. Reuters and AI. https://www.reuters.com/info-pages/reuters-and-ai, 2024. October 30, 2024. Accessed: 2026-05-16. Ronneberger, O., Fischer, P., and Brox, T. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and compu...

  7. [7]

    Zhang et al. Self-Consuming Performative Loop. arXiv:2601.05184, 2025

    May 8, 2024. Accessed: 2026-05-16. von Platen, P., Patil, S., Lozhkov, A., Cuenca, P., Lambert, N., Rasul, K., Davaadorj, M., Nair, D., Paul, S., Berman, W., Xu, Y., Liu, S., and Wolf, T. Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers, 2022. Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D.,...

  8. [8]

    Wyllie, S., Shumailov, I., and Papernot, N.

    URL https://openreview.net/forum?id=UWWNxyIT1h. Wyllie, S., Shumailov, I., and Papernot, N. Fairness feedback loops: training on synthetic data amplifies bias. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, pp. 2113–2147, 2024. Xie, T. and Zhang, X. Automating data annotation under strategic human agents: Risks...

  9. [9]

    autophagous

    URL https://openreview.net/forum?id=Ut_vApkulkk. A. Related work. Self-consuming Training on Synthetic Data. A rapidly growing body of work studies iterative retraining on model-generated data from theoretical or empirical perspectives. The self-consuming tra...

  10. [10]

    performative prediction

    emphases the need for harm-aware fairness evaluation in vision and multimodal models. Moreover, when systems amplify bias, Taori & Hashimoto (2023) formalize data feedback loops in conditional prediction and connect stability to calibration-like properties. Furthermore, extending to multi-model self-consuming systems, Gao & Li (2025) analyze the co-evolvi...

  11. [11]

    For model θ, we design it to tend towards summarizing the input text and adopt XSum (Narayan et al., 2018) which pairs long news articles with a short single-sentence summary as Rθ

    with rank r = 16, α = 32, dropout = 0.05. For model θ, we design it to tend towards summarizing the input text and adopt XSum (Narayan et al., 2018) which pairs long news articles with a short single-sentence summary as Rθ. Assume model ϕ prefers to paraphrase the input text, and Rϕ is the paraphrasing data from CoEdIT (Raheja et al., 2023), where the text data...