pith. machine review for the scientific record.

arxiv: 2605.12419 · v1 · submitted 2026-05-12 · 💻 cs.CL · cs.IR · cs.LG

Recognition: 2 theorem links · Lean Theorem

ORBIT: Preserving Foundational Language Capabilities in GenRetrieval via Origin-Regulated Merging

Alicia Tsai, Ed Chi, Lichan Hong, Li Wei, Lukasz Heldt, Naijing Zhang, Neha Verma, Nikhil Mehta, Shao-Chuan Wang, Xinyang Yi

Pith reviewed 2026-05-13 05:03 UTC · model grok-4.3

classification 💻 cs.CL · cs.IR · cs.LG
keywords genretrieval · model · distance · fine-tuning · orbit · averaging · during · fine-tuned

The pith

ORBIT preserves foundational language capabilities during generative retrieval fine-tuning by using origin-regulated weight averaging to constrain parameter drift beyond a distance threshold.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models are often fine-tuned on specific tasks like generative retrieval, where the model learns to directly generate relevant documents or answers. This fine-tuning frequently causes catastrophic forgetting of the model's original general language understanding and reasoning abilities. The authors observe that this forgetting happens rapidly and is tied to the growing distance between the fine-tuned model's parameters and those of the starting model. ORBIT addresses this by continuously monitoring that distance during training. When the distance exceeds a chosen maximum threshold, the method averages the current model weights with the original weights to pull the model back and reduce drift. Experiments indicate that this approach maintains strong performance on both general text tasks and the retrieval task, outperforming standard continual learning baselines as well as other regularization techniques that use weight averaging without the origin-distance trigger. The technique requires no changes to model architecture and can be applied during the fine-tuning process itself.
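A minimal sketch of what such a drift-triggered averaging step could look like inside an ordinary fine-tuning loop (PyTorch-style; the Euclidean distance, the mixing coefficient alpha, the threshold epsilon, and the check cadence are illustrative assumptions, not details confirmed by the paper):

```python
import torch

@torch.no_grad()
def inter_model_distance(model, init_params):
    # One plausible reading of "inter-model distance": global L2 norm of the parameter difference.
    sq_sum = 0.0
    for name, p in model.named_parameters():
        sq_sum += (p - init_params[name]).pow(2).sum().item()
    return sq_sum ** 0.5

@torch.no_grad()
def pull_back_to_origin(model, init_params, alpha=0.5):
    # Weight averaging toward the starting point: theta <- alpha * theta + (1 - alpha) * theta_init.
    for name, p in model.named_parameters():
        p.mul_(alpha).add_(init_params[name], alpha=1.0 - alpha)

def finetune_with_drift_control(model, loader, optimizer, loss_fn,
                                epsilon=10.0, alpha=0.5, check_every=100):
    # Snapshot the pretrained weights once, before any fine-tuning step.
    init_params = {n: p.detach().clone() for n, p in model.named_parameters()}
    for step, (inputs, targets) in enumerate(loader):
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
        # Periodically check drift; if it exceeds the threshold, average back toward the origin.
        if step % check_every == 0 and inter_model_distance(model, init_params) > epsilon:
            pull_back_to_origin(model, init_params, alpha)
```

The only state this adds to standard fine-tuning is a frozen copy of the initial weights plus two scalars (the threshold and the mixing coefficient), which is consistent with the claim that no architectural changes are required.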

Core claim

Our results show that ORBIT retains substantial text and retrieval performance by outperforming both common continual learning baselines and related regularization methods that also employ weight averaging.

Load-bearing premise

That actively constraining model drift via weight averaging triggered by inter-model distance exceeding a threshold will preserve general language capabilities without substantially harming the fine-tuned generative retrieval performance.
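Stated as an update rule, the premise amounts to roughly the following (a schematic formalization; the choice of norm, the mixing coefficient λ, and the per-step check are assumptions the abstract does not pin down):

```latex
\theta_{t} \;\leftarrow\;
\begin{cases}
\lambda\,\theta_{t} + (1-\lambda)\,\theta_{\mathrm{init}}, & \text{if } \lVert \theta_{t} - \theta_{\mathrm{init}} \rVert > \epsilon,\\
\theta_{t}, & \text{otherwise,}
\end{cases}
```

so fine-tuning moves freely until the drift budget ϵ is exhausted, after which averaging contracts the weights back toward θ_init while the task gradients continue to push them outward.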

Figures

Figures reproduced from arXiv: 2605.12419 by Alicia Tsai, Ed Chi, Lichan Hong, Li Wei, Lukasz Heldt, Naijing Zhang, Neha Verma, Nikhil Mehta, Shao-Chuan Wang, Xinyang Yi.

Figure 1. An overview of the ORBIT method. During fine-tuning on the downstream task, the inter-model distance is tracked; when this distance exceeds a threshold ϵ, weight averaging is used as a regularization step to reduce the forgetting of parametric knowledge from θ_init.
Figure 2. Quantitative analysis measuring forgetting during GenRetrieval fine-tuning.
Figure 3. Average text accuracy and Recall@5 performance across post-hoc, one-round weight interpolations between GenRetrieval weights and pretrained LLM weights (Wortsman et al., 2022b; Frankle et al., 2020); one-round merging fails to generalize.
Figure 4. A scatter plot demonstrating the correlation between sign dissimilarity (SD) and average text performance. Points are collected from a Soup-to-Go experiment with a cadence of 1000 steps.
Figure 6. Text and recall performance for ORBIT models, compared to a Soup-to-Go baseline and L2-decay baselines on the Sports and Outdoors dataset (validation) and our 8 text benchmarks. Only Pareto-optimal checkpoints generated within each experiment are displayed; all ORBIT checkpoints outperform those generated from Soup-to-Go training.
read the original abstract

Despite the rapid advancements in large language model (LLM) development, fine-tuning them for specific tasks often results in the catastrophic forgetting of their general, language-based reasoning abilities. This work investigates and addresses this challenge in the context of the Generative Retrieval (GenRetrieval) task. During GenRetrieval fine-tuning, we find this forgetting occurs rapidly and correlates with the distance between the fine-tuned and original model parameters. Given these observations, we propose ORBIT, a novel approach that actively tracks the distance between fine-tuned and initial model weights, and uses a weight averaging strategy to constrain model drift during GenRetrieval fine-tuning when this inter-model distance exceeds a maximum threshold. Our results show that ORBIT retains substantial text and retrieval performance by outperforming both common continual learning baselines and related regularization methods that also employ weight averaging.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper observes that fine-tuning LLMs for Generative Retrieval leads to rapid catastrophic forgetting of general language capabilities, with forgetting correlating to the parameter distance from the original model. It proposes ORBIT, which monitors this inter-model distance and applies weight averaging to pull the fine-tuned model back toward the origin whenever the distance exceeds a tunable threshold, thereby constraining drift during continued training.

Significance. If the empirical results hold, ORBIT offers a lightweight, distance-triggered regularization strategy that preserves foundational capabilities better than standard continual-learning baselines and other weight-averaging regularizers while retaining GenRetrieval performance. The approach is grounded in an observed correlation rather than an ad-hoc assumption, and the dual evaluation on language and retrieval metrics strengthens the practical claim.

minor comments (3)
  1. The abstract states performance claims without any quantitative numbers, error bars, or dataset details; moving a concise summary of the key metrics (e.g., the reported gains on language and retrieval benchmarks) into the abstract would improve readability.
  2. The description of the threshold as a 'maximum inter-model distance' leaves the exact distance metric (Euclidean, cosine, etc.) and its normalization unspecified in the high-level overview; a brief clarification in the method section would remove ambiguity (two plausible readings are sketched after this list).
  3. No ablation on the sensitivity of the threshold hyper-parameter is mentioned; adding a short sensitivity plot or table would help readers assess robustness without altering the central claim.
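
To make the second comment concrete, here is a small sketch of two plausible readings of 'inter-model distance' over flattened parameter vectors; which metric and normalization the paper actually uses is not stated in the abstract, so both functions below are assumptions for illustration:

```python
import torch

def flatten(params):
    # Concatenate all parameter tensors into a single 1-D vector.
    return torch.cat([p.detach().reshape(-1) for p in params])

def euclidean_distance(params, init_params):
    # Unnormalized L2 distance; grows with model size unless divided by sqrt(num_params).
    return torch.linalg.vector_norm(flatten(params) - flatten(init_params)).item()

def cosine_distance(params, init_params):
    # Scale-invariant alternative: 1 minus cosine similarity of the two weight vectors.
    a, b = flatten(params), flatten(init_params)
    return 1.0 - torch.nn.functional.cosine_similarity(a, b, dim=0).item()
```

Whether the threshold ϵ is set on a raw or a normalized distance also affects how it transfers across model sizes, which is part of why the report asks for the clarification.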

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of our work on ORBIT and for recommending minor revision. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity; empirical heuristic with independent validation

full rationale

The manuscript describes an observation (correlation between parameter distance and forgetting) that motivates a practical threshold-based weight-averaging rule. No equations, derivations, or first-principles claims appear; the method is presented as a tunable regularization heuristic whose performance is assessed on separate language and retrieval benchmarks against external baselines. No self-citation chain, fitted-input-as-prediction, or ansatz smuggling is present. The central claim therefore remains an empirical result rather than a reduction to its own inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on an empirical observation about parameter distance and forgetting plus a tunable threshold for averaging; these are not derived from first principles.

free parameters (1)
  • maximum inter-model distance threshold
    The value that triggers weight averaging is a hyperparameter chosen during development and not derived from the data or theory in the abstract.
axioms (1)
  • domain assumption: Forgetting of foundational language capabilities correlates with the distance between fine-tuned and original model parameters
    Presented as an observation from the authors' experiments that underpins the decision to regulate drift.

pith-pipeline@v0.9.0 · 5471 in / 1198 out tokens · 74346 ms · 2026-05-13T05:03:17.836814+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 1 internal anchor

  1. [1]

    Mitigating catastrophic forgetting in language transfer via model merging

    Alexandrov, A., Raychev, V., Müller, M. N., Zhang, C., Vechev, M., and Toutanova, K. Mitigating catastrophic forgetting in language transfer via model merging. arXiv preprint arXiv:2407.08699, 2024.

  2. [2]

    DAM: Dynamic adapter merging for continual video QA learning

    Cheng, F., Wang, Z., Sung, Y.-L., Lin, Y.-B., Bansal, M., and Bertasius, G. DAM: Dynamic adapter merging for continual video QA learning. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 6805–6817. IEEE, 2025.

  3. [3]

    How to merge your multimodal models over time?

    Dziadzio, S., Udandarao, V., Roth, K., Prabhu, A., Akata, Z., Albanie, S., and Bethge, M. How to merge your multimodal models over time? In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 20479–20491, 2025.

  4. [4]

    Linear mode connectivity and the lottery ticket hypothesis

    Frankle, J., Dziugaite, G. K., Roy, D., and Carbin, M. Linear mode connectivity and the lottery ticket hypothesis. In International Conference on Machine Learning, pp. 3259–3269. PMLR, 2020.

  5. [5]

    Gemma 3 Technical Report

    Gemma Team, Kamath, A., Ferret, J., Pathak, S., Vieillard, N., Merhej, R., Perrin, S., Matejovicova, T., Ramé, A., Rivière, M., et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025.

  6. [6]

    Knowledge is a region in weight space for fine-tuned language models

    Gueta, A., Venezian, E., Raffel, C., Slonim, N., Katz, Y., and Choshen, L. Knowledge is a region in weight space for fine-tuned language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 1350–1370, 2023.

  7. [7]

    Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering

    He, R. and McAuley, J. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, pp. 507–517, 2016.

  8. [8]

    LoRA: Low-rank adaptation of large language models

    Hu, E. J., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.

  9. [9]

    Overcoming catastrophic forgetting in neural networks

    Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13): 3521–3526, 2017.

  10. [10]

    Soup to go: Mitigating forgetting during continual learning with model averaging

    Kleiman, A., Dziugaite, G. K., Frankle, J., Kakade, S., and Paul, M. Soup to go: Mitigating forgetting during continual learning with model averaging. arXiv preprint arXiv:2501.05559, 2025.

  11. [11]

    MagMax: Leveraging model merging for seamless continual learning

    Marczak, D., Twardowski, B., Trzciński, T., and Cygert, S. MagMax: Leveraging model merging for seamless continual learning. In European Conference on Computer Vision, pp. 379–395. Springer, 2024.

  12. [12]

    Weighted ensemble models are strong continual learners

    Marouf, I. E., Roy, S., Tartaglione, E., and Lathuilière, S. Weighted ensemble models are strong continual learners. In European Conference on Computer Vision, pp. 306–324. Springer, 2024.

  13. [13]

    Sentence-T5: Scalable sentence encoders from pre-trained text-to-text models

    Ni, J., Abrego, G. H., Constant, N., Ma, J., Hall, K., Cer, D., and Yang, Y. Sentence-T5: Scalable sentence encoders from pre-trained text-to-text models. In Findings of the Association for Computational Linguistics: ACL 2022, pp. 1864–1874, 2022.

  14. [14]

    Recommender systems with generative retrieval

    Rajput, S., Mehta, N., Singh, A., Hulikal Keshavan, R., Vu, T., Heldt, L., Hong, L., Tay, Y., Tran, V., Samost, J., Kula, M., Chi, E., and Sathiamoorthy, M. Recommender systems with generative retrieval. In Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 1…

  15. [15]

    Early weight averaging meets high learning rates for LLM pre-training

    Sanyal, S., Neerkaje, A., Kaddour, J., Kumar, A., and Sanghavi, S. Early weight averaging meets high learning rates for LLM pre-training. arXiv preprint arXiv:2306.03241, 2023.

  16. [16]

    Continual learning in vision-language models via aligned model merging

    Sokar, G., Dziugaite, G. K., Arnab, A., Iscen, A., Castro, P. S., and Schmid, C. Continual learning in vision-language models via aligned model merging. arXiv preprint arXiv:2506.03189, 2025.

  17. [17]

    Learning to tokenize for generative retrieval

    Sun, W., Yan, L., Chen, Z., Wang, S., Zhu, H., Ren, P., Chen, Z., Yin, D., Rijke, M., and Ren, Z. Learning to tokenize for generative retrieval. Advances in Neural Information Processing Systems, 36: 46345–46361, 2023.

  18. [18]

    An empirical study of multimodal model merging

    Sung, Y.-L., Li, L., Lin, K., Gan, Z., Bansal, M., and Wang, L. An empirical study of multimodal model merging. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 1563–1575, 2023.

  19. [19]

    Transformer memory as a differentiable search index

    Tay, Y., Tran, V., Dehghani, M., Ni, J., Bahri, D., Mehta, H., Qin, Z., Hui, K., Zhao, Z., Gupta, J., et al. Transformer memory as a differentiable search index. Advances in Neural Information Processing Systems, 35: 21831–21843, 2022.

  20. [20]

    LiNeS: Post-training layer scaling prevents forgetting and enhances model merging

    Wang, K., Dimitriadis, N., Favero, A., Ortiz-Jimenez, G., Fleuret, F., and Frossard, P. LiNeS: Post-training layer scaling prevents forgetting and enhances model merging. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=J5sUOvlLbQ

  21. [21]

    Model soups: Averaging weights of multiple fine-tuned models improves accuracy without increasing inference time

    Wortsman, M., Ilharco, G., Gadre, S. Y., Roelofs, R., Gontijo-Lopes, R., Morcos, A. S., Namkoong, H., Farhadi, A., Carmon, Y., Kornblith, S., et al. Model soups: Averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International Conference on Machine Learning, pp. 23965–23998. PMLR, 2022a.

  22. [22]

    Robust fine-tuning of zero-shot models

    Wortsman, M., Ilharco, G., Kim, J. W., Li, M., Kornblith, S., Roelofs, R., Lopes, R. G., Hajishirzi, H., Farhadi, A., Namkoong, H., et al. Robust fine-tuning of zero-shot models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7959–7971, 2022b.

  23. [23]

    TIES-Merging: Resolving interference when merging models

    Yadav, P., Tam, D., Choshen, L., Raffel, C. A., and Bansal, M. TIES-Merging: Resolving interference when merging models. Advances in Neural Information Processing Systems, 36: 7093–7115, 2023.

  24. [24]

    Model merging in LLMs, MLLMs, and beyond: Methods, theories, applications and opportunities

    Yang, E., Shen, L., Guo, G., Wang, X., Cao, X., Zhang, J., and Tao, D. Model merging in LLMs, MLLMs, and beyond: Methods, theories, applications and opportunities. arXiv preprint arXiv:2408.07666, 2024.

  25. [25]

    SoundStream: An end-to-end neural audio codec

    Zeghidour, N., Luebs, A., Omran, A., Skoglund, J., and Tagliasacchi, M. SoundStream: An end-to-end neural audio codec. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30: 495–507, 2021.