pith. machine review for the scientific record.

arxiv: 2605.12419 · v1 · submitted 2026-05-12 · 💻 cs.CL · cs.IR · cs.LG

Recognition: 2 theorem links · Lean Theorem

ORBIT: Preserving Foundational Language Capabilities in GenRetrieval via Origin-Regulated Merging

Alicia Tsai, Ed Chi, Lichan Hong, Li Wei, Lukasz Heldt, Naijing Zhang, Neha Verma, Nikhil Mehta, Shao-Chuan Wang, Xinyang Yi

Pith reviewed 2026-05-13 05:03 UTC · model grok-4.3

classification 💻 cs.CL · cs.IR · cs.LG
keywords genretrieval · model · distance · fine-tuning · orbit · averaging · during · fine-tuned

The pith

ORBIT preserves foundational language capabilities during generative retrieval fine-tuning by using origin-regulated weight averaging to constrain parameter drift beyond a distance threshold.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models are often fine-tuned on specific tasks like generative retrieval, where the model learns to directly generate relevant documents or answers. This fine-tuning frequently causes catastrophic forgetting of the model's original general language understanding and reasoning abilities. The authors observe that this forgetting happens rapidly and is tied to the growing distance between the fine-tuned model's parameters and those of the starting model. ORBIT addresses this by continuously monitoring that distance during training. When the distance exceeds a chosen maximum threshold, the method averages the current model weights with the original weights to pull the model back and reduce drift. Experiments indicate that this approach maintains strong performance on both general text tasks and the retrieval task, outperforming standard continual learning baselines as well as other regularization techniques that use weight averaging without the origin-distance trigger. The technique requires no changes to model architecture and can be applied during the fine-tuning process itself.
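A minimal sketch of what such a drift-triggered averaging step could look like inside an ordinary fine-tuning loop (PyTorch-style; the Euclidean distance, the mixing coefficient alpha, the threshold epsilon, and the check cadence are illustrative assumptions, not details confirmed by the paper):

```python
import torch

@torch.no_grad()
def inter_model_distance(model, init_params):
    # One plausible reading of "inter-model distance": global L2 norm of the parameter difference.
    sq_sum = 0.0
    for name, p in model.named_parameters():
        sq_sum += (p - init_params[name]).pow(2).sum().item()
    return sq_sum ** 0.5

@torch.no_grad()
def pull_back_to_origin(model, init_params, alpha=0.5):
    # Weight averaging toward the starting point: theta <- alpha * theta + (1 - alpha) * theta_init.
    for name, p in model.named_parameters():
        p.mul_(alpha).add_(init_params[name], alpha=1.0 - alpha)

def finetune_with_drift_control(model, loader, optimizer, loss_fn,
                                epsilon=10.0, alpha=0.5, check_every=100):
    # Snapshot the pretrained weights once, before any fine-tuning step.
    init_params = {n: p.detach().clone() for n, p in model.named_parameters()}
    for step, (inputs, targets) in enumerate(loader):
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
        # Periodically check drift; if it exceeds the threshold, average back toward the origin.
        if step % check_every == 0 and inter_model_distance(model, init_params) > epsilon:
            pull_back_to_origin(model, init_params, alpha)
```

The only state this adds to standard fine-tuning is a frozen copy of the initial weights plus two scalars (the threshold and the mixing coefficient), which is consistent with the claim that no architectural changes are required.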

Core claim

Our results show that ORBIT retains substantial text and retrieval performance by outperforming both common continual learning baselines and related regularization methods that also employ weight averaging.

Load-bearing premise

That actively constraining model drift via weight averaging triggered by inter-model distance exceeding a threshold will preserve general language capabilities without substantially harming the fine-tuned generative retrieval performance.
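Stated as an update rule, the premise amounts to roughly the following (a schematic formalization; the choice of norm, the mixing coefficient λ, and the per-step check are assumptions the abstract does not pin down):

```latex
\theta_{t} \;\leftarrow\;
\begin{cases}
\lambda\,\theta_{t} + (1-\lambda)\,\theta_{\mathrm{init}}, & \text{if } \lVert \theta_{t} - \theta_{\mathrm{init}} \rVert > \epsilon,\\
\theta_{t}, & \text{otherwise,}
\end{cases}
```

so fine-tuning moves freely until the drift budget ϵ is exhausted, after which averaging contracts the weights back toward θ_init while the task gradients continue to push them outward.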

Figures

Figures reproduced from arXiv: 2605.12419 by Alicia Tsai, Ed Chi, Lichan Hong, Li Wei, Lukasz Heldt, Naijing Zhang, Neha Verma, Nikhil Mehta, Shao-Chuan Wang, Xinyang Yi.

Figure 1. An overview of the ORBIT method. During fine-tuning on the downstream task, the inter-model distance is tracked; when this distance exceeds a threshold ϵ, weight averaging is used as a regularization step to reduce the forgetting of parametric knowledge from θ_init.
Figure 2. Quantitative analysis measuring forgetting during GenRetrieval fine-tuning.
Figure 3. Average text accuracy and Recall@5 performance across post-hoc, one-round weight interpolations between GenRetrieval weights and pretrained LLM weights (Wortsman et al., 2022b; Frankle et al., 2020); one-round merging fails to generalize.
Figure 4. A scatter plot demonstrating the correlation between sign dissimilarity (SD) and average text performance. Points are collected from a Soup-to-Go experiment with a cadence of 1000 steps.
Figure 6. Text and recall performance for ORBIT models, compared to a Soup-to-Go baseline and L2-decay baselines on the Sports and Outdoors dataset (validation) and our 8 text benchmarks. Only Pareto-optimal checkpoints generated within each experiment are displayed; all ORBIT checkpoints outperform those generated from Soup-to-Go training.
read the original abstract

Despite the rapid advancements in large language model (LLM) development, fine-tuning them for specific tasks often results in the catastrophic forgetting of their general, language-based reasoning abilities. This work investigates and addresses this challenge in the context of the Generative Retrieval (GenRetrieval) task. During GenRetrieval fine-tuning, we find this forgetting occurs rapidly and correlates with the distance between the fine-tuned and original model parameters. Given these observations, we propose ORBIT, a novel approach that actively tracks the distance between fine-tuned and initial model weights, and uses a weight averaging strategy to constrain model drift during GenRetrieval fine-tuning when this inter-model distance exceeds a maximum threshold. Our results show that ORBIT retains substantial text and retrieval performance by outperforming both common continual learning baselines and related regularization methods that also employ weight averaging.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper observes that fine-tuning LLMs for Generative Retrieval leads to rapid catastrophic forgetting of general language capabilities, with forgetting correlating to the parameter distance from the original model. It proposes ORBIT, which monitors this inter-model distance and applies weight averaging to pull the fine-tuned model back toward the origin whenever the distance exceeds a tunable threshold, thereby constraining drift during continued training.

Significance. If the empirical results hold, ORBIT offers a lightweight, distance-triggered regularization strategy that preserves foundational capabilities better than standard continual-learning baselines and other weight-averaging regularizers while retaining GenRetrieval performance. The approach is grounded in an observed correlation rather than an ad-hoc assumption, and the dual evaluation on language and retrieval metrics strengthens the practical claim.

minor comments (3)
  1. The abstract states performance claims without any quantitative numbers, error bars, or dataset details; moving a concise summary of the key metrics (e.g., the reported gains on language and retrieval benchmarks) into the abstract would improve readability.
  2. The description of the threshold as a 'maximum inter-model distance' leaves the exact distance metric (Euclidean, cosine, etc.) and its normalization unspecified in the high-level overview; a brief clarification in the method section would remove ambiguity (two plausible readings are sketched after this list).
  3. No ablation on the sensitivity of the threshold hyper-parameter is mentioned; adding a short sensitivity plot or table would help readers assess robustness without altering the central claim.
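
To make the second comment concrete, here is a small sketch of two plausible readings of 'inter-model distance' over flattened parameter vectors; which metric and normalization the paper actually uses is not stated in the abstract, so both functions below are assumptions for illustration:

```python
import torch

def flatten(params):
    # Concatenate all parameter tensors into a single 1-D vector.
    return torch.cat([p.detach().reshape(-1) for p in params])

def euclidean_distance(params, init_params):
    # Unnormalized L2 distance; grows with model size unless divided by sqrt(num_params).
    return torch.linalg.vector_norm(flatten(params) - flatten(init_params)).item()

def cosine_distance(params, init_params):
    # Scale-invariant alternative: 1 minus cosine similarity of the two weight vectors.
    a, b = flatten(params), flatten(init_params)
    return 1.0 - torch.nn.functional.cosine_similarity(a, b, dim=0).item()
```

Whether the threshold ϵ is set on a raw or a normalized distance also affects how it transfers across model sizes, which is part of why the report asks for the clarification.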

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of our work on ORBIT and for recommending minor revision. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity; empirical heuristic with independent validation

full rationale

The manuscript describes an observation (correlation between parameter distance and forgetting) that motivates a practical threshold-based weight-averaging rule. No equations, derivations, or first-principles claims appear; the method is presented as a tunable regularization heuristic whose performance is assessed on separate language and retrieval benchmarks against external baselines. No self-citation chain, fitted-input-as-prediction, or ansatz smuggling is present. The central claim therefore remains an empirical result rather than a reduction to its own inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on an empirical observation about parameter distance and forgetting plus a tunable threshold for averaging; these are not derived from first principles.

free parameters (1)
  • maximum inter-model distance threshold
    The value that triggers weight averaging is a hyperparameter chosen during development and not derived from the data or theory in the abstract.
axioms (1)
  • domain assumption: Forgetting of foundational language capabilities correlates with the distance between fine-tuned and original model parameters
    Presented as an observation from the authors' experiments that underpins the decision to regulate drift.

pith-pipeline@v0.9.0 · 5471 in / 1198 out tokens · 74346 ms · 2026-05-13T05:03:17.836814+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 1 internal anchor

  1. [1]

    Mitigating catastrophic forgetting in language transfer via model merging

    Alexandrov, A., Raychev, V., Müller, M. N., Zhang, C., Vechev, M., and Toutanova, K. Mitigating catastrophic forgetting in language transfer via model merging. arXiv preprint arXiv:2407.08699, 2024.

  2. [2]

    DAM: Dynamic adapter merging for continual video QA learning

    Cheng, F., Wang, Z., Sung, Y.-L., Lin, Y.-B., Bansal, M., and Bertasius, G. DAM: Dynamic adapter merging for continual video QA learning. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 6805–6817. IEEE, 2025.

  3. [3]

    How to merge your multimodal models over time?

    Dziadzio, S., Udandarao, V., Roth, K., Prabhu, A., Akata, Z., Albanie, S., and Bethge, M. How to merge your multimodal models over time? In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 20479–20491, 2025.

  4. [4]

    Linear mode connectivity and the lottery ticket hypothesis

    Frankle, J., Dziugaite, G. K., Roy, D., and Carbin, M. Linear mode connectivity and the lottery ticket hypothesis. In International Conference on Machine Learning, pp. 3259–3269. PMLR, 2020.

  5. [5]

    Gemma 3 Technical Report

    Gemma Team, Kamath, A., Ferret, J., Pathak, S., Vieillard, N., Merhej, R., Perrin, S., Matejovicova, T., Ramé, A., Rivière, M., et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025.

  6. [6]

    Knowledge is a region in weight space for fine-tuned language models

    Gueta, A., Venezian, E., Raffel, C., Slonim, N., Katz, Y., and Choshen, L. Knowledge is a region in weight space for fine-tuned language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 1350–1370, 2023.

  7. [7]

    Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering

    He, R. and McAuley, J. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, pp. 507–517, 2016.

  8. [8]

    LoRA: Low-rank adaptation of large language models

    Hu, E. J., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.

  9. [9]

    Overcoming catastrophic forgetting in neural networks

    Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13): 3521–3526, 2017.

  10. [10]

    Soup to go: Mitigating forgetting during continual learning with model averaging

    Kleiman, A., Dziugaite, G. K., Frankle, J., Kakade, S., and Paul, M. Soup to go: Mitigating forgetting during continual learning with model averaging. arXiv preprint arXiv:2501.05559, 2025.

  11. [11]

    MagMax: Leveraging model merging for seamless continual learning

    Marczak, D., Twardowski, B., Trzciński, T., and Cygert, S. MagMax: Leveraging model merging for seamless continual learning. In European Conference on Computer Vision, pp. 379–395. Springer, 2024.

  12. [12]

    Weighted ensemble models are strong continual learners

    Marouf, I. E., Roy, S., Tartaglione, E., and Lathuilière, S. Weighted ensemble models are strong continual learners. In European Conference on Computer Vision, pp. 306–324. Springer, 2024.

  13. [13]

    Sentence-T5: Scalable sentence encoders from pre-trained text-to-text models

    Ni, J., Abrego, G. H., Constant, N., Ma, J., Hall, K., Cer, D., and Yang, Y. Sentence-T5: Scalable sentence encoders from pre-trained text-to-text models. In Findings of the Association for Computational Linguistics: ACL 2022, pp. 1864–1874, 2022.

  14. [14]

    Recommender systems with generative retrieval

    Rajput, S., Mehta, N., Singh, A., Hulikal Keshavan, R., Vu, T., Heldt, L., Hong, L., Tay, Y., Tran, V., Samost, J., Kula, M., Chi, E., and Sathiamoorthy, M. Recommender systems with generative retrieval. In Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 1…

  15. [15]

    Early weight averaging meets high learning rates for LLM pre-training

    Sanyal, S., Neerkaje, A., Kaddour, J., Kumar, A., and Sanghavi, S. Early weight averaging meets high learning rates for LLM pre-training. arXiv preprint arXiv:2306.03241, 2023.

  16. [16]

    Continual learning in vision-language models via aligned model merging

    Sokar, G., Dziugaite, G. K., Arnab, A., Iscen, A., Castro, P. S., and Schmid, C. Continual learning in vision-language models via aligned model merging. arXiv preprint arXiv:2506.03189, 2025.

  17. [17]

    Learning to tokenize for generative retrieval

    Sun, W., Yan, L., Chen, Z., Wang, S., Zhu, H., Ren, P., Chen, Z., Yin, D., Rijke, M., and Ren, Z. Learning to tokenize for generative retrieval. Advances in Neural Information Processing Systems, 36: 46345–46361, 2023.

  18. [18]

    An empirical study of multimodal model merging

    Sung, Y.-L., Li, L., Lin, K., Gan, Z., Bansal, M., and Wang, L. An empirical study of multimodal model merging. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 1563–1575, 2023.

  19. [19]

    Transformer memory as a differentiable search index

    Tay, Y., Tran, V., Dehghani, M., Ni, J., Bahri, D., Mehta, H., Qin, Z., Hui, K., Zhao, Z., Gupta, J., et al. Transformer memory as a differentiable search index. Advances in Neural Information Processing Systems, 35: 21831–21843, 2022.

  20. [20]

    LiNeS: Post-training layer scaling prevents forgetting and enhances model merging

    Wang, K., Dimitriadis, N., Favero, A., Ortiz-Jimenez, G., Fleuret, F., and Frossard, P. LiNeS: Post-training layer scaling prevents forgetting and enhances model merging. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=J5sUOvlLbQ

  21. [21]

    Model soups: Averaging weights of multiple fine-tuned models improves accuracy without increasing inference time

    Wortsman, M., Ilharco, G., Gadre, S. Y., Roelofs, R., Gontijo-Lopes, R., Morcos, A. S., Namkoong, H., Farhadi, A., Carmon, Y., Kornblith, S., et al. Model soups: Averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International Conference on Machine Learning, pp. 23965–23998. PMLR, 2022a.

  22. [22]

    Robust fine-tuning of zero-shot models

    Wortsman, M., Ilharco, G., Kim, J. W., Li, M., Kornblith, S., Roelofs, R., Lopes, R. G., Hajishirzi, H., Farhadi, A., Namkoong, H., et al. Robust fine-tuning of zero-shot models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7959–7971, 2022b.

  23. [23]

    TIES-Merging: Resolving interference when merging models

    Yadav, P., Tam, D., Choshen, L., Raffel, C. A., and Bansal, M. TIES-Merging: Resolving interference when merging models. Advances in Neural Information Processing Systems, 36: 7093–7115, 2023.

  24. [24]

    Model merging in LLMs, MLLMs, and beyond: Methods, theories, applications and opportunities

    Yang, E., Shen, L., Guo, G., Wang, X., Cao, X., Zhang, J., and Tao, D. Model merging in LLMs, MLLMs, and beyond: Methods, theories, applications and opportunities. arXiv preprint arXiv:2408.07666, 2024.

  25. [25]

    SoundStream: An end-to-end neural audio codec

    Zeghidour, N., Luebs, A., Omran, A., Skoglund, J., and Tagliasacchi, M. SoundStream: An end-to-end neural audio codec. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30: 495–507, 2021.