pith. sign in

arxiv: 2605.28577 · v1 · pith:UMCOMW3Knew · submitted 2026-05-27 · 💻 cs.AI · cs.LG

Continual Model Routing in Evolving Model Hubs

Pith reviewed 2026-06-29 11:54 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords continual model routingmodel hubscontrastive embeddingsCARvECMRBenchmodel selectionrouting strategiesadapter merging
0
0 comments X

The pith

CARvE uses contrastive embeddings with checkpoint anchoring and structured replay to route among thousands of models as hubs expand.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines continual model routing as the task of selecting from a growing collection of pre-trained models while new ones and new tasks arrive over time. It creates CMRBench, a benchmark that simulates this expansion with more than two thousand candidate models. The authors present CARvE, which learns embeddings contrastively by anchoring to past checkpoints and replaying structured examples to keep the router up to date. CARvE produces higher accuracy than zero-shot retrieval, fine-tuning, or adapter merging when measured at the level of individual models, model families, and domains. The approach addresses the scaling problem that arises once hubs contain too many experts for exhaustive evaluation at every query.

Core claim

CARvE is a contrastive embedding method for continual model routing that anchors representations to model checkpoints and applies structured replay; when evaluated on CMRBench it records higher model-level, family-level, and domain-level accuracy than zero-shot retrieval, fine-tuning, and adapter-merging baselines.

What carries the argument

CARvE: contrastive embedding trained with checkpoint-based anchoring and structured replay to update routing decisions as the set of available models grows.

If this is right

  • CARvE scales routing decisions across thousands of experts without requiring exhaustive evaluation for each query.
  • Routing mechanisms can be updated continually without full retraining when new models or tasks appear.
  • Accuracy gains hold at the individual model, model-family, and domain levels.
  • The method reduces reliance on zero-shot retrieval or repeated fine-tuning in evolving hubs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Deployed systems could keep a single router active for longer periods between major updates, lowering total compute spent on adaptation.
  • Contrastive anchoring may prove useful in other continual-selection settings where the candidate pool changes gradually rather than all at once.
  • The benchmark construction itself could be reused to test routing methods on different modalities or larger model counts.

Load-bearing premise

The patterns of model and task arrival built into CMRBench match the way real model hubs expand after deployment.

What would settle it

Running CARvE on a live model hub that adds new models and tasks according to actual usage logs and measuring whether its accuracy advantage over the three baselines disappears.

Figures

Figures reproduced from arXiv: 2605.28577 by Gerlando Gramaglia, Giacomo Carf\`i, Jack Bell, Vincenzo Lomonaco.

Figure 1
Figure 1. Figure 1: Conceptual framework for Continual Model Routing. (a) Given an evolving collection of AI models, our adaptive router CARvE learns continually from selection samples how to dynam￾ically route a prompt query to the most appropriate model. (b) Actual execution of the model (not the focus of this paper). This shift reflects a broader trend toward scaling by spe￾cialisation rather than by monolithic models. End… view at source ↗
Figure 2
Figure 2. Figure 2: Continual model routing in evolving hubs. (a): pre-inference routing selects a model by embedding similarity without executing candidate models. (b): continual training with candidate sets combines routing loss with checkpoint-based embedding and projection anchoring, while periodic hard-negative mining maintains discriminative candidates as the model hub expands. by comparing query embeddings against a co… view at source ↗
Figure 3
Figure 3. Figure 3: (a): The line plot shows performance when the model is trained in a continual learning setting up to Experience X, while the histogram reports the average accuracy on Experience X across all subsequent training stages. (b): Max-normalised domain accuracy, compute (TFLOPs), and GPU GB-hours for CARvE 10% Domain replay vs. CARvE cumulative and from-scratch baselines. Error bars show standard deviation over t… view at source ↗
Figure 4
Figure 4. Figure 4: Self-Instruct Prompt for Query Generation that involves the usage of a given model of Hugging Face 18 [PITH_FULL_IMAGE:figures/full_fig_p018_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Distribution of model downloads across benchmark experiences. we tested, operates strictly in a pre-inference setting, where routing decisions must be made before executing candidate models. Consequently, post-execution quality signals are unavailable at routing time. Our evaluation protocol therefore follows prior routing benchmarks such as APIBench and ToolMMBench, which similarly formulate routing as pr… view at source ↗
Figure 6
Figure 6. Figure 6: Token length distribution across learning experiences. Input lengths for the zero-shot configuration are shown in blue, while those for the RAT configuration are shown in orange. As observed across experiences, most zero-shot prompts are concentrated around 100 tokens. In contrast, RAT inputs are substantially longer, with the majority of prompts falling in the 400–600 token range due to the inclusion of m… view at source ↗
read the original abstract

AI model hubs provide access to a rapidly growing collection of powerful pre-trained models, enabling off-the-shelf mixture-of-experts systems with different routing strategies. However, this rapid growth poses two fundamental challenges: scaling model selection across thousands of experts and continually updating routing mechanisms as new models and tasks are introduced. In this paper, we formalise this setting as Continual Model Routing (CMR) and propose CMRBench, a new large-scale benchmark simulating realistic hub expansion and including over 2,000 candidate models. Finally, we introduce CARvE, a contrastive embedding approach for efficient continual model routing via checkpoint-based anchoring and structured replay. Extensive empirical results and ablations show that CARvE significantly outperforms zero-shot retrieval, fine-tuning, and adapter-merging baselines in model, family, and domain-level accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper formalizes the Continual Model Routing (CMR) setting for evolving AI model hubs, introduces CMRBench as a large-scale benchmark simulating realistic hub expansion with over 2,000 candidate models, and proposes CARvE, a contrastive embedding approach for continual routing that uses checkpoint-based anchoring and structured replay. It claims that CARvE significantly outperforms zero-shot retrieval, fine-tuning, and adapter-merging baselines across model-, family-, and domain-level accuracy metrics, supported by extensive empirical results and ablations.

Significance. If the results hold under realistic conditions, the work addresses a practically important problem in scaling mixture-of-experts systems to thousands of models under continual arrival of new models and tasks; the introduction of CMRBench could serve as a reusable testbed, and the contrastive embedding method offers an efficient alternative to repeated fine-tuning or merging.

major comments (2)
  1. [CMRBench section] Benchmark construction (CMRBench section): the load-bearing assumption that the simulated arrival process, ordering, task distribution shifts, and model family clustering accurately reflect deployed hub dynamics is not independently validated; without evidence that deviations from real dynamics do not inflate the reported gains, the outperformance claims on model/family/domain accuracy cannot be taken as establishing utility for the stated CMR setting.
  2. [Results and ablations] Experimental details (throughout results and ablations): the manuscript provides no information on data splits, statistical significance testing, exact baseline implementations, or variance across runs, which is required to determine whether the accuracy improvements are robust or could be artifacts of the benchmark construction.
minor comments (2)
  1. [CARvE method] Clarify notation for the contrastive loss and replay mechanism to ensure the method description is self-contained without reference to external contrastive embedding literature.
  2. [Efficiency analysis] Add explicit discussion of computational overhead for checkpoint anchoring and replay relative to the zero-shot and merging baselines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback on benchmark validation and experimental reporting. We address each major comment below with proposed revisions to strengthen the paper while remaining faithful to the work performed.

read point-by-point responses
  1. Referee: [CMRBench section] Benchmark construction (CMRBench section): the load-bearing assumption that the simulated arrival process, ordering, task distribution shifts, and model family clustering accurately reflect deployed hub dynamics is not independently validated; without evidence that deviations from real dynamics do not inflate the reported gains, the outperformance claims on model/family/domain accuracy cannot be taken as establishing utility for the stated CMR setting.

    Authors: We agree that the simulation is a modeling choice rather than a direct replication of proprietary deployment traces. CMRBench is built from public Hugging Face metadata (release dates, model cards, task tags) and standard dataset shifts; the arrival ordering follows observed exponential growth in model uploads. We cannot provide independent validation against closed-source hub logs. In revision we will (a) expand the benchmark construction subsection with explicit justification and citations to public trends, (b) add a limitations paragraph acknowledging possible mismatches, and (c) include sensitivity experiments under randomized and reversed arrival orders to test robustness of the reported gains. revision: partial

  2. Referee: [Results and ablations] Experimental details (throughout results and ablations): the manuscript provides no information on data splits, statistical significance testing, exact baseline implementations, or variance across runs, which is required to determine whether the accuracy improvements are robust or could be artifacts of the benchmark construction.

    Authors: We accept this criticism. The original submission omitted these details for brevity. The revised version will add: (1) explicit train/validation/test splits on the query-model pairs (70/15/15), (2) precise baseline reproduction details including learning rates, epochs, and merging hyperparameters, (3) results reported as mean ± standard deviation over five random seeds, and (4) statistical significance via paired Wilcoxon tests (p < 0.05). These changes will appear in the experimental setup and results sections. revision: yes

Circularity Check

0 steps flagged

No significant circularity; method and benchmark rely on established techniques without self-referential reduction

full rationale

The paper introduces the CMR setting, CMRBench for simulating hub expansion, and CARvE as a contrastive embedding method with anchoring and replay. These draw from standard contrastive learning and continual learning practices without any equations or derivations that reduce outputs to inputs by construction. No fitted parameters are renamed as predictions, no self-citation chains justify uniqueness theorems, and the benchmark construction does not create self-definitional loops in the method itself. The central empirical claims rest on external baselines and the new benchmark, which is a standard (non-circular) practice for novel settings.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no equations, parameters, or assumptions are stated, so ledger is empty.

pith-pipeline@v0.9.1-grok · 5672 in / 1001 out tokens · 28622 ms · 2026-06-29T11:54:15.493625+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

45 extracted references · 26 canonical work pages · 3 internal anchors

  1. [1]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

  2. [2]

    API - BLEND : A Comprehensive Corpora for Training and Benchmarking API LLMs

    Basu, K., Abdelaziz, I., Chaudhury, S., Dan, S., Crouse, M., Munawar, A., Austel, V., Kumaravel, S., Muthusamy, V., Kapanipathi, P., and Lastras, L. API - BLEND : A Comprehensive Corpora for Training and Benchmarking API LLMs . In Ku, L.-W., Martins, A., and Srikumar, V. (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational L...

  3. [3]

    N., Li, L., Li, M., Madeddu, M., Piccoli, E., and Lomonaco, V

    Bell, J., Quarantiello, L., Coleman, E. N., Li, L., Li, M., Madeddu, M., Piccoli, E., and Lomonaco, V. The Future of Continual Learning in the Era of Foundation Models : Three Key Directions . In Dell'Anna, D., Gezici, G., and Rossetti, G. (eds.), Proceedings of the Workshops at the Fourth International Conference on Hybrid Human - Artificial Intelligence...

  4. [4]

    M3- Embedding : Multi - Linguality , Multi - Functionality , Multi - Granularity Text Embeddings Through Self - Knowledge Distillation

    Chen, J., Xiao, S., Zhang, P., Luo, K., Lian, D., and Liu, Z. M3- Embedding : Multi - Linguality , Multi - Functionality , Multi - Granularity Text Embeddings Through Self - Knowledge Distillation . In Ku, L.-W., Martins, A., and Srikumar, V. (eds.), Findings of the Association for Computational Linguistics : ACL 2024 , pp.\ 2318--2335, Bangkok, Thailand,...

  5. [5]

    FrugalGPT : How to Use Large Language Models While Reducing Cost and Improving Performance

    Chen, L., Zaharia, M., and Zou, J. FrugalGPT : How to Use Large Language Models While Reducing Cost and Improving Performance . Trans. Mach. Learn. Res., 2024, 2024 b . URL https://openreview.net/forum?id=cSimKw5p6R

  6. [6]

    E., Fraser, A., and Dodge, J

    Chronopoulou, A., Peters, M. E., Fraser, A., and Dodge, J. AdapterSoup : Weight Averaging to Improve Generalization of Pretrained Language Models . In Vlachos, A. and Augenstein, I. (eds.), Findings of the Association for Computational Linguistics : EACL 2023, Dubrovnik , Croatia , May 2-6, 2023 , Findings of ACL , pp.\ 2009--2018. Association for Computa...

  7. [7]

    Switch Transformers : Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

    Fedus, W., Zoph, B., and Shazeer, N. Switch Transformers : Scaling to Trillion Parameter Models with Simple and Efficient Sparsity . Journal of Machine Learning Research, 23 0 (120): 0 1--39, 2022. ISSN 1533-7928. URL http://jmlr.org/papers/v23/21-0998.html

  8. [8]

    SPLADE : Sparse Lexical and Expansion Model for First Stage Ranking

    Formal, T., Piwowarski, B., and Clinchant, S. SPLADE : Sparse Lexical and Expansion Model for First Stage Ranking . In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval , SIGIR '21, pp.\ 2288--2292, New York, NY, USA, July 2021. Association for Computing Machinery. ISBN 978-1-4503-8037-9. doi:1...

  9. [9]

    F., Chow, T., Khare, I

    Guha, N., Chen, M. F., Chow, T., Khare, I. S., and Re, C. Smoothie: Label Free Language Model Routing . November 2024. URL https://openreview.net/forum?id=pPSWHsgqRp

  10. [10]

    J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W

    Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. LoRA : Low - Rank Adaptation of Large Language Models . In The Tenth International Conference on Learning Representations , ICLR 2022, Virtual Event , April 25-29, 2022 . OpenReview.net, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9

  11. [11]

    J., Bieker, J., Li, X., Jiang, N., Keigwin, B., Ranganath, G., Keutzer, K., and Upadhyay, S

    Hu, Q. J., Bieker, J., Li, X., Jiang, N., Keigwin, B., Ranganath, G., Keutzer, K., and Upadhyay, S. K. RouterBench : A Benchmark for Multi - LLM Routing System . July 2024. URL https://openreview.net/forum?id=IVXmV8Uxwh

  12. [12]

    Y., Pang, T., Du, C., and Lin, M

    Huang, C., Liu, Q., Lin, B. Y., Pang, T., Du, C., and Lin, M. LoraHub : Efficient Cross - Task Generalization via Dynamic LoRA Composition . August 2024. URL https://openreview.net/forum?id=TrloAXEJ2B

  13. [13]

    NeuralComputation3,79–87

    Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. Adaptive Mixtures of Local Experts . Neural Computation, 3 0 (1): 0 79--87, March 1991. ISSN 0899-7667. doi:10.1162/neco.1991.3.1.79. URL https://ieeexplore.ieee.org/abstract/document/6797059

  14. [14]

    Jiang, D., Ren, X., and Lin, B. Y. LLM - Blender : Ensembling Large Language Models with Pairwise Ranking and Generative Fusion . In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics ( Volume 1: Long Papers ) , pp.\ 14165--14178, Toronto, Canada, 2023. Association for Computational Linguistics. doi:10.18653/v1/2023.ac...

  15. [15]

    and Milan, Kieran and Quan, John and Ramalho, Tiago and Grabska-Barwinska, Agnieszka and Hassabis, Demis and Clopath, Claudia and Kumaran, Dharshan and Hadsell, Raia , year=

    Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., Hassabis, D., Clopath, C., Kumaran, D., and Hadsell, R. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114 0 (13): 0 3521--3526, March 2017. doi:10.1073/pna...

  16. [16]

    Retrieval- Augmented Generation for Knowledge - Intensive NLP Tasks

    Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., Riedel, S., and Kiela, D. Retrieval- Augmented Generation for Knowledge - Intensive NLP Tasks . In Advances in Neural Information Processing Systems , volume 33, pp.\ 9459--9474. Curran Associates, Inc., 2020. URL https://proceedin...

  17. [17]

    Api-bank: A comprehensive benchmark for tool-augmented llms

    Li, M., Zhao, Y., Yu, B., Song, F., Li, H., Yu, H., Li, Z., Huang, F., and Li, Y. API - Bank : A Comprehensive Benchmark for Tool - Augmented LLMs . In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pp.\ 3102--3116, Singapore, 2023. Association for Computational Linguistics. doi:10.18653/v1/2023.emnlp-main.187. UR...

  18. [18]

    and Hoiem, D

    Li, Z. and Hoiem, D. Learning without Forgetting . IEEE Trans. Pattern Anal. Mach. Intell., 40 0 (12): 0 2935--2947, December 2018. ISSN 0162-8828. doi:10.1109/TPAMI.2017.2773081. URL https://doi.org/10.1109/TPAMI.2017.2773081

  19. [19]

    Olympus: A Universal Task Router for Computer Vision Tasks

    Lin, Y., Li, Y., Chen, D., Xu, W., Clark, R., and Torr, P. Olympus: A Universal Task Router for Computer Vision Tasks . pp.\ 14235--14246, 2025. URL https://openaccess.thecvf.com/content/CVPR2025/html/Lin_Olympus_A_Universal_Task_Router_for_Computer_Vision_Tasks_CVPR_2025_paper.html

  20. [20]

    COLT : Enhancing Video Large Language Models with Continual Tool Usage

    Liu, Y., Cao, M., Shi, X., and Liang, X. COLT : Enhancing Video Large Language Models with Continual Tool Usage . Transactions on Machine Learning Research, November 2025. ISSN 2835-8856. URL https://openreview.net/forum?id=NT9tHHTlXn

  21. [21]

    L., De Lange, M., Masana, M., Pomponi, J., van de Ven, G

    Lomonaco, V., Pellegrini, L., Cossu, A., Carta, A., Graffieti, G., Hayes, T. L., De Lange, M., Masana, M., Pomponi, J., van de Ven, G. M., Mundt, M., She, Q., Cooper, K., Forest, J., Belouadah, E., Calderara, S., Parisi, G. I., Cuzzolin, F., Tolias, A. S., Scardapane, S., Antiga, L., Ahmad, S., Popescu, A., Kanan, C., van de Weijer, J., Tuytelaars, T., Ba...

  22. [22]

    R., and Yazdani, M

    Mohammadshahi, A., Shaikh, A. R., and Yazdani, M. Routoo: Learning to Route to Large Language Models Effectively . October 2024. URL https://openreview.net/forum?id=RQ9fQLEajC

  23. [23]

    E., Kadous, M

    Ong, I., Almahairi, A., Wu, V., Chiang, W.-L., Wu, T., Gonzalez, J. E., Kadous, M. W., and Stoica, I. RouteLLM : Learning to Route LLMs from Preference Data . October 2024. URL https://openreview.net/forum?id=8sSqNntaMr

  24. [24]

    I., Kemker, R., Part, J

    Parisi, G. I., Kemker, R., Part, J. L., Kanan, C., and Wermter, S. Continual lifelong learning with neural networks: A review. Neural Networks, 113: 0 54--71, May 2019. ISSN 0893-6080. doi:10.1016/j.neunet.2019.01.012. URL https://www.sciencedirect.com/science/article/pii/S0893608019300231

  25. [25]

    G., Zhang, T., Wang, X., and Gonzalez, J

    Patil, S. G., Zhang, T., Wang, X., and Gonzalez, J. E. Gorilla: Large Language Model Connected with Massive APIs . Advances in Neural Information Processing Systems, 37: 0 126544--126565, December 2024. doi:10.52202/079017-4020. URL https://proceedings.neurips.cc/paper_files/paper/2024/hash/e4c61f578ff07830f5c37378dd3ecb0d-Abstract-Conference.html

  26. [26]

    verdict":

    Reimers, N. and Gurevych, I. Sentence- BERT : Sentence Embeddings using Siamese BERT - Networks . In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing ( EMNLP - IJCNLP ) , pp.\ 3980--3990, Hong Kong, China, 2019. Association for Computational Lin...

  27. [27]

    and Zaragoza, H

    Robertson, S. and Zaragoza, H. The Probabilistic Relevance Framework : BM25 and Beyond . Foundations and Trends® in Information Retrieval, 3 0 (4): 0 333--389, 2009. ISSN 1554-0669, 1554-0677. doi:10.1561/1500000019. URL http://www.nowpublishers.com/article/Details/INR-019

  28. [28]

    Don't forget, there is more than forgetting: new metrics for Continual Learning

    Rodríguez, N. D., Lomonaco, V., Filliat, D., and Maltoni, D. Don't forget, there is more than forgetting: new metrics for Continual Learning . CoRR, abs/1810.13166, 2018. URL http://arxiv.org/abs/1810.13166. arXiv: 1810.13166

  29. [29]

    V., Hinton, G

    Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q. V., Hinton, G. E., and Dean, J. Outrageously Large Neural Networks : The Sparsely - Gated Mixture -of- Experts Layer . In 5th International Conference on Learning Representations , ICLR 2017, Toulon , France , April 24-26, 2017, Conference Track Proceedings . OpenReview.net, 2017. URL https://ope...

  30. [30]

    HuggingGPT : Solving AI Tasks with ChatGPT and its Friends in Hugging Face

    Shen, Y., Song, K., Tan, X., Li, D., Lu, W., and Zhuang, Y. HuggingGPT : Solving AI Tasks with ChatGPT and its Friends in Hugging Face . In Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, Ne...

  31. [31]

    TaskBench : Benchmarking Large Language Models for Task Automation

    Shen, Y., Song, K., Tan, X., Zhang, W., Ren, K., Yuan, S., Lu, W., Li, D., and Zhuang, Y. TaskBench : Benchmarking Large Language Models for Task Automation . Advances in Neural Information Processing Systems, 37: 0 4540--4574, December 2024. doi:10.52202/079017-0148. URL https://proceedings.neurips.cc/paper_files/paper/2024/hash/085185ea97db31ae6dcac7497...

  32. [32]

    UniRoute : Unified Routing Mixture -of- Experts for Modality - Adaptive Remote Sensing Change Detection , January 2026

    Shu, Q., Chen, S., Lu, W., You, Z., and Liu, C. UniRoute : Unified Routing Mixture -of- Experts for Modality - Adaptive Remote Sensing Change Detection , January 2026. URL http://arxiv.org/abs/2601.14797. arXiv:2601.14797 [cs]

  33. [33]

    Toolorchestra: Elevating intelligence via efficient model and tool orchestration.arXiv, 2511.21689, 2025

    Su, H., Diao, S., Lu, X., Liu, M., Xu, J., Dong, X., Fu, Y., Belcak, P., Ye, H., Yin, H., Dong, Y., Bakhturina, E., Yu, T., Choi, Y., Kautz, J., and Molchanov, P. ToolOrchestra : Elevating Intelligence via Efficient Model and Tool Orchestration , November 2025. URL https://arxiv.org/abs/2511.21689v1

  34. [34]

    F., Ilhan, F., Huang, T., Hu, S., and Liu, L

    Tekin, S. F., Ilhan, F., Huang, T., Hu, S., and Liu, L. LLM - TOPLA : Efficient LLM Ensemble by Maximising Diversity . In Al-Onaizan, Y., Bansal, M., and Chen, Y.-N. (eds.), Findings of the Association for Computational Linguistics : EMNLP 2024 , pp.\ 11951--11966, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi:10.18653...

  35. [35]

    Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C. C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa...

  36. [36]

    M., Tuytelaars, T., and Tolias, A

    van de Ven, G. M., Tuytelaars, T., and Tolias, A. S. Three types of incremental learning. Nature Machine Intelligence, 4 0 (12): 0 1185--1197, December 2022. ISSN 2522-5839. doi:10.1038/s42256-022-00568-3. URL https://www.nature.com/articles/s42256-022-00568-3

  37. [37]

    MLLM - Tool : A Multimodal Large Language Model for Tool Agent Learning

    Wang, C., Luo, W., Dong, S., Xuan, X., Li, Z., Ma, L., and Gao, S. MLLM - Tool : A Multimodal Large Language Model for Tool Agent Learning . In 2025 IEEE / CVF Winter Conference on Applications of Computer Vision ( WACV ) , pp.\ 6678--6687, February 2025. doi:10.1109/WACV61041.2025.00650. URL https://ieeexplore.ieee.org/abstract/document/10943671

  38. [38]

    Smith, Daniel Khashabi, and Hannaneh Hajishirzi

    Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D., and Hajishirzi, H. Self- Instruct : Aligning Language Models with Self - Generated Instructions . In Rogers, A., Boyd-Graber, J., and Okazaki, N. (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics ( Volume 1: Long Papers ) , pp.\ 13484--13508...

  39. [39]

    On the Tool Manipulation Capability of Open -source Large Language Models , May 2023

    Xu, Q., Hong, F., Li, B., Hu, C., Chen, Z., and Zhang, J. On the Tool Manipulation Capability of Open -source Large Language Models , May 2023. URL http://arxiv.org/abs/2305.16504

  40. [40]

    A., and Bansal, M

    Yadav, P., Tam, D., Choshen, L., Raffel, C. A., and Bansal, M. TIES - Merging : Resolving Interference When Merging Models . In Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans , ...

  41. [41]

    Qwen3 Technical Report

    Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., Zheng, C., Liu, D., Zhou, F., Huang, F., Hu, F., Ge, H., Wei, H., Lin, H., Tang, J., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Zhou, J., Lin, J., Dang, K., Bao, K., Yang, K., Yu, L., Deng, L., Li, M., Xue, M., Li, M., Zhang, P., Wang, P., Zhu, Q...

  42. [42]

    Language models are super mario: absorbing abilities from homologous models as a free lunch

    Yu, L., Yu, B., Yu, H., Huang, F., and Li, Y. Language models are super mario: absorbing abilities from homologous models as a free lunch. In Proceedings of the 41st International Conference on Machine Learning , volume 235 of ICML '24 , pp.\ 57755--57775, Vienna, Austria, 2024. JMLR.org

  43. [43]

    Model Spider : Learning to Rank Pre - Trained Models Efficiently

    Zhang, Y.-K., Huang, T.-J., Ding, Y.-X., Zhan, D.-C., and Ye, H.-J. Model Spider : Learning to Rank Pre - Trained Models Efficiently . Advances in Neural Information Processing Systems, 36: 0 13692--13719, December 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/2c71b14637802ed08eaa3cf50342b2b9-Abstract-Conference.html

  44. [44]

    LoraRetriever : Input - Aware LoRA Retrieval and Composition for Mixed Tasks in the Wild

    Zhao, Z., Gan, L., Wang, G., Zhou, W., Yang, H., Kuang, K., and Wu, F. LoraRetriever : Input - Aware LoRA Retrieval and Composition for Mixed Tasks in the Wild . In Ku, L.-W., Martins, A., and Srikumar, V. (eds.), Findings of the Association for Computational Linguistics , ACL 2024, Bangkok , Thailand and virtual meeting, August 11-16, 2024 , Findings of ...

  45. [45]

    Zhao, Z., Jin, S., and Mao, Z. M. Eagle: Efficient Training - Free Router for Multi - LLM Inference , October 2024 b . URL http://arxiv.org/abs/2409.15518. arXiv:2409.15518 [cs]