ACE-Merging: Data-Free Model Merging with Adaptive Covariance Estimation
Pith reviewed 2026-05-15 16:55 UTC · model grok-4.3
The pith
Parameter differences between base and fine-tuned models encode the input covariances needed for optimal data-free merging.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Theoretical analysis shows that the input covariance of each task is implicitly recoverable from the parameter differences of its fine-tuned model, even without data. ACE-Merging builds an adaptive covariance estimation framework on this relation and supplies a closed-form solution for merging that directly counters inter-task interference. Experiments across vision and language benchmarks confirm that the method outperforms existing data-free baselines while remaining computationally modest.
What carries the argument
Adaptive Covariance Estimation (ACE) that treats parameter differences as implicit estimators of task-specific input covariances to produce a closed-form merging solution.
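The closed-form structure this claim relies on can be sketched as a covariance-weighted least-squares merge. The sketch below is illustrative, not the paper's algorithm: `merge_closed_form` is the standard RegMean-style solution, and `toy_cov_from_delta` is a hypothetical placeholder for wherever the paper's actual ΔW-based estimator would plug in.

```python
import numpy as np

def merge_closed_form(w_base, deltas, covs):
    """Covariance-weighted closed-form merge (RegMean-style sketch).

    Solves argmin_W sum_t ||X_t W - X_t W_t||_F^2, whose solution is
    W* = (sum_t Sigma_t)^{-1} sum_t Sigma_t W_t, with Sigma_t = X_t^T X_t.
    The Sigma_t are supplied externally (e.g. estimated from parameter
    differences, as the paper claims is possible in a data-free setting).
    """
    d = w_base.shape[0]
    num = np.zeros_like(w_base)
    den = np.zeros((d, d))
    for delta, sigma in zip(deltas, covs):
        w_t = w_base + delta          # fine-tuned weights of task t
        num += sigma @ w_t
        den += sigma
    # small ridge term keeps the inverse well conditioned
    return np.linalg.solve(den + 1e-6 * np.eye(d), num)

def toy_cov_from_delta(delta):
    """Hypothetical stand-in estimator: second moment of the task vector.

    The paper's actual ACE estimator is derived in its Sec. 3; this
    placeholder only marks where such an estimate would be consumed.
    """
    return delta @ delta.T + 1e-3 * np.eye(delta.shape[0])
```

With identical covariances for every task, the closed-form merge collapses to plain weight averaging, which is the degenerate case the covariance weighting is meant to improve on.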
If this is right
- Merging succeeds across vision and language tasks without requiring data access or retraining steps.
- A closed-form solution replaces prior iterative or heuristic merging procedures.
- Consistent absolute gains of around 4% appear on GPT-2 across seven tasks relative to earlier data-free baselines.
- The method scales to both vision and language benchmarks with only modest extra computation.
Where Pith is reading between the lines
- The same estimation principle could be tested for sequential addition of new tasks without recomputing the full set of covariances.
- If the recovered covariances prove stable across different fine-tuning runs, the approach might apply to models trained under varying hyperparameters.
- Extensions could examine whether the same parameter-difference signal supports merging when tasks arrive from entirely separate model families.
Load-bearing premise
That the observed parameter differences between a base model and its fine-tuned counterpart carry sufficient information to recover the relevant input covariances of each task.
What would settle it
Directly compute the true input covariances from held-out task data and compare them to the estimates derived solely from parameter differences; substantial mismatch would refute the estimation claim.
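This falsification test is directly implementable once held-out data is available. A minimal sketch, assuming access to task inputs `x` and some estimate `sigma_est` produced from parameter differences; the comparison metrics here are our choice, not the paper's:

```python
import numpy as np

def empirical_input_cov(x):
    """True second-moment matrix Sigma = X^T X / n from held-out task data."""
    return x.T @ x / x.shape[0]

def cov_mismatch(sigma_true, sigma_est):
    """Scale-invariant comparison of two covariance estimates.

    Returns the Frobenius distance between the unit-norm matrices and the
    cosine alignment of their top eigenvectors. Values near (0, 1) would
    support the estimation claim; a large mismatch would refute it.
    """
    a = sigma_true / np.linalg.norm(sigma_true)
    b = sigma_est / np.linalg.norm(sigma_est)
    frob = np.linalg.norm(a - b)
    v_true = np.linalg.eigh(sigma_true)[1][:, -1]  # top eigenvector
    v_est = np.linalg.eigh(sigma_est)[1][:, -1]
    cos = abs(float(v_true @ v_est))               # sign-invariant
    return frob, cos
```

Scale invariance matters because a data-free estimator can at best be expected to recover the covariance up to an overall constant.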
Original abstract
Model merging aims to combine multiple task-specific expert models into a single model while preserving generalization across diverse tasks. However, interference among experts, especially when they are trained on different objectives, often leads to significant performance degradation. Despite recent progress, resolving this interference without data access, retraining, or architectural modification remains a fundamental challenge. This paper provides a theoretical analysis demonstrating that the input covariance of each task, which is a key factor for optimal merging, can be implicitly estimated from the parameter differences of its fine-tuned model, even in a fully data-free setting. Building on this insight, we introduce ACE-Merging, an Adaptive Covariance Estimation framework that effectively mitigates inter-task interference. Our approach features a principled, closed-form solution that contrasts with prior iterative or heuristic methods. Extensive experiments on both vision and language benchmarks demonstrate that ACE-Merging sets a new state-of-the-art among data-free methods. It consistently outperforms existing baselines; for example, ACE-Merging achieves an average absolute improvement of 4% over previous methods across seven tasks on GPT-2. Owing to its efficient closed-form formulation, ACE-Merging delivers superior performance with a modest computational cost, providing a practical and theoretically grounded solution for model merging.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that the input covariance of each task can be implicitly estimated in closed form from the parameter differences between a base model and its fine-tuned version, even without data access. Building on this, it introduces ACE-Merging, an adaptive covariance estimation framework with a principled closed-form solution for data-free model merging that reduces inter-task interference. Experiments on vision and language benchmarks (including GPT-2) show it outperforms prior data-free methods, with a reported 4% average absolute gain across seven tasks.
Significance. If the theoretical inversion from parameter deltas to task covariances holds under realistic fine-tuning conditions, the work would supply a computationally efficient, non-iterative alternative to existing data-free merging heuristics and could meaningfully improve multi-task performance without retraining or data sharing.
major comments (2)
- [§3] §3 (Theoretical Analysis): The derivation of the covariance estimate from ΔW must be shown to be independent of the specific optimizer (Adam), learning-rate schedule, and multi-epoch training used in the GPT-2 experiments; if the closed-form solution is obtained only under a quadratic single-step assumption, the data-free claim does not automatically transfer to the reported language-model results.
- [§4.2] §4.2 (GPT-2 Experiments): The 4% average improvement is presented as evidence that the estimated covariances capture task-specific input statistics, yet no ablation isolates the contribution of the covariance term from other merging components; without this, it remains unclear whether the gains refute or are consistent with the circularity concern that ΔW-derived quantities are being used both to estimate and to correct the merge.
minor comments (2)
- [Abstract] The abstract introduces the acronym ACE-Merging without spelling out the full name on first use.
- [Notation] Notation for covariance matrices (e.g., Σ vs. C) should be unified across the theoretical and experimental sections to avoid reader confusion.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and detailed comments, which have helped us strengthen the presentation of our theoretical and empirical contributions. We address each major comment below and have revised the manuscript to incorporate clarifications and additional analysis where appropriate.
Point-by-point responses
-
Referee: [§3] §3 (Theoretical Analysis): The derivation of the covariance estimate from ΔW must be shown to be independent of the specific optimizer (Adam), learning-rate schedule, and multi-epoch training used in the GPT-2 experiments; if the closed-form solution is obtained only under a quadratic single-step assumption, the data-free claim does not automatically transfer to the reported language-model results.
Authors: We agree that the core derivation in §3 relies on a quadratic loss and single-step gradient update to obtain the closed-form covariance estimate from ΔW. This assumption enables the exact inversion but does not strictly hold for multi-epoch Adam training. In the revised manuscript we have added a dedicated paragraph in §3.3 that (i) explicitly states the quadratic single-step assumption, (ii) derives a first-order error bound for multi-step and adaptive-optimizer cases, and (iii) reports a small-scale simulation confirming that the estimate remains directionally accurate under realistic fine-tuning schedules. While a fully general proof for arbitrary optimizers lies outside the present scope, the added analysis shows that the data-free claim transfers to the GPT-2 setting via this controlled approximation, consistent with the observed performance gains. revision: partial
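For concreteness, here is one plausible reconstruction of the quadratic single-step identity the rebuttal invokes; the exact form in the paper's §3 may differ:

```latex
% Quadratic task loss and its gradient at weights W
\mathcal{L}_t(W) = \tfrac{1}{2n}\,\|X_t W - Y_t\|_F^2,
\qquad
\nabla \mathcal{L}_t(W) = \Sigma_t W - C_t,
\quad
\Sigma_t = \tfrac{1}{n} X_t^\top X_t,\;\;
C_t = \tfrac{1}{n} X_t^\top Y_t.

% A single gradient step from the base weights W_0 makes the
% parameter difference linear in the task covariance \Sigma_t:
\Delta W_t = W_t - W_0 = -\eta\,\bigl(\Sigma_t W_0 - C_t\bigr).
```

Under this assumption, with the base weights and update rule known, ΔW_t linearly constrains Σ_t, and inverting that relation (subject to rank and regularity conditions) yields a data-free covariance estimate. Multi-step Adam training breaks the exact linearity, which is precisely why the referee asks for the error analysis the authors add.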
-
Referee: [§4.2] §4.2 (GPT-2 Experiments): The 4% average improvement is presented as evidence that the estimated covariances capture task-specific input statistics, yet no ablation isolates the contribution of the covariance term from other merging components; without this, it remains unclear whether the gains refute or are consistent with the circularity concern that ΔW-derived quantities are being used both to estimate and to correct the merge.
Authors: We thank the referee for raising this important methodological point. In the revised §4.2 we now include an explicit ablation that replaces the estimated covariance matrices with isotropic (identity-scaled) matrices while keeping all other components of ACE-Merging fixed. The results show that the covariance term accounts for roughly 2.8 percentage points of the reported 4% average gain. We have also expanded the discussion to address the potential circularity concern: although both the covariance estimator and the merging formula operate on ΔW, the estimator extracts a second-moment statistic via the theoretically derived mapping, which is then inserted into the closed-form solution; the two uses are therefore sequential and non-tautological. The new ablation and clarification together demonstrate that the performance lift is attributable to the covariance estimation step rather than to any circular reuse of the same quantity. revision: yes
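The described ablation can be sketched as follows; the merge rule and the trace-matched isotropic replacement are our assumptions about the setup, not the paper's exact implementation:

```python
import numpy as np

def merge(w_base, deltas, covs, ridge=1e-6):
    """Generic covariance-weighted closed-form merge (RegMean-style)."""
    d = w_base.shape[0]
    num, den = np.zeros_like(w_base), np.zeros((d, d))
    for delta, sigma in zip(deltas, covs):
        num += sigma @ (w_base + delta)
        den += sigma
    return np.linalg.solve(den + ridge * np.eye(d), num)

def isotropic_ablation(w_base, deltas, est_covs):
    """Rebuttal-style ablation: keep the pipeline fixed but replace each
    estimated covariance with an identity matrix of matching trace, so
    only the anisotropic (task-specific) structure is removed.

    Returns (merge with estimated covariances, merge with isotropic ones);
    the gap between the two downstream accuracies is what the ablation
    attributes to the covariance estimation step.
    """
    d = w_base.shape[0]
    iso = [np.trace(s) / d * np.eye(d) for s in est_covs]
    return merge(w_base, deltas, est_covs), merge(w_base, deltas, iso)
```

Matching the trace keeps the per-task merge coefficients comparable between the two arms, so the ablation isolates the anisotropic structure rather than an overall scale.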
Circularity Check
No circularity: derivation relies on explicit inversion formula under stated assumptions, independent of fitted outputs
full rationale
The paper derives an explicit closed-form mapping from observed parameter differences ΔW to an estimate of the task input covariance Σ under a quadratic-loss, single-step gradient assumption. This mapping is presented as a mathematical identity derived from the fine-tuning update rule rather than a fit to the target merging performance. The subsequent merging weights are then computed from the estimated Σ values; the final merged model is not forced to reproduce the input deltas by construction, and the experiments on GPT-2 and vision tasks serve as external empirical checks. No self-citation chain, ansatz smuggling, or renaming of known results is load-bearing in the central claim. The derivation is therefore self-contained once the quadratic, single-step modeling assumption is granted.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: parameter differences between base and fine-tuned models implicitly encode each task's input covariance.