pith. machine review for the scientific record.

arxiv: 2605.13936 · v1 · submitted 2026-05-13 · 💻 cs.LG · cs.AI · cs.DC

Recognition: no theorem link

Towards the Next Frontier of LLMs, Training on Private Data: A Cross-Domain Benchmark for Federated Fine-Tuning

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 04:52 UTC · model grok-4.3

classification 💻 cs.LG cs.AI cs.DC
keywords federated learning · LLM fine-tuning · private data · PEFT · non-IID · healthcare NLP · financial NLP · LoRA

The pith

Federated fine-tuning lets LLMs adapt to private institutional data in healthcare and finance while matching centralized training performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that large language models can be jointly fine-tuned across separate institutions holding private data without any data exchange. It evaluates this on four closed-ended tasks drawn from medical and financial domains under data partitions that reflect real differences in patient populations, documentation styles, and label distributions. Parameter-efficient methods keep the process practical on distributed hardware. A sympathetic reader would see this as a route to stronger domain-specific LLMs that respect regulatory barriers. The work shows the federated route closes most of the gap to pooled training and beats training on any single institution's data alone.

Core claim

Federated fine-tuning of pretrained LLMs using LoRA, QLoRA, and IA3 across non-IID institutional silos achieves accuracy close to centralized training on MedQA, MedMCQA, FPB, and FiQA-SA while clearly surpassing isolated single-site fine-tuning.

What carries the argument

A federated fine-tuning framework that coordinates parameter-efficient updates (LoRA, QLoRA, IA3) across nodes without moving raw data, tested on four QA and classification datasets under controlled non-IID splits.
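The coordination step this framework performs can be illustrated with a minimal sketch of federated averaging over low-rank adapter updates. This is not the paper's implementation (which runs on the Sherpa.ai platform); the parameter names, shapes, and client sizes below are hypothetical:

```python
import numpy as np

def fedavg(client_updates, client_sizes):
    """Size-weighted average of per-client adapter updates (FedAvg).

    client_updates: list of dicts mapping adapter-parameter names
    (e.g. LoRA A/B matrices) to numpy arrays. Only these small
    adapter tensors are exchanged; raw data never leaves a client.
    """
    total = sum(client_sizes)
    merged = {}
    for name in client_updates[0]:
        merged[name] = sum(
            (n / total) * upd[name]
            for upd, n in zip(client_updates, client_sizes)
        )
    return merged

# Three hypothetical institutions, each holding a rank-4 LoRA update
rng = np.random.default_rng(0)
updates = [{"lora_A": rng.normal(size=(4, 16))} for _ in range(3)]
sizes = [1000, 500, 250]  # examples held per institution
global_update = fedavg(updates, sizes)
```

The server would apply `global_update` to the shared adapters and broadcast them back for the next round; the backbone weights stay frozen throughout, which is what keeps communication cost low.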

If this is right

  • Private data in regulated sectors becomes usable for LLM adaptation without violating privacy rules.
  • QLoRA and IA3 deliver most of the accuracy of full fine-tuning at lower communication and compute cost in the federated setting.
  • Collaboration across institutions improves results over any single institution training alone.
  • The approach scales to cross-domain benchmarks without requiring identical data distributions at each site.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar federated setups could be tested on other regulated domains such as legal documents or government records if comparable non-IID patterns appear.
  • Combining the method with differential privacy or secure aggregation might further strengthen privacy guarantees while preserving the observed accuracy.
  • The efficiency gains from QLoRA and IA3 suggest the same techniques could reduce the carbon cost of distributed LLM training in other settings.
  • Future benchmarks could measure how performance changes when the number of participating institutions or the degree of data imbalance increases.

Load-bearing premise

The synthetic non-IID partitions and the four chosen datasets are representative of the heterogeneity that actually exists across real hospitals and financial institutions.
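Synthetic label-skew partitions of this kind are commonly generated with a Dirichlet draw over per-client label proportions, where a smaller concentration parameter yields more heterogeneous silos. A minimal sketch of such a splitter (the alpha value and client count are illustrative, not taken from the paper):

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Assign example indices to clients with Dirichlet label skew.

    Lower alpha -> more skewed (more non-IID) label distributions
    per client; high alpha approaches an IID split.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_idx = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        # Fraction of class c handed to each client
        props = rng.dirichlet([alpha] * n_clients)
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, part in zip(client_idx, np.split(idx, cuts)):
            client.extend(part.tolist())
    return client_idx

labels = [i % 3 for i in range(300)]      # 3 balanced classes
parts = dirichlet_partition(labels, n_clients=4, alpha=0.5)
```

Whether splits produced this way match the heterogeneity of real hospital or bank silos is exactly the premise at issue; the alpha knob controls only one axis (label skew), not documentation style or population drift.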

What would settle it

A live multi-institution deployment testing whether the federated model's accuracy on held-out test sets stays within a few points of the accuracy obtained by centralized training on the same tasks; a gap much larger than that would break the central claim.

Figures

Figures reproduced from arXiv: 2605.13936 by Daniel M. Jimenez-Gutierrez, Enrique Zuazua, Georgios Kellaris, Joaquin Del Rio, Oleksii Sliusarenko, Xabi Uribe-Etxebarria.

Figure 1. Global map illustrating institution-level FiQA-SA accuracies for the best federated model.
Figure 2. Overview of the simplified LLM fine-tuning process from pre-training to domain-specific adaptation.
Figure 3. Classical architecture for centralized training.
Figure 4. Proposed architecture for federated fine-tuning with privacy-preserving orchestration.
Figure 5. Label distribution across institutions (INS) for the non-IID partitions used in each dataset. Each stacked bar […]
Figure 6. Accuracy for the Single-institution, Centralized, and Federated scenarios for the best model for the […]
Figure 7. Accuracy for the Single-institution, Centralized, and Federated scenarios for a representative […]
Figure 8. Memory footprint (GB) in the Federated scenario for the five best-performing models under QLoRA.
read the original abstract

The recent success of large language models (LLMs) has been largely driven by vast public datasets. However, the next frontier for LLM development lies beyond public data. Much of the world's most valuable information is private, especially in highly regulated sectors such as healthcare and finance, where data include patient histories or customer communications. Unlocking this data could represent a major leap forward, enabling LLMs with deeper domain expertise and stronger real-world utility. Yet, these data cannot be shared because they are distributed across institutions and constrained by privacy, regulatory, and organizational barriers. Moreover, institutional datasets are typically non-independent and identically distributed (non-IID), differing across sites in population characteristics, data modalities, documentation patterns, and task-specific label distributions. In this paper, we demonstrate a practical approach to unlocking private and distributed institutional data for LLM adaptation through federated collaboration across data silos. Built on the Sherpa.ai Federated Learning platform, our framework enables nodes to jointly fine-tune a shared LLM without exchanging private data. We evaluate this approach through a cross-domain benchmark in healthcare and finance, using four closed-ended question answering and classification datasets: MedQA, MedMCQA, FPB, and FiQA-SA. We compare three parameter-efficient fine-tuning (PEFT) strategies (LoRA, QLoRA, and IA3) across pretrained backbones under non-IID settings reflecting institutional data heterogeneity. Our results show that federated fine-tuning performs close to centralized training and outperforms isolated single-institution learning. From a Green AI perspective, QLoRA and IA3 improve efficiency with limited accuracy degradation, supporting federated PEFT as a viable approach for adapting LLMs where data cannot be shared.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces a federated fine-tuning framework for LLMs on private institutional data in healthcare and finance, built on the Sherpa.ai platform. Using four datasets (MedQA, MedMCQA, FPB, FiQA-SA) and three PEFT methods (LoRA, QLoRA, IA3), it evaluates performance under non-IID partitions that simulate institutional heterogeneity. The central empirical claim is that federated fine-tuning achieves performance close to centralized training while outperforming isolated single-institution learning, with additional efficiency benefits from QLoRA and IA3.

Significance. If the quantitative ordering holds under more rigorous statistical controls, the work would be significant for demonstrating a practical path to adapting LLMs on siloed private data without direct sharing. It supplies a cross-domain benchmark that directly compares federated, centralized, and isolated regimes, and it highlights Green-AI trade-offs via parameter-efficient methods. These elements address a timely gap between public-data LLM scaling and regulated-domain constraints.

major comments (3)
  1. [Abstract and Experimental Results] Abstract and results section: the claim that federated fine-tuning 'performs close to centralized training' is presented without error bars, confidence intervals, or statistical significance tests across the four datasets. This leaves the quantitative support for the central ordering only moderately grounded, as noted in the soundness assessment.
  2. [Experimental Setup] §4 (Experimental Setup): the non-IID partitions are generated via dataset splitting and client assignment. The manuscript does not include ablation or diagnostic experiments that quantify how well these partitions reproduce real institutional differences in modalities, documentation patterns, or population-level label skew. If the induced heterogeneity is milder than authentic silos, the observed closeness to centralized training may not generalize.
  3. [Results] Results tables/figures: full hyperparameter schedules, random seeds, and training curves are not reported. Without these details it is difficult to reproduce the exact federated-versus-centralized gaps or to assess sensitivity to the chosen non-IID degree.
minor comments (2)
  1. [Methods] Notation for the three PEFT variants (LoRA, QLoRA, IA3) should be introduced with a brief equation or reference in the methods section for readers unfamiliar with the specific adapters.
  2. [Discussion] The Green-AI efficiency claims would benefit from explicit reporting of peak memory and FLOPs per method rather than qualitative statements.
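For the memory half of that report, peak allocation can be captured with standard tooling rather than qualitative statements. A minimal sketch using Python's `tracemalloc` for host-side allocations (GPU training would instead read framework counters such as CUDA memory statistics; the workload below is a hypothetical stand-in for a fine-tuning step):

```python
import tracemalloc

def peak_memory_mb(fn, *args, **kwargs):
    """Run fn and return (result, peak Python-heap allocation in MB)."""
    tracemalloc.start()
    try:
        result = fn(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / 1e6

# Hypothetical stand-in for one training step's allocations
result, peak = peak_memory_mb(lambda: [0.0] * 1_000_000)
```

Reporting such peaks per PEFT method, alongside FLOP estimates, would make the QLoRA/IA3 efficiency comparison auditable.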

Simulated Authors' Rebuttal

3 responses · 1 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each of the major comments below and outline the revisions we plan to make to strengthen the paper.

read point-by-point responses
  1. Referee: [Abstract and Experimental Results] Abstract and results section: the claim that federated fine-tuning 'performs close to centralized training' is presented without error bars, confidence intervals, or statistical significance tests across the four datasets. This leaves the quantitative support for the central ordering only moderately grounded, as noted in the soundness assessment.

    Authors: We agree with the referee that providing error bars, confidence intervals, and statistical significance tests would better ground our central claims. In the revised manuscript, we will rerun the experiments with multiple random seeds, report mean performance with standard deviations, and include statistical tests (e.g., paired t-tests) to compare the federated, centralized, and isolated settings across the datasets. revision: yes

  2. Referee: [Experimental Setup] §4 (Experimental Setup): the non-IID partitions are generated via dataset splitting and client assignment. The manuscript does not include ablation or diagnostic experiments that quantify how well these partitions reproduce real institutional differences in modalities, documentation patterns, or population-level label skew. If the induced heterogeneity is milder than authentic silos, the observed closeness to centralized training may not generalize.

    Authors: The non-IID partitions are generated by label-based stratification and client assignment to simulate common forms of institutional heterogeneity, such as differences in label distributions, which is a widely used method in federated learning literature. We will expand the experimental setup section to provide more details on the partitioning process and include an ablation study that varies the degree of non-IIDness to demonstrate the robustness of our findings. We acknowledge that a full diagnostic comparison to real-world institutional data silos is not feasible within this study due to privacy regulations preventing access to such multi-site datasets; however, our approach aligns with standard benchmarks in the field. revision: partial

  3. Referee: [Results] Results tables/figures: full hyperparameter schedules, random seeds, and training curves are not reported. Without these details it is difficult to reproduce the exact federated-versus-centralized gaps or to assess sensitivity to the chosen non-IID degree.

    Authors: We will add the complete hyperparameter schedules, the specific random seeds employed for each experiment, and representative training curves to the appendix or as supplementary material in the revised version. This will facilitate reproducibility and allow readers to assess sensitivity to the non-IID configurations. revision: yes
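The multi-seed comparison promised in response 1 reduces to a paired t statistic over per-seed accuracies. A minimal sketch with hypothetical numbers (not results from the paper):

```python
import math

def paired_t(a, b):
    """Paired t statistic for per-seed accuracy pairs
    (e.g. federated vs. centralized runs sharing a seed)."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)

# Hypothetical per-seed accuracies on one task; 5 seeds
federated   = [0.712, 0.705, 0.718, 0.709, 0.714]
centralized = [0.721, 0.715, 0.724, 0.718, 0.720]
t = paired_t(federated, centralized)
```

With n seeds the statistic has n-1 degrees of freedom; in practice one would use a library routine such as a paired t-test from a statistics package and report the p-value alongside mean ± standard deviation per regime.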

standing simulated objections not resolved
  • The request for ablation or diagnostic experiments that quantify how well the non-IID partitions reproduce real institutional differences in modalities, documentation patterns, or population-level label skew, as this would require access to authentic multi-institutional private datasets which are unavailable due to privacy constraints.

Circularity Check

0 steps flagged

No circularity: empirical benchmark rests on direct performance comparisons

full rationale

The paper is an empirical benchmark study comparing federated PEFT (LoRA, QLoRA, IA3) against centralized and isolated baselines on four datasets under non-IID partitions. All reported results are observed accuracy/efficiency metrics from explicit training runs; no equations derive predictions from fitted parameters, no self-definitional quantities appear, and no load-bearing self-citations or uniqueness theorems are invoked. The non-IID construction via dataset partitioning is an explicit experimental choice whose outcomes are measured rather than presupposed, keeping the derivation chain self-contained against external baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard federated-learning assumptions about secure aggregation and on the representativeness of the chosen datasets and non-IID partitions; no new free parameters or invented entities are introduced.

axioms (2)
  • domain assumption Model updates can be aggregated without leaking private data
    Implicit in the description of the Sherpa.ai Federated Learning platform and the federated fine-tuning framework.
  • domain assumption The four selected datasets and non-IID splits reflect realistic institutional heterogeneity
    Stated as the evaluation setting but not independently verified in the abstract.

pith-pipeline@v0.9.0 · 5651 in / 1275 out tokens · 48100 ms · 2026-05-15T04:52:35.149189+00:00 · methodology

discussion (0)

