arxiv: 2503.08223 · v3 · submitted 2025-03-11 · 💻 cs.DC

Will LLMs Scaling Hit the Wall? Breaking Barriers via Distributed Resources on Massive Edge Devices

Pith reviewed 2026-05-23 01:00 UTC · model grok-4.3

classification 💻 cs.DC
keywords large language models · scaling laws · edge computing · distributed learning · federated learning · data scarcity · AI democratization

The pith

Massive edge devices can supply the data and compute to keep scaling large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that scaling laws for foundation models are running into two hard limits: exhaustion of high-quality public data and the concentration of required compute power in the hands of a few large organizations. It identifies the vast unused data and processing capacity sitting on billions of edge devices as an alternative resource pool. Recent progress in distributed and federated learning is presented as the practical bridge that turns these scattered devices into a workable training fabric. If the approach holds, ordinary users with small devices could contribute directly to training large models rather than remaining passive consumers.

Core claim

By collaborating across massive numbers of edge devices, the two bottlenecks of data scarcity and centralized compute monopolies can be bypassed, enabling continued scaling of large language models through distributed training that lets anyone with a small device participate.

What carries the argument

Distributed and federated learning applied to the collective data and compute resources of massive edge devices, which together provide both additional training examples and parallel processing capacity without requiring single-site data centers.
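
To make the mechanism concrete, here is a minimal sketch of federated averaging (FedAvg), one widely used federated learning scheme. It is an illustration only, not the paper's method: the toy 1-D linear model, the function names (`local_update`, `federated_average`, `run_rounds`), the learning rate, and the sample data are all assumptions chosen for brevity. Each simulated device trains on its own private data, and only model weights are shared and averaged.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, float]  # (x, y) pair held privately on one device


def local_update(weights: Sequence[float], data: Sequence[Sample],
                 lr: float = 0.1, epochs: int = 1) -> List[float]:
    """Run plain SGD for the toy model y = w*x + b on one device's data."""
    w, b = weights
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y   # prediction error on this sample
            w -= lr * err * x       # gradient step for the slope
            b -= lr * err           # gradient step for the intercept
    return [w, b]


def federated_average(client_weights: Sequence[Sequence[float]],
                      client_sizes: Sequence[int]) -> List[float]:
    """Average client weights, weighting each client by its sample count."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


def run_rounds(global_weights: Sequence[float],
               clients: Sequence[Sequence[Sample]],
               rounds: int = 200) -> List[float]:
    """Repeat: broadcast weights, train locally, aggregate on the server."""
    weights = list(global_weights)
    for _ in range(rounds):
        updates = [local_update(weights, data) for data in clients]
        sizes = [len(data) for data in clients]
        weights = federated_average(updates, sizes)
    return weights


if __name__ == "__main__":
    # Hypothetical devices; both hold samples of the same line y = 2x + 1.
    device_a = [(0.0, 1.0), (1.0, 3.0)]
    device_b = [(2.0, 5.0), (3.0, 7.0), (-1.0, -1.0)]
    print(run_rounds([0.0, 0.0], [device_a, device_b]))
```

The raw samples never leave `device_a` or `device_b`; the server only sees weight vectors. A real edge deployment would additionally have to handle the communication cost, device heterogeneity, and privacy concerns that this sketch ignores.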

If this is right

  • High-quality public data no longer sets an absolute ceiling because private data on devices becomes usable.
  • Compute requirements are spread so that participation is no longer restricted to organizations with massive clusters.
  • AI model development can involve a wider community, reducing concentration of control.
  • New coordination mechanisms for data privacy and device incentives become necessary parts of the training pipeline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Coordination overhead and device heterogeneity could still limit the effective scale even if the basic feasibility claim holds.
  • The same edge resources might also support inference or fine-tuning workloads once the training paradigm is established.
  • Integration with existing cloud infrastructure would likely be required for orchestration rather than replacing it outright.

Load-bearing premise

Recent technical advances in distributed and federated learning are now sufficient to make reliable, efficient training across billions of heterogeneous edge devices practical.

What would settle it

A controlled large-scale trial in which models trained via edge collaboration achieve materially lower performance or higher effective cost than centralized training on the same total data and compute volume.

Figures

Figures reproduced from arXiv: 2503.08223 by Chao Wu, Didi Zhu, Fei Wu, Tao Shen, Zexi Li, Ziyu Zhao.

Figure 1: Trend of Computational Demand for Model Training. (Data source: [38]). Computational demand is growing exponentially. As large-scale AI models like GPT-4 [4], Llama 3 [12], and DeepSeek-V3 [11] surpass the trillion-parameter scale, the global AI landscape faces severe computational efficiency challenges.
Figure 2: Global data volume from 2014 to 2025 and IoT device data volume in 2015 and 2025. (Data sources: Global data volume from [43]; IoT device data volume from [44].)
Figure 5: Smartphone Market Share and Computing Power Trends. (Data source: [55]). Edge computing has potential for LLM training. We analyze the performance of smartphone chips, representing typical edge devices, and estimated their overall computing power. To ensure our estimation is as accurate as possible, we based our calculations on the market share data from [55]. We then estimated the total computing power o…
Figure 6: Train Large Language Models with Small Edge Devices
Original abstract

The remarkable success of foundation models has been driven by scaling laws, demonstrating that model performance improves predictably with increased training data and model size. However, this scaling trajectory faces two critical challenges: the depletion of high-quality public data, and the prohibitive computational power required for larger models, which have been monopolized by tech giants. These two bottlenecks pose significant obstacles to the further development of AI. In this position paper, we argue that leveraging massive distributed edge devices can break through these barriers. We reveal the vast untapped potential of data and computational resources on massive edge devices, and review recent technical advancements in distributed/federated learning that make this new paradigm viable. Our analysis suggests that by collaborating on edge devices, everyone can participate in training large language models with small edge devices. This paradigm shift towards distributed training on edge has the potential to democratize AI development and foster a more inclusive AI community.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. This position paper argues that LLM scaling laws face two barriers—depletion of high-quality public data and monopolization of compute by tech giants—and proposes that massive edge devices can overcome them by providing untapped data and compute resources. It reviews recent advances in distributed and federated learning to claim that collaborative training on small edge devices is now viable, enabling broad participation and democratizing AI development.

Significance. If the reviewed literature indeed establishes viability, the position could meaningfully shift AI development toward inclusive, decentralized paradigms by exploiting edge resources. The manuscript contains no new empirical results, derivations, quantitative scaling projections, or falsifiable predictions, so any significance rests entirely on the interpretive synthesis of prior work rather than original technical contributions.

major comments (1)
  1. [Abstract] Abstract: the central claim that 'recent technical advancements in distributed/federated learning ... make this new paradigm viable' is presented without any manuscript-internal quantitative analysis, scaling-law extrapolation, or independent falsifiable prediction; the viability argument therefore reduces to an untested assertion about external literature.
minor comments (1)
  1. The introduction and review sections would benefit from explicit demarcation between synthesized prior results and any original interpretive claims to improve traceability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review of our position paper. We appreciate the acknowledgment that the work is a synthesis of prior literature rather than an empirical study. Below we respond directly to the single major comment.

  1. Referee: [Abstract] Abstract: the central claim that 'recent technical advancements in distributed/federated learning ... make this new paradigm viable' is presented without any manuscript-internal quantitative analysis, scaling-law extrapolation, or independent falsifiable prediction; the viability argument therefore reduces to an untested assertion about external literature.

    Authors: We agree that the manuscript performs no new quantitative analysis, scaling-law extrapolation, or falsifiable predictions of its own; this is inherent to its nature as a position paper whose contribution is interpretive synthesis. The viability claim is explicitly grounded in the cited body of recent distributed and federated learning literature that the paper reviews (e.g., advances addressing communication efficiency, heterogeneity, and privacy that were previously limiting factors for edge-scale training). To make this grounding more transparent to readers, we will revise the abstract and expand the relevant sections to include concise, paper-internal summaries of the quantitative results reported in the key referenced works, thereby strengthening the link between external evidence and the position without altering the paper's scope or adding original experiments. revision: partial

Circularity Check

0 steps flagged

No significant circularity

This is a position paper whose argument consists of an interpretive review of external distributed/federated learning literature. No equations, derivations, fitted parameters, or quantitative predictions are present that could reduce to self-definitions, fitted inputs renamed as predictions, or self-citation chains. The central claim simply asserts viability based on cited external advancements; no load-bearing step matches any enumerated circularity pattern.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the untested assumption that current federated-learning methods can be extended to the scale, heterogeneity, and incentive structures of consumer edge devices without new fundamental obstacles.

axioms (2)
  • domain assumption Edge devices collectively possess sufficient high-quality data and idle compute to substitute for centralized resources.
    Invoked in the abstract when stating the vast untapped potential of data and computational resources on massive edge devices.
  • domain assumption Technical advancements in distributed/federated learning are already sufficient to make large-scale edge collaboration practical.
    Stated directly in the abstract as the basis for viability.

pith-pipeline@v0.9.0 · 5697 in / 1218 out tokens · 24885 ms · 2026-05-23T01:00:00.898941+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. On the Surprising Effectiveness of a Single Global Merging in Decentralized Learning

    cs.LG 2025-07 unverdicted novelty 7.0

    A single global merge at the final step of decentralized SGD matches the convergence rate of parallel SGD while improving test accuracy under high data heterogeneity.

  2. LLMOrbit: A Circular Taxonomy of Large Language Models - From Scaling Walls to Agentic AI Systems

    cs.LG 2026-01 unverdicted novelty 3.0

    A survey taxonomy of LLMs identifies three scaling crises and six efficiency paradigms while tracing the shift from generation to tool-using agents.

Reference graph

Works this paper leans on

180 extracted references · 180 canonical work pages · cited by 2 Pith papers · 35 internal anchors

  1. [1]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

  2. [2]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022

  3. [4]

    GPT-4 Technical Report

    OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  4. [5]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020

  5. [6]

    PaLM: Scaling Language Modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022

  6. [7]

    Carbon Emissions and Large Neural Network Training

    David Patterson, Joseph Gonzalez, Quoc V Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud Texier, and Jeff Dean. Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350, 2021

  7. [9]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009

  8. [10]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  9. [11]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024

  10. [12]

    Introducing llama 3.1: Our most capable models to date

    Meta. Introducing llama 3.1: Our most capable models to date. https://ai.meta.com/blog/meta-llama-3-1/, 2024. Accessed: 2025-01-22

  11. [13]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  12. [14]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:1–67, 2020

  13. [15]

    Introduction to federated learning

    DeepLearning.AI. Introduction to federated learning. https://www.deeplearning.ai/short-courses/intro-to-federated-learning/, 2024. Accessed: 2025-02-23

  14. [16]

    Deduplicating Training Data Makes Language Models Better

    Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. Deduplicating training data makes language models better. arXiv preprint arXiv:2107.06499, 2021

  15. [17]

    Data governance in the age of large language models

    Stella Biderman, Kieran Schoelkopf, Anthony Weiss, and David Noever. Data governance in the age of large language models. arXiv preprint arXiv:2211.09911, 2022

  16. [18]

    Position: Will we run out of data? limits of llm scaling based on human-generated data

    Pablo Villalobos, Anson Ho, Jaime Sevilla, Tamay Besiroglu, Lennart Heim, and Marius Hobbhahn. Position: Will we run out of data? limits of llm scaling based on human-generated data. In International Conference on Machine Learning, pages 49523–49544. PMLR, 2024

  17. [19]

    On the diversity of synthetic data and its impact on training large language models

    Hao Chen, Abdul Waheed, Xiang Li, Yidong Wang, Jindong Wang, Bhiksha Raj, and Marah I Abdin. On the diversity of synthetic data and its impact on training large language models. arXiv preprint arXiv:2410.15226, 2024

  18. [20]

    Ai produces gibberish when trained on too much ai-generated data, 2024

    Emily Wenger. Ai produces gibberish when trained on too much ai-generated data, 2024

  19. [21]

    Bias of ai-generated content: an examination of news produced by large language models

    Xiao Fang, Shangkun Che, Minjia Mao, Hongzhe Zhang, Ming Zhao, and Xiaohang Zhao. Bias of ai-generated content: an examination of news produced by large language models. Scientific Reports, 14(1):5224, 2024

  20. [22]

    General data protection regulation

    Protection Regulation. General data protection regulation. Intouch, 25:1–5, 2018

  21. [23]

    Are ai scaling laws hitting a wall?

    Dean Hardy-White. Are ai scaling laws hitting a wall? https://www.linkedin.com/pulse/ai-scaling-laws-hitting-wall-dean-hardy-white-xchfe/, 2024. Accessed: 2025-01-22

  22. [24]

    Introducing grok-3

    xAI. Introducing grok-3. https://x.ai/blog/grok-3, 2025. Accessed: 2025-02-23

  23. [25]

    Deep learning’s diminishing returns

    Neil C Thompson, Kristjan Greenewald, Keeheon Lee, and Gabriel F Manso. Deep learning’s diminishing returns. IEEE Spectrum, 58(10):50–55, 2021

  24. [26]

    The cost of training nlp models: A concise overview

    Or Sharir, Barak Peleg, and Yoav Shoham. The cost of training nlp models: A concise overview. arXiv preprint arXiv:2004.08900, 2020

  25. [27]

    Green ai

    Roy Schwartz, Jesse Dodge, Noah A Smith, and Oren Etzioni. Green ai. Communications of the ACM, 63(12):54–63, 2020

  26. [28]

    Artificial intelligence and competition policy

    Andrei Hagiu and Julian Wright. Artificial intelligence and competition policy. International Journal of Industrial Organization, page 103134, 2025

  27. [29]

    Frontier ai regulation: Managing emerging risks to public safety

    Jack Thompson, Amanda Askell, and Jeffrey Song. Frontier ai regulation: Managing emerging risks to public safety. arXiv preprint arXiv:2207.05257, 2022

  28. [30]

    Trends in training dataset sizes

    Pablo Villalobos and Anson Ho. Trends in training dataset sizes. Epoch AI Blog, 2022

  29. [31]

    Will we run out of data? limits of llm scaling based on human-generated data

    Pablo Villalobos, Anson Ho, Jaime Sevilla, Tamay Besiroglu, Lennart Heim, and Marius Hobbhahn. Will we run out of data? limits of llm scaling based on human-generated data. arXiv preprint arXiv:2211.04325, pages 13–29, 2024

  30. [32]

    Compute trends across three eras of machine learning

    Jaime Sevilla, Lennart Heim, Anson Ho, Tamay Besiroglu, Marius Hobbhahn, and Pablo Villalobos. Compute trends across three eras of machine learning. In 2022 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2022

  31. [33]

    Tinygsm: Achieving 80% on gsm8k with small models

    Bingbin Liu, Sébastien Bubeck, Ronen Eldan, Janardhan Kulkarni, Yuanzhi Li, Anh Nguyen, Rachel Ward, and Yi Zhang. Tinygsm: Achieving 80% on gsm8k with small models. arXiv preprint arXiv:2312.09237, 2023

  32. [34]

    The Curse of Recursion: Training on Generated Data Makes Models Forget

    Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross Anderson. The curse of recursion: Training on generated data makes models forget. arXiv preprint arXiv:2305.17493, 2023

  33. [35]

    Strong model collapse

    Elvis Dohmatob, Yunzhen Feng, Arjun Subramonian, and Julia Kempe. Strong model collapse. arXiv preprint arXiv:2410.04840, 2024

  34. [36]

    Self-consuming generative models go mad

    Sina Alemohammad, Jose Casco-Rodriguez, Lorenzo Luzi, Ahmed Imtiaz Humayun, Hossein Babaei, Daniel LeJeune, Ali Siahkoohi, and Richard G Baraniuk. Self-consuming generative models go mad. arXiv preprint arXiv:2307.01850, 2023

  35. [37]

    Scaling laws of synthetic images for model training

    Li Fan, Kaiming Chen, Dilip Krishnan, Dina Katabi, Phillip Isola, and Yuandong Tian. Scaling laws of synthetic images for model training. arXiv preprint arXiv:2306.09387, 2023

  36. [38]

    Trends in machine learning hardware,

    Marius Hobbhahn, Lennart Heim, and Gökçe Aydos. Trends in machine learning hardware,

  37. [39]

    Accessed: 2025-01-27

  38. [40]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  39. [41]

    The end of moore’s law? innovation in computer systems continues

    Henry Kressel. The end of moore’s law? innovation in computer systems continues. Artificial Intelligence in Science: Challenges, Opportunities and the Future of Research, 2023

  40. [42]

    Apple, nvidia secure future with taiwan semi’s advanced chips as ai demand soars

    Benzinga Staff. Apple, nvidia secure future with taiwan semi’s advanced chips as ai demand soars. Benzinga, June 2024

  41. [43]

    Ai’s hardware hunger: The global semiconductor supply chain under pressure

    ScaleFlux Research. Ai’s hardware hunger: The global semiconductor supply chain under pressure. ScaleFlux Insights, 2024. Accessed: 2025-01-27

  42. [44]

    Volume of data/information created, captured, copied, and consumed worldwide from 2010 to 2025, 2023

    Statista global data volume. Volume of data/information created, captured, copied, and consumed worldwide from 2010 to 2025, 2023

  43. [45]

    Internet of things (iot) connected devices data size worldwide from 2019 to 2025, 2023

    Statista IoT device data volume. Internet of things (iot) connected devices data size worldwide from 2019 to 2025, 2023

  44. [46]

    Edge computing market size & share analysis report, 2023-2030, 2023

    Grand View Research. Edge computing market size & share analysis report, 2023-2030, 2023

  45. [47]

    How many smartphones are in the world?, 2023

    BankMyCell. How many smartphones are in the world?, 2023

  46. [48]

    Dataage white paper: The digitization of the world – from edge to core, 2019

    Seagate. Dataage white paper: The digitization of the world – from edge to core, 2019

  47. [49]

    Rethink data report 2020, 2020

    Seagate. Rethink data report 2020, 2020

  48. [50]

    A review on edge analytics: Issues, challenges, opportunities, promises, future directions, and applications

    Sabuzima Nayak, Ripon Patgiri, Lilapati Waikhom, and Arif Ahmed. A review on edge analytics: Issues, challenges, opportunities, promises, future directions, and applications. Digital Communications and Networks, 10(3):783–804, 2024

  49. [51]

    Edge Computing for IoT, Real-Time Data and Low Latency Processing, 2023

    Cavli Wireless. Edge Computing for IoT, Real-Time Data and Low Latency Processing, 2023. Accessed: 2025-01-22

  50. [52]

    Small language model as data prospector for large language model

    Shiwen Ni, Haihong Wu, Di Yang, Qiang Qu, Hamid Alinejad-Rokny, and Min Yang. Small language model as data prospector for large language model. arXiv preprint arXiv:2412.09990, 2024

  51. [53]

    iphone 16 pro and 16 pro max - technical specifications, 2024

    Apple Inc. iphone 16 pro and 16 pro max - technical specifications, 2024

  52. [54]

    Nvidia jetson agx orin tflops specifications, 2023

    NVIDIA. Nvidia jetson agx orin tflops specifications, 2023. Forum discussion clarifying sparse vs. dense TFLOPS

  53. [55]

    NanoReview.net - Gadget Specifications and Comparisons

    NanoReview.net. NanoReview.net - Gadget Specifications and Comparisons. https://nanoreview.net, 2025. Accessed: 2025-02-23

  54. [56]

    Canalys Newsroom - Market Analysis and Research

    Canalys. Canalys Newsroom - Market Analysis and Research. https://canalys.com/newsroom, 2025. Accessed: 2025-02-23

  55. [57]

    Small language models: Survey, measurements, and insights

    Zhenyan Lu, Xiang Li, Dongqi Cai, Rongjie Yi, Fangming Liu, Xiwen Zhang, Nicholas D Lane, and Mengwei Xu. Small language models: Survey, measurements, and insights. arXiv preprint arXiv:2409.15790, 2024

  56. [58]

    A comprehensive survey of small language models in the era of large language models: Techniques, enhancements, applications, collaboration with llms, and trustworthiness

    Fali Wang, Zhiwei Zhang, Xianren Zhang, Zongyu Wu, Tzuhao Mo, Qiuhao Lu, Wanjing Wang, Rui Li, Junjie Xu, Xianfeng Tang, et al. A comprehensive survey of small language models in the era of large language models: Techniques, enhancements, applications, collaboration with llms, and trustworthiness. arXiv preprint arXiv:2411.03350, 2024

  57. [59]

    A survey of small language models

    Chien Van Nguyen, Xuan Shen, Ryan Aponte, Yu Xia, Samyadeep Basu, Zhengmian Hu, Jian Chen, Mihir Parmar, Sasidhar Kunapuli, Joe Barrow, et al. A survey of small language models. arXiv preprint arXiv:2410.20011, 2024

  58. [60]

    Tinybert: Distilling bert for natural language understanding

    Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. Tinybert: Distilling bert for natural language understanding. arXiv preprint arXiv:1909.10351, 2020

  59. [61]

    ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

    Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2020

  60. [62]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023

  61. [63]

    The zamba2 suite: Technical report

    Paolo Glorioso, Quentin Anthony, Yury Tokpanov, Anna Golubeva, Vasudev Shyam, James Whittington, Jonathan Pilault, and Beren Millidge. The zamba2 suite: Technical report. arXiv preprint arXiv:2411.15242, 2024

  62. [64]

    Hymba: A hybrid-head architecture for small language models

    Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck Bilicki, Ziyang Ma, Qingyao Ai, et al. Hymba: A hybrid-head architecture for small language models. arXiv preprint arXiv:2411.13676, 2024

  63. [65]

    xlstm: Extended long short-term memory

    Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. xlstm: Extended long short-term memory. arXiv preprint arXiv:2405.04517, 2024

  64. [66]

    SlimPajama: A 627B token cleaned and deduplicated version of RedPajama

    Daria Soboleva, Faisal Al-Khateeb, Robert Myers, Jacob R Steeves, Joel Hestness, and Nolan Dey. SlimPajama: A 627B token cleaned and deduplicated version of RedPajama. https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama, 2023

  65. [67]

    RWKV: Reinventing RNNs for the Transformer Era

    Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Pedro Arcadinho, Eric Cao, Xin Cui, Zihang Dai, Jeff Eissman, Orhan Firat, Sophia Fu, Cong Gao, Yanping Hu, Maarten Hughes, James Kenealy, Maxim Krikun, Sneha Li, Yanping Li, Xiang Liu, Lianmin Luo, David McAllester, Matthew Olson, Alec Patel, Reiner Pope, Noam Rao, Alex Roberts, Noam Shazeer, Aditya S...

  66. [68]

    Paloma: A benchmark for evaluating language model fit

    Ian Magnusson, Akshita Bhagia, Valentin Hofmann, Luca Soldaini, Ananya Harsh Jha, Oyvind Tafjord, Dustin Schwenk, Evan Walsh, Yanai Elazar, Kyle Lo, et al. Paloma: A benchmark for evaluating language model fit. Advances in Neural Information Processing Systems, 37:64338–64376, 2024

  67. [69]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2019

  68. [70]

    DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

    Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019

  69. [71]

    Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity

    William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022

  70. [72]

    Specializing smaller language models towards multi-step reasoning

    Yao Fu, Hao Peng, Ashish Khotilovich, Liang Chen, and Yan Yang. Specializing smaller language models towards multi-step reasoning. arXiv preprint arXiv:2301.12726, 2023

  71. [73]

    Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

    Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes. arXiv preprint arXiv:2305.02301, 2023

  72. [74]

    Exo: Run your own ai cluster at home with everyday devices

    Exo Labs. Exo: Run your own ai cluster at home with everyday devices. https://github.com/exo-explore/exo, 2025. Accessed: 2025-01-29

  73. [75]

    Neurosurgeon: Collaborative intelligence between the cloud and mobile edge

    Yiping Kang, Johann Hauswald, Cao Gao, Andrew M. Rovinski, Trevor Mudge, Jason Mars, and Lingjia Tang. Neurosurgeon: Collaborative intelligence between the cloud and mobile edge. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, pages 615–629. ACM, 2017

  74. [76]

    Moe2: Optimizing collaborative inference for edge large language models

    Lyudong Jin, Yanning Zhang, Yanhan Li, Shurong Wang, Howard H. Yang, Jian Wu, and Meng Zhang. Moe2: Optimizing collaborative inference for edge large language models. arXiv preprint arXiv:2501.09410, 2025. Submitted to IEEE/ACM Transactions on Networking

  75. [77]

    Edge intelligence: On-demand deep learning model co-inference with device-edge synergy

    En Li, Zhi Zhou, and Xu Chen. Edge intelligence: On-demand deep learning model co-inference with device-edge synergy. In Proceedings of the 2018 ACM/IEEE Symposium on Edge Computing, pages 31–46. IEEE, 2018

  76. [78]

    Galaxy: A resource-efficient collaborative edge ai system for in-situ transformer inference

    Shengyuan Ye, Jiangsu Du, Liekang Zeng, Wenzhong Ou, Xiaowen Chu, Yutong Lu, and Xu Chen. Galaxy: A resource-efficient collaborative edge ai system for in-situ transformer inference. arXiv preprint arXiv:2405.17245, 2024

  77. [79]

    On-device training under 256kb memory

    Ji Lin, Ligeng Zhu, Wei-Ming Chen, Wei-Chen Wang, Chuang Gan, and Song Han. On-device training under 256kb memory. Advances in Neural Information Processing Systems, 35:22941–22954, 2022

  78. [80]

    Tinytl: Reduce memory, not parameters for efficient on-device learning

    Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. Tinytl: Reduce memory, not parameters for efficient on-device learning. In Advances in Neural Information Processing Systems, volume 33, pages 11285–11297, 2020

  79. [81]

    Zerofl: Efficient on-device training for federated learning with local sparsity

    Xinchi Qiu, Javier Fernandez-Marques, Pedro PB Gusmao, Yan Gao, Titouan Parcollet, and Nicholas Donald Lane. Zerofl: Efficient on-device training for federated learning with local sparsity. In International Conference on Learning Representations, 2022

  80. [82]

    Elasticzo: A memory-efficient on-device learning with combined zeroth- and first-order optimization

    Keisuke Sugiura and Hiroki Matsutani. Elasticzo: A memory-efficient on-device learning with combined zeroth- and first-order optimization. arXiv preprint arXiv:2501.04287, 2025

Showing first 80 references.