Will LLMs Scaling Hit the Wall? Breaking Barriers via Distributed Resources on Massive Edge Devices
Pith reviewed 2026-05-23 01:00 UTC · model grok-4.3
The pith
Massive edge devices can supply the data and compute to keep scaling large language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Collaboration across massive numbers of edge devices can bypass the two bottlenecks of data scarcity and centralized compute monopolies, enabling continued scaling of large language models through distributed training in which anyone with a small device can participate.
What carries the argument
Distributed and federated learning applied to the collective data and compute resources of massive edge devices, which together provide both additional training examples and parallel processing capacity without requiring single-site data centers.
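The mechanism named above can be illustrated with a small sketch. This page does not say which aggregation rule the paper has in mind, so the example below assumes plain federated averaging (FedAvg): each device trains locally on its own private data, and a coordinator averages the resulting models weighted by local example count. The linear model, the function names (`local_sgd`, `fedavg`, `train`), and the toy device data are all hypothetical and chosen only to keep the example self-contained.

```python
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], float]  # (features, target)


def local_sgd(weights: Sequence[float], data: Sequence[Example],
              lr: float = 0.05, epochs: int = 1) -> List[float]:
    """Run plain SGD on one device for a linear model with squared error."""
    w = list(weights)
    for _ in range(epochs):
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            # Gradient of 0.5 * err**2 with respect to w is err * x.
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w


def fedavg(client_weights: Sequence[Sequence[float]],
           client_sizes: Sequence[int]) -> List[float]:
    """Average client models, weighting each by its local example count."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[j] * n for w, n in zip(client_weights, client_sizes)) / total
        for j in range(dim)
    ]


def train(clients: Sequence[Sequence[Example]], dim: int,
          rounds: int = 50) -> List[float]:
    """Alternate local training and server-side averaging for `rounds` rounds."""
    global_w = [0.0] * dim
    for _ in range(rounds):
        # Raw examples never leave a device; only model weights are shared.
        updates = [local_sgd(global_w, data) for data in clients]
        global_w = fedavg(updates, [len(data) for data in clients])
    return global_w


if __name__ == "__main__":
    # Two hypothetical devices holding private samples of y = 2*x0 + 1*x1.
    device_a = [((1.0, 0.0), 2.0), ((0.0, 1.0), 1.0)]
    device_b = [((1.0, 1.0), 3.0), ((2.0, 1.0), 5.0), ((1.0, 2.0), 4.0)]
    print(train([device_a, device_b], dim=2))
```

The sketch leaves out everything that makes the edge setting hard in practice, including device dropout, communication cost, heterogeneous hardware, and privacy protection for the shared weights. Those are the points on which the viability claim depends.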
If this is right
- High-quality public data no longer sets an absolute ceiling because private data on devices becomes usable.
- Compute requirements are spread so that participation is no longer restricted to organizations with massive clusters.
- AI model development can involve a wider community, reducing concentration of control.
- New coordination mechanisms for data privacy and device incentives become necessary parts of the training pipeline.
Where Pith is reading between the lines
- Coordination overhead and device heterogeneity could still limit the effective scale even if the basic feasibility claim holds.
- The same edge resources might also support inference or fine-tuning workloads once the training paradigm is established.
- Integration with existing cloud infrastructure would likely be required for orchestration rather than replacing it outright.
Load-bearing premise
Recent technical advances in distributed and federated learning are now sufficient to make reliable, efficient training across billions of heterogeneous edge devices practical.
What would settle it
A controlled large-scale trial in which models trained via edge collaboration achieve materially lower performance or higher effective cost than centralized training on the same total data and compute volume.
Original abstract
The remarkable success of foundation models has been driven by scaling laws, demonstrating that model performance improves predictably with increased training data and model size. However, this scaling trajectory faces two critical challenges: the depletion of high-quality public data, and the prohibitive computational power required for larger models, which have been monopolized by tech giants. These two bottlenecks pose significant obstacles to the further development of AI. In this position paper, we argue that leveraging massive distributed edge devices can break through these barriers. We reveal the vast untapped potential of data and computational resources on massive edge devices, and review recent technical advancements in distributed/federated learning that make this new paradigm viable. Our analysis suggests that by collaborating on edge devices, everyone can participate in training large language models with small edge devices. This paradigm shift towards distributed training on edge has the potential to democratize AI development and foster a more inclusive AI community.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This position paper argues that LLM scaling laws face two barriers—depletion of high-quality public data and monopolization of compute by tech giants—and proposes that massive edge devices can overcome them by providing untapped data and compute resources. It reviews recent advances in distributed and federated learning to claim that collaborative training on small edge devices is now viable, enabling broad participation and democratizing AI development.
Significance. If the reviewed literature indeed establishes viability, the position could meaningfully shift AI development toward inclusive, decentralized paradigms by exploiting edge resources. The manuscript contains no new empirical results, derivations, quantitative scaling projections, or falsifiable predictions, so any significance rests entirely on the interpretive synthesis of prior work rather than original technical contributions.
major comments (1)
- [Abstract] Abstract: the central claim that 'recent technical advancements in distributed/federated learning ... make this new paradigm viable' is presented without any manuscript-internal quantitative analysis, scaling-law extrapolation, or independent falsifiable prediction; the viability argument therefore reduces to an untested assertion about external literature.
minor comments (1)
- The introduction and review sections would benefit from explicit demarcation between synthesized prior results and any original interpretive claims to improve traceability.
Simulated Author's Rebuttal
We thank the referee for the detailed review of our position paper. We appreciate the acknowledgment that the work is a synthesis of prior literature rather than an empirical study. Below we respond directly to the single major comment.
- Referee: [Abstract] Abstract: the central claim that 'recent technical advancements in distributed/federated learning ... make this new paradigm viable' is presented without any manuscript-internal quantitative analysis, scaling-law extrapolation, or independent falsifiable prediction; the viability argument therefore reduces to an untested assertion about external literature.
  Authors: We agree that the manuscript contains no new quantitative analysis, scaling-law extrapolation, or falsifiable predictions of its own; this is inherent to its nature as a position paper whose contribution is interpretive synthesis. The viability claim is explicitly grounded in the body of recent distributed and federated learning literature that the paper reviews (e.g., advances addressing communication efficiency, heterogeneity, and privacy, which were previously limiting factors for edge-scale training). To make this grounding more transparent to readers, we will revise the abstract and expand the relevant sections to include concise, paper-internal summaries of the quantitative results reported in the key referenced works. This strengthens the link between external evidence and the position without altering the paper's scope or adding original experiments.
  Revision: partial
Circularity Check
No significant circularity
This is a position paper whose argument consists of an interpretive review of external distributed/federated learning literature. No equations, derivations, fitted parameters, or quantitative predictions are present that could reduce to self-definitions, fitted inputs renamed as predictions, or self-citation chains. The central claim simply asserts viability based on cited external advancements; no load-bearing step matches any enumerated circularity pattern.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Edge devices collectively possess sufficient high-quality data and idle compute to substitute for centralized resources.
- domain assumption: Technical advancements in distributed/federated learning are already sufficient to make large-scale edge collaboration practical.
Forward citations
Cited by 2 Pith papers
- On the Surprising Effectiveness of a Single Global Merging in Decentralized Learning
  A single global merge at the final step of decentralized SGD matches the convergence rate of parallel SGD while improving test accuracy under high data heterogeneity.
- LLMOrbit: A Circular Taxonomy of Large Language Models - From Scaling Walls to Agentic AI Systems
  A survey taxonomy of LLMs identifies three scaling crises and six efficiency paradigms while tracing the shift from generation to tool-using agents.