Will LLMs Scaling Hit the Wall? Breaking Barriers via Distributed Resources on Massive Edge Devices

Chao Wu; Didi Zhu; Fei Wu; Tao Shen; Zexi Li; Ziyu Zhao

arxiv: 2503.08223 · v3 · submitted 2025-03-11 · 💻 cs.DC

Tao Shen, Didi Zhu, Ziyu Zhao, Zexi Li, Chao Wu, Fei Wu

Pith reviewed 2026-05-23 01:00 UTC · model grok-4.3

classification 💻 cs.DC

keywords large language models, scaling laws, edge computing, distributed learning, federated learning, data scarcity, AI democratization

The pith

Massive edge devices can supply the data and compute to keep scaling large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that scaling laws for foundation models are running into two hard limits: exhaustion of high-quality public data and the concentration of required compute power in the hands of a few large organizations. It identifies the vast unused data and processing capacity sitting on billions of edge devices as an alternative resource pool. Recent progress in distributed and federated learning is presented as the practical bridge that turns these scattered devices into a workable training fabric. If the approach holds, ordinary users with small devices could contribute directly to training large models rather than remaining passive consumers.

Core claim

By collaborating across massive numbers of edge devices, the two bottlenecks of data scarcity and centralized compute monopolies can be bypassed, enabling continued scaling of large language models through distributed training that lets anyone with a small device participate.

What carries the argument

Distributed and federated learning applied to the collective data and compute resources of massive edge devices, which together provide both additional training examples and parallel processing capacity without requiring single-site data centers.
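As an illustration of the mechanism named above, here is a minimal sketch of federated averaging (FedAvg), a standard federated learning algorithm. It is not taken from the paper: the toy linear model, the function names `local_update`, `fed_avg`, and `run_rounds`, and the learning rate are all assumptions chosen for illustration. Each simulated device trains on its own private data, and only model weights are shared and averaged.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, float]  # (x, y) pair held privately on one device


def local_update(weights: List[float], data: Sequence[Sample],
                 lr: float = 0.1, epochs: int = 1) -> List[float]:
    """Hypothetical on-device step: SGD on a toy model y ~ w*x + b."""
    w, b = weights
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y  # prediction error on one sample
            w -= lr * err * x      # gradient step for the slope
            b -= lr * err          # gradient step for the intercept
    return [w, b]


def fed_avg(client_weights: Sequence[Sequence[float]],
            client_sizes: Sequence[int]) -> List[float]:
    """Average client weights, weighting each client by its sample count."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(cw[i] * n for cw, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


def run_rounds(clients: Sequence[Sequence[Sample]],
               rounds: int = 300) -> List[float]:
    """Repeat: broadcast global weights, train locally, aggregate."""
    global_w = [0.0, 0.0]
    for _ in range(rounds):
        updates = [local_update(list(global_w), data) for data in clients]
        sizes = [len(data) for data in clients]
        global_w = fed_avg(updates, sizes)
    return global_w


if __name__ == "__main__":
    # Two devices with non-overlapping x ranges, both drawn from y = 2x + 1.
    clients = [
        [(0.0, 1.0), (0.5, 2.0), (1.0, 3.0)],
        [(1.5, 4.0), (2.0, 5.0)],
    ]
    print(run_rounds(clients))
```

With this noise-free toy data the global weights should approach a slope of 2 and an intercept of 1. Real edge training would add the complications the paper's cited literature addresses, such as partial participation, stragglers, and compressed updates, none of which this sketch models.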

If this is right

High-quality public data no longer sets an absolute ceiling because private data on devices becomes usable.
Compute requirements are spread so that participation is no longer restricted to organizations with massive clusters.
AI model development can involve a wider community, reducing concentration of control.
New coordination mechanisms for data privacy and device incentives become necessary parts of the training pipeline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Coordination overhead and device heterogeneity could still limit the effective scale even if the basic feasibility claim holds.
The same edge resources might also support inference or fine-tuning workloads once the training paradigm is established.
Integration with existing cloud infrastructure would likely be required for orchestration rather than replacing it outright.

Load-bearing premise

Recent technical advances in distributed and federated learning are now sufficient to make reliable, efficient training across billions of heterogeneous edge devices practical.

What would settle it

A controlled large-scale trial in which models trained via edge collaboration achieve materially lower performance or higher effective cost than centralized training on the same total data and compute volume.

Figures

Figures reproduced from arXiv: 2503.08223 by Chao Wu, Didi Zhu, Fei Wu, Tao Shen, Zexi Li, Ziyu Zhao.

**Figure 1.** Trend of Computational Demand for Model Training. (Data source: [38]). Computational demand is growing exponentially. As large-scale AI models like GPT-4 [4], Llama 3 [12], and DeepSeekV3 [11] surpass the trillion-parameter scale, the global AI landscape faces severe computational efficiency challenges. As shown in …

**Figure 2.** Global data volume from 2014 to 2025 and IoT device data volume in 2015 and 2025. (Data sources: Global data volume from [43]; IoT device data volume from [44].) [Plot omitted; legend: Actual Smartphone Data, Predicted Smartphone Dat…]

**Figure 5.** Smartphone Market Share and Computing Power Trends. (Data source: [55]). Edge computing has potential for LLM training. We analyze the performance of smartphone chips, representing typical edge devices, and estimate their overall computing power. To ensure our estimation is as accurate as possible, we base our calculations on the market share data from [55]. We then estimated the total computing power o…

**Figure 6.** Train Large Language Models with Small Edge Devices

Original abstract

The remarkable success of foundation models has been driven by scaling laws, demonstrating that model performance improves predictably with increased training data and model size. However, this scaling trajectory faces two critical challenges: the depletion of high-quality public data, and the prohibitive computational power required for larger models, which have been monopolized by tech giants. These two bottlenecks pose significant obstacles to the further development of AI. In this position paper, we argue that leveraging massive distributed edge devices can break through these barriers. We reveal the vast untapped potential of data and computational resources on massive edge devices, and review recent technical advancements in distributed/federated learning that make this new paradigm viable. Our analysis suggests that by collaborating on edge devices, everyone can participate in training large language models with small edge devices. This paradigm shift towards distributed training on edge has the potential to democratize AI development and foster a more inclusive AI community.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a position paper that restates the case for federated edge training of LLMs but adds no new evidence, calculations, or technical results.

The main thing to know is that the paper argues massive numbers of edge devices can overcome the data depletion and compute concentration problems in LLM scaling by drawing on federated learning advances, yet it presents no original analysis to support that claim. It frames the two bottlenecks clearly in the opening sections and notes the untapped resources on phones and other devices. The review of recent distributed learning work is straightforward and covers relevant topics like handling non-IID data and aggregation improvements. That part is competent synthesis and gives credit to existing literature without overclaiming.

The central argument stays at the level of pointing to those prior advances as making edge-scale training viable. No new algorithms, datasets, or derivations appear. There are also no back-of-the-envelope estimates for total available compute, communication overhead at LLM sizes, or expected model quality under real edge conditions. The viability claim therefore rests entirely on the cited external work. The paper does not engage counterpoints such as device energy drain, participation incentives, or regulatory barriers. Citations follow standard patterns for a survey-style position piece and do not show unusual gaps or self-promotion.

This is aimed at readers already thinking about AI access, governance, or decentralization rather than practitioners needing concrete methods. A systems or optimization researcher will not find implementable details or falsifiable predictions. It could reasonably go to peer review in a venue that accepts position papers on architectural or societal shifts in AI, since the framing is coherent and the literature review is accurate, even though the piece itself contains no load-bearing technical contribution that needs referee scrutiny.
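The communication-overhead estimate the letter finds missing can be sketched as simple arithmetic. The figures used below (a 1-billion-parameter model, 2 bytes per parameter, a 10 Mbps uplink, a 100x compression ratio) are hypothetical placeholders, not values from the paper or its references, and the function name `round_upload_seconds` is likewise an assumption for illustration.

```python
def round_upload_seconds(num_params: int, bytes_per_param: float,
                         uplink_mbps: float,
                         compression_ratio: float = 1.0) -> float:
    """Time for one device to upload a full model update in one round.

    All inputs are assumptions supplied by the caller; this ignores
    latency, protocol overhead, and retransmissions.
    """
    payload_bits = num_params * bytes_per_param * 8 / compression_ratio
    return payload_bits / (uplink_mbps * 1_000_000)


if __name__ == "__main__":
    # Hypothetical: 1B parameters in 16-bit precision over a 10 Mbps uplink.
    print(round_upload_seconds(1_000_000_000, 2, 10))         # 1600.0 seconds
    # Same update with a hypothetical 100x compression scheme.
    print(round_upload_seconds(1_000_000_000, 2, 10, 100.0))  # 16.0 seconds
```

Under these assumed numbers an uncompressed update takes roughly 27 minutes per round per device, which is why update compression and infrequent synchronization matter to any viability argument of this kind.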

Referee Report

1 major / 1 minor

Summary. This position paper argues that LLM scaling laws face two barriers—depletion of high-quality public data and monopolization of compute by tech giants—and proposes that massive edge devices can overcome them by providing untapped data and compute resources. It reviews recent advances in distributed and federated learning to claim that collaborative training on small edge devices is now viable, enabling broad participation and democratizing AI development.

Significance. If the reviewed literature indeed establishes viability, the position could meaningfully shift AI development toward inclusive, decentralized paradigms by exploiting edge resources. The manuscript contains no new empirical results, derivations, quantitative scaling projections, or falsifiable predictions, so any significance rests entirely on the interpretive synthesis of prior work rather than original technical contributions.

major comments (1)

[Abstract] Abstract: the central claim that 'recent technical advancements in distributed/federated learning ... make this new paradigm viable' is presented without any manuscript-internal quantitative analysis, scaling-law extrapolation, or independent falsifiable prediction; the viability argument therefore reduces to an untested assertion about external literature.

minor comments (1)

The introduction and review sections would benefit from explicit demarcation between synthesized prior results and any original interpretive claims to improve traceability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review of our position paper. We appreciate the acknowledgment that the work is a synthesis of prior literature rather than an empirical study. Below we respond directly to the single major comment.

Referee: [Abstract] Abstract: the central claim that 'recent technical advancements in distributed/federated learning ... make this new paradigm viable' is presented without any manuscript-internal quantitative analysis, scaling-law extrapolation, or independent falsifiable prediction; the viability argument therefore reduces to an untested assertion about external literature.

Authors: We agree that the manuscript performs no new quantitative analysis, scaling-law extrapolation, or falsifiable predictions of its own; this is inherent to its nature as a position paper whose contribution is interpretive synthesis. The viability claim is explicitly grounded in the cited body of recent distributed and federated learning literature that the paper reviews (e.g., advances addressing communication efficiency, heterogeneity, and privacy that were previously limiting factors for edge-scale training). To make this grounding more transparent to readers, we will revise the abstract and expand the relevant sections to include concise, paper-internal summaries of the quantitative results reported in the key referenced works, thereby strengthening the link between external evidence and the position without altering the paper's scope or adding original experiments.

revision: partial

Circularity Check

0 steps flagged

No significant circularity

This is a position paper whose argument consists of an interpretive review of external distributed/federated learning literature. No equations, derivations, fitted parameters, or quantitative predictions are present that could reduce to self-definitions, fitted inputs renamed as predictions, or self-citation chains. The central claim simply asserts viability based on cited external advancements; no load-bearing step matches any enumerated circularity pattern.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the untested assumption that current federated-learning methods can be extended to the scale, heterogeneity, and incentive structures of consumer edge devices without new fundamental obstacles.

axioms (2)

domain assumption: Edge devices collectively possess sufficient high-quality data and idle compute to substitute for centralized resources.
Invoked in the abstract when stating the vast untapped potential of data and computational resources on massive edge devices.

domain assumption: Technical advancements in distributed/federated learning are already sufficient to make large-scale edge collaboration practical.
Stated directly in the abstract as the basis for viability.

pith-pipeline@v0.9.0 · 5697 in / 1218 out tokens · 24885 ms · 2026-05-23T01:00:00.898941+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

On the Surprising Effectiveness of a Single Global Merging in Decentralized Learning
cs.LG 2025-07 unverdicted novelty 7.0

A single global merge at the final step of decentralized SGD matches the convergence rate of parallel SGD while improving test accuracy under high data heterogeneity.

LLMOrbit: A Circular Taxonomy of Large Language Models -From Scaling Walls to Agentic AI Systems
cs.LG 2026-01 unverdicted novelty 3.0

A survey taxonomy of LLMs identifies three scaling crises and six efficiency paradigms while tracing the shift from generation to tool-using agents.

Reference graph

Works this paper leans on

180 extracted references · 180 canonical work pages · cited by 2 Pith papers · 35 internal anchors

[1]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001
[2]

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[4]

GPT-4 Technical Report

OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020

work page 1901
[6]

PaLM: Scaling Language Modeling with Pathways

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[7]

Carbon Emissions and Large Neural Network Training

David Patterson, Joseph Gonzalez, Quoc V Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud Texier, and Jeff Dean. Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[9]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009

work page 2009
[10]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

Introducing llama 3.1: Our most capable models to date

Meta. Introducing llama 3.1: Our most capable models to date. https://ai.meta.com/blog/meta-llama-3-1/, 2024. Accessed: 2025-01-22

work page 2024
[13]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[14]

Exploring the limits of transfer learning with a unified text-to-text transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:1–67, 2020

work page 2020
[15]

Introduction to federated learning

DeepLearning.AI. Introduction to federated learning. https://www.deeplearning.ai/short-courses/intro-to-federated-learning/, 2024. Accessed: 2025-02-23

work page 2024
[16]

Deduplicating Training Data Makes Language Models Better

Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. Deduplicating training data makes language models better. arXiv preprint arXiv:2107.06499, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[17]

Data governance in the age of large language models

Stella Biderman, Kieran Schoelkopf, Anthony Weiss, and David Noever. Data governance in the age of large language models. arXiv preprint arXiv:2211.09911, 2022.

work page arXiv 2022
[18]

Position: Will we run out of data? limits of llm scaling based on human-generated data

Pablo Villalobos, Anson Ho, Jaime Sevilla, Tamay Besiroglu, Lennart Heim, and Marius Hobbhahn. Position: Will we run out of data? limits of llm scaling based on human-generated data. In International Conference on Machine Learning, pages 49523–49544. PMLR, 2024

work page 2024
[19]

On the diversity of synthetic data and its impact on training large language models

Hao Chen, Abdul Waheed, Xiang Li, Yidong Wang, Jindong Wang, Bhiksha Raj, and Marah I Abdin. On the diversity of synthetic data and its impact on training large language models. arXiv preprint arXiv:2410.15226, 2024

work page arXiv 2024
[20]

Ai produces gibberish when trained on too much ai-generated data, 2024

Emily Wenger. Ai produces gibberish when trained on too much ai-generated data, 2024

work page 2024
[21]

Bias of ai-generated content: an examination of news produced by large language models

Xiao Fang, Shangkun Che, Minjia Mao, Hongzhe Zhang, Ming Zhao, and Xiaohang Zhao. Bias of ai-generated content: an examination of news produced by large language models. Scientific Reports, 14(1):5224, 2024

work page 2024
[22]

General data protection regulation

Protection Regulation. General data protection regulation. Intouch, 25:1–5, 2018

work page 2018
[23]

Are ai scaling laws hitting a wall?

Dean Hardy-White. Are ai scaling laws hitting a wall? https://www.linkedin.com/pulse/ai-scaling-laws-hitting-wall-dean-hardy-white-xchfe/, 2024. Accessed: 2025-01-22

work page 2024
[24]

Introducing grok-3

xAI. Introducing grok-3. https://x.ai/blog/grok-3, 2025. Accessed: 2025-02-23

work page 2025
[25]

Deep learning’s diminishing returns

Neil C Thompson, Kristjan Greenewald, Keeheon Lee, and Gabriel F Manso. Deep learning’s diminishing returns. IEEE Spectrum, 58(10):50–55, 2021

work page 2021
[26]

The cost of training nlp models: A concise overview

Or Sharir, Barak Peleg, and Yoav Shoham. The cost of training nlp models: A concise overview. arXiv preprint arXiv:2004.08900, 2020

work page arXiv 2004
[27]

Green ai

Roy Schwartz, Jesse Dodge, Noah A Smith, and Oren Etzioni. Green ai. Communications of the ACM, 63(12):54–63, 2020

work page 2020
[28]

Artificial intelligence and competition policy

Andrei Hagiu and Julian Wright. Artificial intelligence and competition policy. International Journal of Industrial Organization, page 103134, 2025

work page 2025
[29]

Frontier ai regulation: Managing emerging risks to public safety

Jack Thompson, Amanda Askell, and Jeffrey Song. Frontier ai regulation: Managing emerging risks to public safety. arXiv preprint arXiv:2207.05257, 2022

work page arXiv 2022
[30]

Trends in training dataset sizes

Pablo Villalobos and Anson Ho. Trends in training dataset sizes. Epoch AI Blog, 2022

work page 2022
[31]

Will we run out of data? limits of llm scaling based on human-generated data

Pablo Villalobos, Anson Ho, Jaime Sevilla, Tamay Besiroglu, Lennart Heim, and Marius Hobbhahn. Will we run out of data? limits of llm scaling based on human-generated data. arXiv preprint arXiv:2211.04325, pages 13–29, 2024

work page arXiv 2024
[32]

Compute trends across three eras of machine learning

Jaime Sevilla, Lennart Heim, Anson Ho, Tamay Besiroglu, Marius Hobbhahn, and Pablo Villalobos. Compute trends across three eras of machine learning. In 2022 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2022

work page 2022
[33]

Tinygsm: Achieving 80% on gsm8k with small models

Bingbin Liu, Sébastien Bubeck, Ronen Eldan, Janardhan Kulkarni, Yuanzhi Li, Anh Nguyen, Rachel Ward, and Yi Zhang. Tinygsm: Achieving 80% on gsm8k with small models. arXiv preprint arXiv:2312.09237, 2023

work page arXiv 2023
[34]

The Curse of Recursion: Training on Generated Data Makes Models Forget

Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross Anderson. The curse of recursion: Training on generated data makes models forget. arXiv preprint arXiv:2305.17493, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[35]

Strong model collapse

Elvis Dohmatob, Yunzhen Feng, Arjun Subramonian, and Julia Kempe. Strong model collapse. arXiv preprint arXiv:2410.04840, 2024

work page arXiv 2024
[36]

Self-consuming generative models go mad

Sina Alemohammad, Jose Casco-Rodriguez, Lorenzo Luzi, Ahmed Imtiaz Humayun, Hossein Babaei, Daniel LeJeune, Ali Siahkoohi, and Richard G Baraniuk. Self-consuming generative models go mad. arXiv preprint arXiv:2307.01850, 2023

work page arXiv 2023
[37]

Scaling laws of synthetic images for model training

Li Fan, Kaiming Chen, Dilip Krishnan, Dina Katabi, Phillip Isola, and Yuandong Tian. Scaling laws of synthetic images for model training. arXiv preprint arXiv:2306.09387, 2023.

work page arXiv 2023
[38]

Trends in machine learning hardware,

Marius Hobbhahn, Lennart Heim, and Gökçe Aydos. Trends in machine learning hardware,

work page
[39]

Accessed: 2025-01-27

work page 2025
[40]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

work page 2017
[41]

The end of moore’s law? innovation in computer systems continues

Henry Kressel. The end of moore’s law? innovation in computer systems continues. Artificial Intelligence in Science: Challenges, Opportunities and the Future of Research, 2023

work page 2023
[42]

Apple, nvidia secure future with taiwan semi’s advanced chips as ai demand soars

Benzinga Staff. Apple, nvidia secure future with taiwan semi’s advanced chips as ai demand soars. Benzinga, June 2024

work page 2024
[43]

Ai’s hardware hunger: The global semiconductor supply chain under pressure

ScaleFlux Research. Ai’s hardware hunger: The global semiconductor supply chain under pressure. ScaleFlux Insights, 2024. Accessed: 2025-01-27

work page 2024
[44]

Volume of data/information created, captured, copied, and consumed worldwide from 2010 to 2025, 2023

Statista global data volume. Volume of data/information created, captured, copied, and consumed worldwide from 2010 to 2025, 2023

work page 2010
[45]

Internet of things (iot) connected devices data size worldwide from 2019 to 2025, 2023

Statista IoT device data volume. Internet of things (iot) connected devices data size worldwide from 2019 to 2025, 2023

work page 2019
[46]

Edge computing market size & share analysis report, 2023-2030, 2023

Grand View Research. Edge computing market size & share analysis report, 2023-2030, 2023

work page 2023
[47]

How many smartphones are in the world?, 2023

BankMyCell. How many smartphones are in the world?, 2023

work page 2023
[48]

Dataage white paper: The digitization of the world – from edge to core, 2019

Seagate. Dataage white paper: The digitization of the world – from edge to core, 2019

work page 2019
[49]

Rethink data report 2020, 2020

Seagate. Rethink data report 2020, 2020

work page 2020
[50]

A review on edge analytics: Issues, challenges, opportunities, promises, future directions, and applications

Sabuzima Nayak, Ripon Patgiri, Lilapati Waikhom, and Arif Ahmed. A review on edge analytics: Issues, challenges, opportunities, promises, future directions, and applications. Digital Communications and Networks, 10(3):783–804, 2024

work page 2024
[51]

Edge Computing for IoT, Real-Time Data and Low Latency Processing, 2023

Cavli Wireless. Edge Computing for IoT, Real-Time Data and Low Latency Processing, 2023. Accessed: 2025-01-22

work page 2023
[52]

Small language model as data prospector for large language model

Shiwen Ni, Haihong Wu, Di Yang, Qiang Qu, Hamid Alinejad-Rokny, and Min Yang. Small language model as data prospector for large language model. arXiv preprint arXiv:2412.09990, 2024

work page arXiv 2024
[53]

iphone 16 pro and 16 pro max - technical specifications, 2024

Apple Inc. iphone 16 pro and 16 pro max - technical specifications, 2024

work page 2024
[54]

Nvidia jetson agx orin tflops specifications, 2023

NVIDIA. Nvidia jetson agx orin tflops specifications, 2023. Forum discussion clarifying sparse vs. dense TFLOPS

work page 2023
[55]

NanoReview.net - Gadget Specifications and Comparisons

NanoReview.net. NanoReview.net - Gadget Specifications and Comparisons. https://nanoreview.net, 2025. Accessed: 2025-02-23

work page 2025
[56]

Canalys Newsroom - Market Analysis and Research

Canalys. Canalys Newsroom - Market Analysis and Research. https://canalys.com/newsroom, 2025. Accessed: 2025-02-23

work page 2025
[57]

Small language models: Survey, measurements, and insights

Zhenyan Lu, Xiang Li, Dongqi Cai, Rongjie Yi, Fangming Liu, Xiwen Zhang, Nicholas D Lane, and Mengwei Xu. Small language models: Survey, measurements, and insights. arXiv preprint arXiv:2409.15790, 2024

work page arXiv 2024
[58]

A comprehensive survey of small language models in the era of large language models: Techniques, enhancements, applications, collaboration with llms, and trustworthiness

Fali Wang, Zhiwei Zhang, Xianren Zhang, Zongyu Wu, Tzuhao Mo, Qiuhao Lu, Wanjing Wang, Rui Li, Junjie Xu, Xianfeng Tang, et al. A comprehensive survey of small language models in the era of large language models: Techniques, enhancements, applications, collaboration with llms, and trustworthiness. arXiv preprint arXiv:2411.03350, 2024

work page arXiv 2024
[59]

A survey of small language models

Chien Van Nguyen, Xuan Shen, Ryan Aponte, Yu Xia, Samyadeep Basu, Zhengmian Hu, Jian Chen, Mihir Parmar, Sasidhar Kunapuli, Joe Barrow, et al. A survey of small language models. arXiv preprint arXiv:2410.20011, 2024.

work page arXiv 2024
[60]

Tinybert: Distilling bert for natural language understanding

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. Tinybert: Distilling bert for natural language understanding. arXiv preprint arXiv:1909.10351, 2020

work page arXiv 1909
[61]

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2020

work page internal anchor Pith review Pith/arXiv arXiv 1909
[62]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[63]

The zamba2 suite: Technical report

Paolo Glorioso, Quentin Anthony, Yury Tokpanov, Anna Golubeva, Vasudev Shyam, James Whittington, Jonathan Pilault, and Beren Millidge. The zamba2 suite: Technical report. arXiv preprint arXiv:2411.15242, 2024

work page arXiv 2024
[64]

Hymba: A hybrid-head architecture for small language models

Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck Bilicki, Ziyang Ma, Qingyao Ai, et al. Hymba: A hybrid-head architecture for small language models. arXiv preprint arXiv:2411.13676, 2024

work page arXiv 2024
[65]

xlstm: Extended long short-term memory

Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. xlstm: Extended long short-term memory. arXiv preprint arXiv:2405.04517, 2024

work page arXiv 2024
[66]

SlimPajama: A 627B token cleaned and deduplicated version of RedPajama

Daria Soboleva, Faisal Al-Khateeb, Robert Myers, Jacob R Steeves, Joel Hestness, and Nolan Dey. SlimPajama: A 627B token cleaned and deduplicated version of RedPajama. https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama, 2023

work page 2023
[67]

RWKV: Reinventing RNNs for the Transformer Era

Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Pedro Arcadinho, Eric Cao, Xin Cui, Zihang Dai, Jeff Eissman, Orhan Firat, Sophia Fu, Cong Gao, Yanping Hu, Maarten Hughes, James Kenealy, Maxim Krikun, Sneha Li, Yanping Li, Xiang Liu, Lianmin Luo, David McAllester, Matthew Olson, Alec Patel, Reiner Pope, Noam Rao, Alex Roberts, Noam Shazeer, Aditya S...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[68]

Paloma: A benchmark for evaluating language model fit

Ian Magnusson, Akshita Bhagia, Valentin Hofmann, Luca Soldaini, Ananya Harsh Jha, Oyvind Tafjord, Dustin Schwenk, Evan Walsh, Yanai Elazar, Kyle Lo, et al. Paloma: A benchmark for evaluating language model fit. Advances in Neural Information Processing Systems, 37:64338–64376, 2024

work page 2024
[69]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2019

work page internal anchor Pith review Pith/arXiv arXiv 2019
[70]

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1910
[71]

Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022

[72]

Specializing smaller language models towards multi-step reasoning

Yao Fu, Hao Peng, Ashish Khotilovich, Liang Chen, and Yan Yang. Specializing smaller language models towards multi-step reasoning. arXiv preprint arXiv:2301.12726, 2023

[73]

Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes. arXiv preprint arXiv:2305.02301, 2023

[74]

Exo: Run your own ai cluster at home with everyday devices

Exo Labs. Exo: Run your own ai cluster at home with everyday devices. https://github.com/exo-explore/exo, 2025. Accessed: 2025-01-29

[75]

Neurosurgeon: Collaborative intelligence between the cloud and mobile edge

Yiping Kang, Johann Hauswald, Cao Gao, Andrew M. Rovinski, Trevor Mudge, Jason Mars, and Lingjia Tang. Neurosurgeon: Collaborative intelligence between the cloud and mobile edge. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, pages 615–629. ACM, 2017

[76]

Moe2: Optimizing collaborative inference for edge large language models

Lyudong Jin, Yanning Zhang, Yanhan Li, Shurong Wang, Howard H. Yang, Jian Wu, and Meng Zhang. Moe2: Optimizing collaborative inference for edge large language models. arXiv preprint arXiv:2501.09410, 2025. Submitted to IEEE/ACM Transactions on Networking

[77]

Edge intelligence: On-demand deep learning model co-inference with device-edge synergy

En Li, Zhi Zhou, and Xu Chen. Edge intelligence: On-demand deep learning model co-inference with device-edge synergy. In Proceedings of the 2018 ACM/IEEE Symposium on Edge Computing, pages 31–46. IEEE, 2018

[78]

Galaxy: A resource-efficient collaborative edge ai system for in-situ transformer inference

Shengyuan Ye, Jiangsu Du, Liekang Zeng, Wenzhong Ou, Xiaowen Chu, Yutong Lu, and Xu Chen. Galaxy: A resource-efficient collaborative edge ai system for in-situ transformer inference. arXiv preprint arXiv:2405.17245, 2024

[79]

On-device training under 256kb memory

Ji Lin, Ligeng Zhu, Wei-Ming Chen, Wei-Chen Wang, Chuang Gan, and Song Han. On-device training under 256kb memory. Advances in Neural Information Processing Systems, 35:22941–22954, 2022

[80]

Tinytl: Reduce memory, not parameters for efficient on-device learning

Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. Tinytl: Reduce memory, not parameters for efficient on-device learning. In Advances in Neural Information Processing Systems, volume 33, pages 11285–11297, 2020

[81]

Zerofl: Efficient on-device training for federated learning with local sparsity

Xinchi Qiu, Javier Fernandez-Marques, Pedro PB Gusmao, Yan Gao, Titouan Parcollet, and Nicholas Donald Lane. Zerofl: Efficient on-device training for federated learning with local sparsity. In International Conference on Learning Representations, 2022

[82]

Elasticzo: A memory-efficient on-device learning with combined zeroth- and first-order optimization

Keisuke Sugiura and Hiroki Matsutani. Elasticzo: A memory-efficient on-device learning with combined zeroth- and first-order optimization. arXiv preprint arXiv:2501.04287, 2025


Showing first 80 references.

