MobileLLM-Flash: Latency-Guided On-Device LLM Design for Industry Scale Deployment

Hanxian Huang, Igor Fedorov, Andrey Gromov, Bernard Beckerman, Naveen Suda, David Eriksson, Maximilian Balandat, Rylan Conway, Patrick Huber, Chinnadhurai Sankar, Ayushi Dalmia, Zechun Liu, Lemeng Wu, Tarek Elgamal, Adithya Sagar, Vikas Chandra, Raghuraman Krishnamoorthi

classification: cs.LG, cs.AI

keywords: latency, models, attention, deployment, design, mobile, mobilellm-flash, on-device

Real-time AI experiences call for on-device large language models (OD-LLMs) optimized for efficient deployment on resource-constrained hardware. The most useful OD-LLMs produce near-real-time responses and exhibit broad hardware compatibility, maximizing user reach. We present a methodology for designing such models using hardware-in-the-loop architecture search under mobile latency constraints. This system is amenable to industry-scale deployment: it generates models deployable without custom kernels and compatible with standard mobile runtimes like ExecuTorch. Our methodology avoids specialized attention mechanisms and instead uses attention skipping for long-context acceleration. Our approach jointly optimizes model architecture (layers, dimensions) and attention pattern. To efficiently evaluate candidates, we treat each as a pruned version of a pretrained backbone with inherited weights, thereby achieving high accuracy with minimal continued pretraining. We leverage the low cost of latency evaluation in a staged process: learning an accurate latency model first, then searching for the Pareto frontier across latency and quality. This yields MobileLLM-Flash, a family of foundation models (350M, 650M, 1.4B) for efficient on-device use with strong capabilities, supporting up to 8k context length. MobileLLM-Flash delivers up to 1.8x faster prefill and up to 1.6x faster decode on mobile CPUs with comparable or superior quality. Our analysis of Pareto-frontier design choices offers actionable principles for OD-LLM design.
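
The final stage described in the abstract, searching for the Pareto frontier across latency and quality, can be illustrated with a minimal sketch. This is not the authors' search procedure: the `Candidate` type, the `pareto_frontier` helper, and every number below are hypothetical, and the sketch only shows how non-dominated candidates might be selected once a latency estimate and a quality score are available for each one.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    # Hypothetical record for one searched architecture.
    name: str
    latency_ms: float  # estimated latency, lower is better
    quality: float     # evaluation score, higher is better


def pareto_frontier(candidates: List[Candidate]) -> List[Candidate]:
    """Return the candidates that no other candidate dominates.

    A candidate is dominated if another one is at least as fast and at
    least as accurate, and strictly better on one of the two axes.
    """
    # Sort by latency ascending; on ties, put the higher quality first.
    ordered = sorted(candidates, key=lambda c: (c.latency_ms, -c.quality))
    frontier: List[Candidate] = []
    best_quality = float("-inf")
    for c in ordered:
        # Everything already seen is at least as fast, so c survives
        # only if it strictly improves on the best quality so far.
        if c.quality > best_quality:
            frontier.append(c)
            best_quality = c.quality
    return frontier


# Made-up values for illustration only; not results from the paper.
candidates = [
    Candidate("a", 100.0, 0.50),
    Candidate("b", 120.0, 0.48),  # slower and worse than "a"
    Candidate("c", 150.0, 0.55),  # same latency as "e", lower quality
    Candidate("d", 90.0, 0.40),
    Candidate("e", 150.0, 0.60),
]

frontier_names = [c.name for c in pareto_frontier(candidates)]
print(frontier_names)
```

Sorting first keeps the selection to a single pass over the candidates, which matters when latency estimates are cheap and the candidate pool is large.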
