pith. machine review for the scientific record.

arXiv: 2508.16703 · v4 · submitted 2025-08-22 · 💻 cs.PF · cs.AI · cs.LG

Recognition: unknown

ShadowNPU: System and Algorithm Co-design for NPU-Centric On-Device LLM Inference

Daliang Xu, Gang Huang, Mengwei Xu, Wangsong Yin, Xuanzhe Liu

classification 💻 cs.PF · cs.AI · cs.LG
keywords attention, shadowattn, compute, frameworks, on-device, performance, resource, system

Running Large Language Models (LLMs) on-device is now a critical enabler of user privacy. We observe that, in state-of-the-art frameworks, the attention operator falls back from the special-purpose NPU to the general-purpose CPU/GPU because of its quantization sensitivity. This fallback degrades the user experience and complicates system scheduling. To this end, this paper presents shadowAttn, a system-algorithm co-designed sparse attention module that minimizes reliance on the CPU/GPU by computing attention on only a tiny portion of tokens. The key idea is to hide the overhead of estimating the important tokens with an NPU-based pilot compute. Further, shadowAttn proposes techniques such as NPU compute graph bucketing, a head-wise NPU-CPU/GPU pipeline, and a per-head fine-grained sparsity ratio to achieve high accuracy and efficiency. shadowAttn delivers the best performance with highly limited CPU/GPU resources; it requires far fewer CPU/GPU resources to deliver performance on par with SoTA frameworks.
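The general pattern the abstract describes (a cheap, low-precision pass estimates which tokens matter, then full-precision attention runs only on those tokens) can be illustrated with a minimal pure-Python sketch for a single query and a single head. This is not the paper's implementation. The function names, the symmetric int8 quantization standing in for the NPU pilot compute, and the top-k selection rule are all assumptions made for illustration.

```python
import math

def quantize_int8(vec):
    """Symmetric per-vector int8 quantization (assumed stand-in for NPU precision)."""
    scale = max(abs(x) for x in vec) / 127.0 or 1.0  # avoid a zero scale
    return [round(x / scale) for x in vec], scale

def pilot_scores(q, keys):
    """Low-precision estimate of q.k for every key (the hypothetical 'pilot' pass)."""
    q_i8, q_scale = quantize_int8(q)
    scores = []
    for k in keys:
        k_i8, k_scale = quantize_int8(k)
        dot = sum(a * b for a, b in zip(q_i8, k_i8))
        scores.append(dot * q_scale * k_scale)
    return scores

def sparse_attention(q, keys, values, keep):
    """Full-precision attention restricted to the `keep` tokens the pilot ranks highest."""
    est = pilot_scores(q, keys)
    top = sorted(range(len(keys)), key=lambda i: est[i], reverse=True)[:keep]
    top.sort()  # restore original token order
    d = len(q)
    logits = [sum(a * b for a, b in zip(q, keys[i])) / math.sqrt(d) for i in top]
    m = max(logits)  # subtract the max for a numerically stable softmax
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for w, i in zip(weights, top):
        for j, v in enumerate(values[i]):
            out[j] += (w / total) * v
    return out, top

# Toy usage: four tokens, keep the two the pilot pass ranks highest.
q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.9, 0.1]]
values = [[1.0], [2.0], [3.0], [4.0]]
out, top = sparse_attention(q, keys, values, keep=2)
print(top, out)
```

In this toy input the pilot pass selects tokens 0 and 3, the two keys most aligned with the query, so the exact softmax touches only half of the tokens. In the system the abstract describes, the estimation would run on the NPU and the sparse exact step on the CPU/GPU. The sketch runs both on the CPU and does not model the pipelining, graph bucketing, or per-head sparsity ratios.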

This paper has not been read by Pith yet.


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs

    cs.LG 2026-04 unverdicted novelty 6.0

    NPUMoE accelerates MoE LLM inference on Apple Silicon NPUs via offline-calibrated static expert tiers, grouped execution, and load-aware graph residency, delivering 1.32x-5.55x lower latency and 1.81x-7.37x better ene...

  2. EdgeFlow: Fast Cold Starts for LLMs on Mobile Devices

    cs.OS 2026-04 unverdicted novelty 6.0

    EdgeFlow reduces mobile LLM cold-start latency up to 4.07x versus llama.cpp, MNN, and llm.npu by NPU-aware adaptive quantization, SIMD-friendly packing, and synergistic granular CPU-NPU pipelining at comparable accuracy.