LlamaWeb is a WebGPU backend for llama.cpp that uses static memory planning, tunable kernels, and templated multi-precision support to cut memory use by 29-33% and raise decode throughput by 45-69% versus prior browser frameworks on tested hardware.
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
6 Pith papers cite this work. Polarity classification is still indexing.
abstract
The rising popularity of intelligent mobile devices and the daunting computational cost of deep learning-based models call for efficient and accurate on-device inference schemes. We propose a quantization scheme that allows inference to be carried out using integer-only arithmetic, which can be implemented more efficiently than floating point inference on commonly available integer-only hardware. We also co-design a training procedure to preserve end-to-end model accuracy post quantization. As a result, the proposed quantization scheme improves the tradeoff between accuracy and on-device latency. The improvements are significant even on MobileNets, a model family known for run-time efficiency, and are demonstrated in ImageNet classification and COCO detection on popular CPUs.
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
A systematic review of on-device AI inference security finds defenses are imbalanced, with roughly half focused on IP theft while one-third of attacks (adversarial examples) lack any associated defenses.
LoKA enables practical FP8 use in numerically sensitive large recommendation models via online profiling of activations, reusable model modifications for stability, and dynamic kernel dispatching.
Industry AI practitioners view model quality through nine attributes with context-dependent priorities, where data imbalance is a key challenge addressed by strategies like active learning, as confirmed by interviews and a follow-up survey.
Diffusion coding model CoDA shows smaller accuracy drops than Qwen3-1.7B under 2-4 bit quantization on HumanEval and MBPP.
citing papers explorer
-
Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU
LlamaWeb is a WebGPU backend for llama.cpp that uses static memory planning, tunable kernels, and templated multi-precision support to cut memory use by 29-33% and raise decode throughput by 45-69% versus prior browser frameworks on tested hardware.
-
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
-
Protecting On-Device AI Inference: A Systematic Review of Attacks and Defence Mechanisms
A systematic review of on-device AI inference security finds defenses are imbalanced, with roughly half focused on IP theft while one-third of attacks (adversarial examples) lack any associated defenses.
-
LoKA: Low-precision Kernel Applications for Recommendation Models At Scale
LoKA enables practical FP8 use in numerically sensitive large recommendation models via online profiling of activations, reusable model modifications for stability, and dynamic kernel dispatching.
-
Industry Practitioners Perspectives on AI Model Quality: Perceptions, Challenges, and Solutions
Industry AI practitioners view model quality through nine attributes with context-dependent priorities, where data imbalance is a key challenge addressed by strategies like active learning, as confirmed by interviews and a follow-up survey.
-
On the Quantization Robustness of Diffusion Language Models in Coding Benchmarks
Diffusion coding model CoDA shows smaller accuracy drops than Qwen3-1.7B under 2-4 bit quantization on HumanEval and MBPP.