pith. machine review for the scientific record.

arxiv: 2605.10886 · v2 · submitted 2026-05-11 · 💻 cs.LG · cs.AI

Recognition: no theorem link

LoKA: Low-precision Kernel Applications for Recommendation Models At Scale

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 04:52 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords FP8 · low-precision · recommendation models · GEMM · kernel optimization · numerical stability · model co-design · profiling

The pith

The LoKA framework makes FP8 practical for large recommendation models by profiling for safe sites, adapting model components, and dispatching kernels at runtime.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large recommendation models resist direct FP8 adoption because small matrix multiplications and normalization steps make them numerically sensitive, and because their training runs in communication-heavy environments. LoKA addresses this with three linked steps: first profiling activations and weights under realistic data distributions to measure per-layer errors, then applying reusable model adaptations that stabilize calculations and boost speed, and finally using a runtime to pick the fastest kernel that meets accuracy targets. This system-model co-design expands the usable FP8 regions beyond what standalone kernels can achieve and shortens training time without new hardware.
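
To ground the numerics, here is a minimal PyTorch sketch of per-tensor FP8 quantization: round-trip a tensor through float8_e4m3fn and measure the relative error, the kind of per-layer signal a profiler could act on. The scaling scheme, shapes, and error metric are illustrative assumptions, not the paper's implementation.

    import torch

    def fp8_roundtrip(x: torch.Tensor, dtype=torch.float8_e4m3fn) -> torch.Tensor:
        """Quantize x to FP8 with a per-tensor scale, then dequantize back."""
        fp8_max = torch.finfo(dtype).max                  # 448.0 for e4m3
        scale = x.abs().max().clamp(min=1e-12) / fp8_max  # map max magnitude into range
        return (x / scale).to(dtype).to(x.dtype) * scale

    # Small GEMM operands typical of LRM layers: per-layer round-trip error.
    for dim in (64, 256, 1024):
        w = torch.randn(dim, dim)
        err = (fp8_roundtrip(w) - w).norm() / w.norm()
        print(f"{dim}x{dim}: relative L2 error = {err.item():.4f}")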

Core claim

LoKA is a framework that integrates three components: LoKA Probe, a statistically grounded online method that learns activation and weight statistics to quantify per-layer errors and mark safe versus unsafe FP8 sites; LoKA Mods, a set of reusable adaptations that improve numerical stability and execution efficiency under FP8; and LoKA Dispatch, a runtime that uses the profiling data to select the fastest compliant FP8 kernel for each operation.

What carries the argument

LoKA Probe, the statistically grounded online benchmarking method that learns activation and weight statistics under realistic distributions and quantifies per-layer errors to identify safe FP8 sites.
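
A minimal sketch of what such online profiling could look like, assuming Welford/Chan-style running moments and reusing fp8_roundtrip from the sketch above; the paper's actual statistics, error metric, and safety threshold are not specified in the text reviewed here.

    import torch

    class LayerProbe:
        """Running activation statistics plus worst-case FP8 round-trip error
        for one candidate site. Metrics and thresholds here are illustrative."""
        def __init__(self):
            self.n, self.mean, self.m2 = 0, 0.0, 0.0  # running first/second moments
            self.max_rel_err = 0.0

        def observe(self, x: torch.Tensor) -> None:
            v = x.detach().float().flatten()
            # Merge this batch's moments into the running totals (parallel Welford).
            n_b, mean_b = v.numel(), v.mean().item()
            m2_b = ((v - mean_b) ** 2).sum().item()
            n = self.n + n_b
            delta = mean_b - self.mean
            self.m2 += m2_b + delta ** 2 * self.n * n_b / n
            self.mean += delta * n_b / n
            self.n = n
            # Track the worst relative error FP8 has produced at this site so far.
            rel = (fp8_roundtrip(x) - x).norm() / x.norm().clamp(min=1e-12)
            self.max_rel_err = max(self.max_rel_err, rel.item())

        def is_safe(self, tol: float = 0.02) -> bool:  # tolerance is a made-up example
            return self.max_rel_err < tol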

If this is right

  • FP8 can be applied to more operations inside LRMs once safe sites are located by realistic profiling.
  • Model adaptations expand the regions where low precision remains stable and efficient.
  • Runtime kernel selection delivers the highest throughput while satisfying accuracy constraints (see the sketch after this list).
  • Overall training throughput rises while model quality stays comparable to higher-precision runs.
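
A sketch of the selection logic the dispatch bullet implies: benchmark each candidate kernel, discard any that misses the accuracy target against a high-precision reference, and keep the fastest survivor. The candidate set, tolerance, and timing loop are stand-ins; LoKA Dispatch's actual integration with FP8 kernel libraries is not described at this level of detail.

    import time
    import torch

    def pick_kernel(candidates, x, w, tol=0.02):
        """Return the fastest kernel whose output stays within tol of FP64."""
        ref = (x.double() @ w.double()).float()
        best_name, best_time = None, float("inf")
        for name, kernel in candidates.items():
            out = kernel(x, w).float()
            if ((out - ref).norm() / ref.norm()).item() > tol:
                continue                    # fails the accuracy constraint
            start = time.perf_counter()
            for _ in range(10):             # crude timing; a real runtime does better
                kernel(x, w)
            elapsed = time.perf_counter() - start
            if elapsed < best_time:
                best_name, best_time = name, elapsed
        return best_name                    # None if nothing meets the target

    x, w = torch.randn(256, 64), torch.randn(64, 128)
    candidates = {
        "bf16": lambda a, b: a.bfloat16() @ b.bfloat16(),
        "fp32": lambda a, b: a @ b,
    }
    print(pick_kernel(candidates, x, w))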

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Periodic re-profiling may be needed if training data distributions drift over many epochs.
  • Energy use in large-scale recommendation training could fall as FP8 replaces higher precision in more layers.
  • Hardware designers might prioritize better FP8 support and mixed-precision scheduling for recommendation workloads.
  • The same profiling-plus-adaptation pattern could be tested on other numerically sensitive models such as those used in ranking or retrieval tasks.

Load-bearing premise

The statistical profiling from LoKA Probe accurately identifies all safe FP8 sites without missing interactions or distribution shifts that would degrade overall model quality during full training.

What would settle it

A complete training run of a large recommendation model under LoKA's FP8 configuration that shows a measurable drop in final model quality metrics relative to the FP16 or FP32 baseline.

Figures

Figures reproduced from arXiv: 2605.10886 by Buyun Zhang, Chunqiang Tang, Chunzhi Yang, Ellie Wen, Jian Jiao, Jiecao Yu, Liang Luo, Maxim Naumov, Quanyu Zhu, Sandeep Parab, Santanu Kolay, Shen Li, Tongyi Tang, Vasiliy Kuznetsov, Venkatesh Ranganathan, Wenlin Chen, Xiaohan Wei, Yanli Zhao, Yantao Yao, Yinbin Ma, Yuchen Hao, Yuxin Chen, Zeliang Chen.

Figure 2
A typical model architecture of an LRM. view at source ↗
Figure 1
LoKA overview. We present LoKA (Low-precision Kernel Applications), a framework designed to unlock the benefits of FP8 and emerging precisions for large-scale recommendation models. LoKA is built on top of three principles … view at source ↗
Figure 3
Significant Relative Log Loss (top) and throughput (bottom) degradation … view at source ↗
Figure 4
Compute throughput ablation of low-precision kernels on representative … view at source ↗
Figure 5
LoKA Probe learns and stores necessary parameters online for offline … view at source ↗
Figure 6
Typical behaviors of bias norm of … in Wukong training. Biases can … view at source ↗
Figure 7
BlockNorm design. view at source ↗
Figure 8
Hard Swish and BlockNorm with sufficiently large block size converges … view at source ↗
Figure 9
Lossless full-trajectory FP8 training of Wukong, Interformer and … view at source ↗
Figure 10
End-to-end speedup of LoKA training (left) and inference (right). view at source ↗
Figure 11
End-to-end latency breakdown of training with and without LoKA. view at source ↗
Figure 12
Scalability of LoKA on Wukong training, varying number of GPUs. N/A: configuration invalid. view at source ↗
Figure 13
Assessing LoKA Mods' effectiveness on reducing latency of common … view at source ↗
read the original abstract

Recent GPU generations deliver significantly higher FLOPs using lower-precision arithmetic, such as FP8. While successfully applied to large language models (LLMs), its adoption in large recommendation models (LRMs) has been limited. This is because LRMs are numerically sensitive, dominated by small matrix multiplications (GEMMs) followed by normalization, and trained in communication-intensive environments. Applying FP8 directly to LRMs often degrades model quality and prolongs training time. These challenges are inherent to LRM workloads and cannot be resolved merely by introducing better FP8 kernels. Instead, a system-model co-design approach is needed to successfully integrate FP8. We present LoKA (Low-precision Kernel Applications), a framework that makes FP8 practical for LRMs through three principles: profile under realistic distributions to know where low precision is safe, co-design model components with hardware to expand where it is safe, and orchestrate across kernel libraries to maximize the gains. Concretely, LoKA Probe is a statistically grounded, online benchmarking method that learns activation and weight statistics, and quantifies per-layer errors. This process pinpoints safe and unsafe, fast and slow sites for FP8 adoption. LoKA Mods is a set of reusable model adaptations that improve both numerical stability and execution efficiency with FP8. LoKA Dispatch is a runtime that leverages the statistical insights from LoKA Probe to select the fastest FP8 kernel that satisfies the accuracy requirements.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces LoKA, a framework for applying FP8 low-precision arithmetic to large recommendation models (LRMs). It relies on three principles: LoKA Probe for online statistical profiling of activations/weights to identify safe FP8 sites via per-layer error quantification, LoKA Mods for reusable model adaptations that enhance numerical stability and efficiency, and LoKA Dispatch for runtime selection of the fastest compliant FP8 kernel. The approach targets LRM-specific issues including numerical sensitivity, small GEMMs followed by normalization, and communication-heavy training environments.

Significance. If the profiling and adaptations prove robust, the work could meaningfully advance low-precision adoption in production-scale recommendation systems by delivering efficiency gains without quality degradation. The co-design emphasis on realistic distributions and hardware-aware modifications offers a structured alternative to kernel-only solutions and may generalize to other sensitive workloads.

major comments (2)
  1. [LoKA Probe description] The central claim that LoKA Probe correctly identifies all safe FP8 sites rests on per-layer statistical benchmarking, yet the description supplies no validation that isolated per-layer error bounds translate to stable end-to-end model quality; cumulative propagation through the embedding-to-logit path and SGD-induced distribution shifts are unaddressed.
  2. [LoKA Mods and LoKA Dispatch] LoKA Mods and Dispatch presuppose that profiled sites remain safe throughout full training runs, but no experiments or analysis demonstrate that the adaptations prevent quality loss under realistic LRM training dynamics (e.g., long-horizon SGD with inter-layer normalization dependencies).
minor comments (2)
  1. The abstract states that LoKA Probe 'quantifies per-layer errors' but does not define the error metric (e.g., relative L2, maximum absolute deviation) or the acceptance threshold used to classify sites as safe.
  2. Clarify how 'fast and slow sites' are distinguished during profiling and whether this classification incorporates both arithmetic throughput and communication costs.
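
To make the first minor comment concrete, here are two plausible readings of "quantifies per-layer errors"; which metric the paper uses, and at what threshold, is exactly what the referee asks to have defined.

    import torch

    def rel_l2(ref: torch.Tensor, approx: torch.Tensor) -> float:
        # Scale-invariant aggregate error over the whole tensor.
        return ((approx - ref).norm() / ref.norm().clamp(min=1e-12)).item()

    def max_abs_dev(ref: torch.Tensor, approx: torch.Tensor) -> float:
        # Worst-case elementwise error; sensitive to isolated outliers.
        return (approx - ref).abs().max().item()

    x = torch.randn(512, 512)
    q = x.bfloat16().float()  # stand-in for a low-precision round-trip
    print(rel_l2(x, q), max_abs_dev(x, q))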

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on validation gaps in LoKA Probe and the need for evidence under full training dynamics. We agree these points require strengthening and will revise the manuscript with additional end-to-end experiments and long-horizon analysis while preserving the core co-design contributions.

read point-by-point responses
  1. Referee: [LoKA Probe description] The central claim that LoKA Probe correctly identifies all safe FP8 sites rests on per-layer statistical benchmarking, yet the description supplies no validation that isolated per-layer error bounds translate to stable end-to-end model quality; cumulative propagation through the embedding-to-logit path and SGD-induced distribution shifts are unaddressed.

    Authors: We appreciate this observation. LoKA Probe employs conservative per-layer error quantification under realistic activation distributions precisely to bound potential propagation effects, and its online profiling is meant to track SGD-induced shifts. However, the manuscript does not include explicit end-to-end validation showing that per-layer decisions preserve full-model quality across the embedding-to-logit path. In revision we will add full-training experiments comparing LoKA-enabled models against FP16 baselines, with measurements of cumulative error and quality metrics at multiple training checkpoints. revision: yes

  2. Referee: [LoKA Mods and LoKA Dispatch] LoKA Mods and Dispatch presuppose that profiled sites remain safe throughout full training runs, but no experiments or analysis demonstrate that the adaptations prevent quality loss under realistic LRM training dynamics (e.g., long-horizon SGD with inter-layer normalization dependencies).

    Authors: This is a fair critique. While LoKA Mods are designed to improve numerical stability for normalization-heavy small GEMMs and Dispatch enforces accuracy constraints at runtime, the current text lacks dedicated long-horizon experiments. We will incorporate ablation studies and training curves over extended SGD runs that explicitly track inter-layer normalization dependencies and demonstrate that the combined adaptations maintain model quality without degradation. revision: yes

Circularity Check

0 steps flagged

No circularity: LoKA relies on empirical profiling and co-design without self-referential derivations

full rationale

The paper describes a practical systems framework consisting of LoKA Probe for statistical online benchmarking of activation/weight distributions and per-layer errors, LoKA Mods for model adaptations that improve FP8 stability, and LoKA Dispatch for runtime kernel selection. No equations, uniqueness theorems, or fitted parameters are presented that reduce the central claims to their own inputs by construction. The approach is grounded in external empirical measurements and hardware co-design rather than self-definition or self-citation chains, leaving the derivation chain free of circular dependence.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 3 invented entities

The framework rests on hardware assumptions about FP8 speedups and the feasibility of targeted model adaptations, with new components introduced without external validation in the provided text.

axioms (2)
  • domain assumption FP8 delivers significant speedups on modern GPUs for GEMM operations
    Invoked as the motivation for adopting low precision in the abstract.
  • ad hoc to paper LRM numerical sensitivity can be mitigated by localized model adaptations without global quality loss
    Central to the LoKA Mods component and the claim that co-design expands safe FP8 regions.
invented entities (3)
  • LoKA Probe no independent evidence
    purpose: Online statistical benchmarking to quantify per-layer FP8 errors under realistic distributions
    New profiling method introduced to identify safe FP8 sites.
  • LoKA Mods no independent evidence
    purpose: Reusable model adaptations that improve FP8 numerical stability and efficiency
    New set of modifications proposed as part of the co-design.
  • LoKA Dispatch no independent evidence
    purpose: Runtime selector that chooses fastest FP8 kernel meeting accuracy requirements
    New orchestration component leveraging profiling insights.

pith-pipeline@v0.9.0 · 5642 in / 1500 out tokens · 48319 ms · 2026-05-15T04:52:02.893368+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · 14 internal anchors
