pith. machine review for the scientific record.

arxiv: 2605.10886 · v2 · submitted 2026-05-11 · 💻 cs.LG · cs.AI

Recognition: no theorem link

LoKA: Low-precision Kernel Applications for Recommendation Models At Scale

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 04:52 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords FP8 · low-precision · recommendation models · GEMM · kernel optimization · numerical stability · model co-design · profiling

The pith

The LoKA framework makes FP8 practical for large recommendation models by profiling for safe sites, adapting model components, and dispatching kernels at runtime.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large recommendation models resist direct FP8 adoption because small matrix multiplications and normalization steps make them numerically sensitive, and because their training runs in communication-heavy environments. LoKA addresses this with three linked steps: first profiling activations and weights under realistic data distributions to measure per-layer errors, then applying reusable model adaptations that stabilize calculations and boost speed, and finally using a runtime to pick the fastest kernel that meets accuracy targets. This system-model co-design expands the usable FP8 regions beyond what standalone kernels can achieve and shortens training time without new hardware.
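
To ground the numerics, here is a minimal PyTorch sketch of per-tensor FP8 quantization: round-trip a tensor through float8_e4m3fn and measure the relative error, the kind of per-layer signal a profiler could act on. The scaling scheme, shapes, and error metric are illustrative assumptions, not the paper's implementation.

    import torch

    def fp8_roundtrip(x: torch.Tensor, dtype=torch.float8_e4m3fn) -> torch.Tensor:
        """Quantize x to FP8 with a per-tensor scale, then dequantize back."""
        fp8_max = torch.finfo(dtype).max                  # 448.0 for e4m3
        scale = x.abs().max().clamp(min=1e-12) / fp8_max  # map max magnitude into range
        return (x / scale).to(dtype).to(x.dtype) * scale

    # Small GEMM operands typical of LRM layers: per-layer round-trip error.
    for dim in (64, 256, 1024):
        w = torch.randn(dim, dim)
        err = (fp8_roundtrip(w) - w).norm() / w.norm()
        print(f"{dim}x{dim}: relative L2 error = {err.item():.4f}")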

Core claim

LoKA is a framework that integrates three components: LoKA Probe, a statistically grounded online method that learns activation and weight statistics to quantify per-layer errors and mark safe versus unsafe FP8 sites; LoKA Mods, a set of reusable adaptations that improve numerical stability and execution efficiency under FP8; and LoKA Dispatch, a runtime that uses the profiling data to select the fastest compliant FP8 kernel for each operation.

What carries the argument

LoKA Probe, the statistically grounded online benchmarking method that learns activation and weight statistics under realistic distributions and quantifies per-layer errors to identify safe FP8 sites.
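
A minimal sketch of what such online profiling could look like, assuming Welford/Chan-style running moments and reusing fp8_roundtrip from the sketch above; the paper's actual statistics, error metric, and safety threshold are not specified in the text reviewed here.

    import torch

    class LayerProbe:
        """Running activation statistics plus worst-case FP8 round-trip error
        for one candidate site. Metrics and thresholds here are illustrative."""
        def __init__(self):
            self.n, self.mean, self.m2 = 0, 0.0, 0.0  # running first/second moments
            self.max_rel_err = 0.0

        def observe(self, x: torch.Tensor) -> None:
            v = x.detach().float().flatten()
            # Merge this batch's moments into the running totals (parallel Welford).
            n_b, mean_b = v.numel(), v.mean().item()
            m2_b = ((v - mean_b) ** 2).sum().item()
            n = self.n + n_b
            delta = mean_b - self.mean
            self.m2 += m2_b + delta ** 2 * self.n * n_b / n
            self.mean += delta * n_b / n
            self.n = n
            # Track the worst relative error FP8 has produced at this site so far.
            rel = (fp8_roundtrip(x) - x).norm() / x.norm().clamp(min=1e-12)
            self.max_rel_err = max(self.max_rel_err, rel.item())

        def is_safe(self, tol: float = 0.02) -> bool:  # tolerance is a made-up example
            return self.max_rel_err < tol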

If this is right

  • FP8 can be applied to more operations inside LRMs once safe sites are located by realistic profiling.
  • Model adaptations expand the regions where low precision remains stable and efficient.
  • Runtime kernel selection delivers the highest throughput while satisfying accuracy constraints (see the sketch after this list).
  • Overall training throughput rises while model quality stays comparable to higher-precision runs.
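
A sketch of the selection logic the dispatch bullet implies: benchmark each candidate kernel, discard any that misses the accuracy target against a high-precision reference, and keep the fastest survivor. The candidate set, tolerance, and timing loop are stand-ins; LoKA Dispatch's actual integration with FP8 kernel libraries is not described at this level of detail.

    import time
    import torch

    def pick_kernel(candidates, x, w, tol=0.02):
        """Return the fastest kernel whose output stays within tol of FP64."""
        ref = (x.double() @ w.double()).float()
        best_name, best_time = None, float("inf")
        for name, kernel in candidates.items():
            out = kernel(x, w).float()
            if ((out - ref).norm() / ref.norm()).item() > tol:
                continue                    # fails the accuracy constraint
            start = time.perf_counter()
            for _ in range(10):             # crude timing; a real runtime does better
                kernel(x, w)
            elapsed = time.perf_counter() - start
            if elapsed < best_time:
                best_name, best_time = name, elapsed
        return best_name                    # None if nothing meets the target

    x, w = torch.randn(256, 64), torch.randn(64, 128)
    candidates = {
        "bf16": lambda a, b: a.bfloat16() @ b.bfloat16(),
        "fp32": lambda a, b: a @ b,
    }
    print(pick_kernel(candidates, x, w))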

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Periodic re-profiling may be needed if training data distributions drift over many epochs.
  • Energy use in large-scale recommendation training could fall as FP8 replaces higher precision in more layers.
  • Hardware designers might prioritize better FP8 support and mixed-precision scheduling for recommendation workloads.
  • The same profiling-plus-adaptation pattern could be tested on other numerically sensitive models such as those used in ranking or retrieval tasks.

Load-bearing premise

The statistical profiling from LoKA Probe accurately identifies all safe FP8 sites without missing interactions or distribution shifts that would degrade overall model quality during full training.

What would settle it

A complete training run of a large recommendation model under LoKA's FP8 configuration that shows a measurable drop in final model quality metrics relative to the FP16 or FP32 baseline.

Figures

Figures reproduced from arXiv: 2605.10886 by Buyun Zhang, Chunqiang Tang, Chunzhi Yang, Ellie Wen, Jian Jiao, Jiecao Yu, Liang Luo, Maxim Naumov, Quanyu Zhu, Sandeep Parab, Santanu Kolay, Shen Li, Tongyi Tang, Vasiliy Kuznetsov, Venkatesh Ranganathan, Wenlin Chen, Xiaohan Wei, Yanli Zhao, Yantao Yao, Yinbin Ma, Yuchen Hao, Yuxin Chen, Zeliang Chen.

Figure 2
A typical model architecture of an LRM. view at source ↗
Figure 1
LoKA overview. We present LoKA (Low-precision Kernel Applications), a framework designed to unlock the benefits of FP8 and emerging precisions for large-scale recommendation models. LoKA is built on top of three principles … view at source ↗
Figure 3
Significant Relative Log Loss (top) and throughput (bottom) degradation … view at source ↗
Figure 4
Compute throughput ablation of low-precision kernels on representative … view at source ↗
Figure 5
LoKA Probe learns and stores necessary parameters online for offline … view at source ↗
Figure 6
Typical behaviors of bias norm of … in Wukong training. Biases can … view at source ↗
Figure 7
BlockNorm design. view at source ↗
Figure 8
Hard Swish and BlockNorm with sufficiently large block size converges … view at source ↗
Figure 9
Lossless full-trajectory FP8 training of Wukong, Interformer and … view at source ↗
Figure 10
End-to-end speedup of LoKA training (left) and inference (right). view at source ↗
Figure 11
End-to-end latency breakdown of training with and without LoKA. view at source ↗
Figure 12
Scalability of LoKA on Wukong training, varying number of GPUs. N/A: configuration invalid. view at source ↗
Figure 13
Assessing LoKA Mods' effectiveness on reducing latency of common … view at source ↗
read the original abstract

Recent GPU generations deliver significantly higher FLOPs using lower-precision arithmetic, such as FP8. While successfully applied to large language models (LLMs), its adoption in large recommendation models (LRMs) has been limited. This is because LRMs are numerically sensitive, dominated by small matrix multiplications (GEMMs) followed by normalization, and trained in communication-intensive environments. Applying FP8 directly to LRMs often degrades model quality and prolongs training time. These challenges are inherent to LRM workloads and cannot be resolved merely by introducing better FP8 kernels. Instead, a system-model co-design approach is needed to successfully integrate FP8. We present LoKA (Low-precision Kernel Applications), a framework that makes FP8 practical for LRMs through three principles: profile under realistic distributions to know where low precision is safe, co-design model components with hardware to expand where it is safe, and orchestrate across kernel libraries to maximize the gains. Concretely, LoKA Probe is a statistically grounded, online benchmarking method that learns activation and weight statistics, and quantifies per-layer errors. This process pinpoints safe and unsafe, fast and slow sites for FP8 adoption. LoKA Mods is a set of reusable model adaptations that improve both numerical stability and execution efficiency with FP8. LoKA Dispatch is a runtime that leverages the statistical insights from LoKA Probe to select the fastest FP8 kernel that satisfies the accuracy requirements.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces LoKA, a framework for applying FP8 low-precision arithmetic to large recommendation models (LRMs). It relies on three principles: LoKA Probe for online statistical profiling of activations/weights to identify safe FP8 sites via per-layer error quantification, LoKA Mods for reusable model adaptations that enhance numerical stability and efficiency, and LoKA Dispatch for runtime selection of the fastest compliant FP8 kernel. The approach targets LRM-specific issues including numerical sensitivity, small GEMMs followed by normalization, and communication-heavy training environments.

Significance. If the profiling and adaptations prove robust, the work could meaningfully advance low-precision adoption in production-scale recommendation systems by delivering efficiency gains without quality degradation. The co-design emphasis on realistic distributions and hardware-aware modifications offers a structured alternative to kernel-only solutions and may generalize to other sensitive workloads.

major comments (2)
  1. [LoKA Probe description] The central claim that LoKA Probe correctly identifies all safe FP8 sites rests on per-layer statistical benchmarking, yet the description supplies no validation that isolated per-layer error bounds translate to stable end-to-end model quality; cumulative propagation through the embedding-to-logit path and SGD-induced distribution shifts are unaddressed.
  2. [LoKA Mods and LoKA Dispatch] LoKA Mods and Dispatch presuppose that profiled sites remain safe throughout full training runs, but no experiments or analysis demonstrate that the adaptations prevent quality loss under realistic LRM training dynamics (e.g., long-horizon SGD with inter-layer normalization dependencies).
minor comments (2)
  1. The abstract states that LoKA Probe 'quantifies per-layer errors' but does not define the error metric (e.g., relative L2, maximum absolute deviation) or the acceptance threshold used to classify sites as safe.
  2. Clarify how 'fast and slow sites' are distinguished during profiling and whether this classification incorporates both arithmetic throughput and communication costs.
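
To make the first minor comment concrete, here are two plausible readings of "quantifies per-layer errors"; which metric the paper uses, and at what threshold, is exactly what the referee asks to have defined.

    import torch

    def rel_l2(ref: torch.Tensor, approx: torch.Tensor) -> float:
        # Scale-invariant aggregate error over the whole tensor.
        return ((approx - ref).norm() / ref.norm().clamp(min=1e-12)).item()

    def max_abs_dev(ref: torch.Tensor, approx: torch.Tensor) -> float:
        # Worst-case elementwise error; sensitive to isolated outliers.
        return (approx - ref).abs().max().item()

    x = torch.randn(512, 512)
    q = x.bfloat16().float()  # stand-in for a low-precision round-trip
    print(rel_l2(x, q), max_abs_dev(x, q))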

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on validation gaps in LoKA Probe and the need for evidence under full training dynamics. We agree these points require strengthening and will revise the manuscript with additional end-to-end experiments and long-horizon analysis while preserving the core co-design contributions.

read point-by-point responses
  1. Referee: [LoKA Probe description] The central claim that LoKA Probe correctly identifies all safe FP8 sites rests on per-layer statistical benchmarking, yet the description supplies no validation that isolated per-layer error bounds translate to stable end-to-end model quality; cumulative propagation through the embedding-to-logit path and SGD-induced distribution shifts are unaddressed.

    Authors: We appreciate this observation. LoKA Probe employs conservative per-layer error quantification under realistic activation distributions precisely to bound potential propagation effects, and its online profiling is meant to track SGD-induced shifts. However, the manuscript does not include explicit end-to-end validation showing that per-layer decisions preserve full-model quality across the embedding-to-logit path. In revision we will add full-training experiments comparing LoKA-enabled models against FP16 baselines, with measurements of cumulative error and quality metrics at multiple training checkpoints. revision: yes

  2. Referee: [LoKA Mods and LoKA Dispatch] LoKA Mods and Dispatch presuppose that profiled sites remain safe throughout full training runs, but no experiments or analysis demonstrate that the adaptations prevent quality loss under realistic LRM training dynamics (e.g., long-horizon SGD with inter-layer normalization dependencies).

    Authors: This is a fair critique. While LoKA Mods are designed to improve numerical stability for normalization-heavy small GEMMs and Dispatch enforces accuracy constraints at runtime, the current text lacks dedicated long-horizon experiments. We will incorporate ablation studies and training curves over extended SGD runs that explicitly track inter-layer normalization dependencies and demonstrate that the combined adaptations maintain model quality without degradation. revision: yes

Circularity Check

0 steps flagged

No circularity: LoKA relies on empirical profiling and co-design without self-referential derivations

full rationale

The paper describes a practical systems framework consisting of LoKA Probe for statistical online benchmarking of activation/weight distributions and per-layer errors, LoKA Mods for model adaptations that improve FP8 stability, and LoKA Dispatch for runtime kernel selection. No equations, uniqueness theorems, or fitted parameters are presented that reduce the central claims to their own inputs by construction. The approach is grounded in external empirical measurements and hardware co-design rather than self-definition or self-citation chains, leaving the derivation chain free of circular dependence.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 3 invented entities

The framework rests on hardware assumptions about FP8 speedups and the feasibility of targeted model adaptations, with new components introduced without external validation in the provided text.

axioms (2)
  • domain assumption FP8 delivers significant speedups on modern GPUs for GEMM operations
    Invoked as the motivation for adopting low precision in the abstract.
  • ad hoc to paper LRM numerical sensitivity can be mitigated by localized model adaptations without global quality loss
    Central to the LoKA Mods component and the claim that co-design expands safe FP8 regions.
invented entities (3)
  • LoKA Probe no independent evidence
    purpose: Online statistical benchmarking to quantify per-layer FP8 errors under realistic distributions
    New profiling method introduced to identify safe FP8 sites.
  • LoKA Mods no independent evidence
    purpose: Reusable model adaptations that improve FP8 numerical stability and efficiency
    New set of modifications proposed as part of the co-design.
  • LoKA Dispatch no independent evidence
    purpose: Runtime selector that chooses fastest FP8 kernel meeting accuracy requirements
    New orchestration component leveraging profiling insights.

pith-pipeline@v0.9.0 · 5642 in / 1500 out tokens · 48319 ms · 2026-05-15T04:52:02.893368+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · 14 internal anchors
