pith. sign in

arxiv: 2606.01117 · v1 · pith:3NFOURCFnew · submitted 2026-05-31 · 💻 cs.LG · cs.AI

HASTE: Hardware-Aware Dynamic Sparse Training for Large Output Spaces

Pith reviewed 2026-06-28 17:32 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords extreme multi-label classificationsparse traininghardware-aware sparsityoutput layerXMCfixed fan-ingroup-shared patternslong-tailed labels
0
0 comments X

The pith

Group-shared fixed fan-in sparsity turns arithmetic savings into 4.4× forward and 25× backward speedups for million-label XMC while matching dense precision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that extreme multi-label classification output layers can use a semi-structured sparsity pattern where semantically related labels share the same input connections but keep separate weights. This grouping cuts index memory, boosts feature reuse, and supports custom GPU kernels that deliver actual wall-clock gains rather than just lower FLOPs counts. Splitting the layer into a small dense head for frequent labels and a group-shared sparse tail for the rest supplies informative gradients without auxiliary losses. If correct, models over very large label sets become faster to train and run at inference while staying close to dense accuracy.

Core claim

Group-shared fixed fan-in sparsity is a semi-structured output-layer design in which semantically related labels share a sparse input pattern while retaining independent weights. This design reduces index memory overhead, increases feature reuse across labels, and enables efficient GPU execution via custom CUDA kernels. Combined with a decomposition of the output layer into a small dense head over frequent labels and a group-shared sparse tail over the remainder, the method achieves up to 4.4× speedup in the forward pass and up to 25× speedup in backward passes over standard fixed fan-in sparsity, operating within a few percent of a FLOPs-matched dense bottleneck and matching or improving pr

What carries the argument

group-shared fixed fan-in sparsity: a semi-structured output-layer pattern where semantically related labels share identical sparse input connections but hold independent weights, realized through custom CUDA kernels that exploit modern accelerator primitives.

If this is right

  • Arithmetic reductions translate into up to 4.4× forward and 25× backward wall-clock speedups over standard fixed fan-in sparsity.
  • Precision@k matches or exceeds prior sparse baselines across large-scale XMC benchmarks.
  • The performance gap to dense models narrows while retaining the memory benefits of sparsity.
  • Custom kernels leveraging accelerator primitives achieve efficient execution within a few percent of a FLOPs-matched dense bottleneck.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same grouping principle could extend to other large-output regimes such as language-model vocabularies or recommender systems.
  • Dynamic, learned grouping of labels might further improve results over fixed semantic groupings.
  • The approach suggests that hardware-specific kernel design should be considered earlier in the design of sparse layers rather than as a post-hoc optimization.

Load-bearing premise

Semantically related labels can be grouped to share identical sparse input patterns without meaningful loss in model capacity or gradient quality, and the long-tailed label distribution permits an effective split into a small dense head and a group-shared sparse tail.

What would settle it

A direct wall-clock timing measurement on a standard GPU for a 1-million-label XMC model showing that the custom kernels deliver less than 2× backward speedup over a dense baseline of equal FLOPs.

Figures

Figures reproduced from arXiv: 2606.01117 by Erik Schultheis, Jean Lucien Randrianantenaina, Jinbin Zhang, Nasib Ullah, Rohit Babbar.

Figure 1
Figure 1. Figure 1: Architecture and Performance Overview. (a) Original SPARTEX. (b) Proposed Group-Shared architecture. (c) Performance benchmarks vs. FLOPs-matched dense (bottleneck) baseline. (d) Visual comparison of sparsity patterns showing trade-offs between expressiveness and speed. to a fixed number of input features. This design reduces representational memory overhead and provides uniform load balancing across label… view at source ↗
Figure 2
Figure 2. Figure 2: Comparison of Group-shared Fixed fan-in format with Fixed Fan in sparsity and standard Compressed sparse column (CSC) and COO format 3. Hardware-Aware Sparse Training Problem Setup. We consider extreme multi-label classifica￾tion with L labels and an encoder fθ that maps an input x to a feature representation h = fθ(x) ∈ R H. Predictions are produced by a linear output layer with weights W ∈ R L×H, where L… view at source ↗
Figure 3
Figure 3. Figure 3: Indexing memory overhead as label size increases for different forms of sparsity. Label Grouping. Labels in a group are merged together to ensure that semantically related labels belong to the same group. Since group-shared fixed fan-in sparsity requires labels within a group to share a common fan-in index set, these group-wise labels are encouraged to attend to similar subsets of input features. Concretel… view at source ↗
Figure 4
Figure 4. Figure 4: Kernel benchmarking of forward, gradient of weights and gradient of features for batch size 64, feature size 768, and, label group size of 32 on A100 GPU. (a) kernel wall clock time at different label size from 64K up to 9 millions for fan-in value of 32 (96% sparsity). (b) kernel wall clock time at different sparsity level from 83% to 96% at label size 670K. dense bottleneck our method outperforms. Compar… view at source ↗
Figure 5
Figure 5. Figure 5: Label Group size vs Kernel wall clock time on A100 GPU. Effect of Label Group Size (G). The group size G controls the trade-off between representational flexibility and kernel efficiency. As shown in [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
read the original abstract

Extreme multi-label classification (XMC) involves learning models over large output spaces with millions of labels, making the output layer a memory-compute bottleneck. While sparsity-based methods reduce arithmetic complexity, they often fail to yield proportional speedups due to irregular memory access, poor hardware utilization, or reliance on auxiliary architectural components in long-tailed regimes. We introduce group-shared fixed fan-in sparsity, a semi-structured output-layer design in which semantically related labels share a sparse input pattern while retaining independent weights. This grouping introduces a task-aligned inductive bias -- encouraging related labels to share feature subsets -- while reducing index memory overhead, increasing feature reuse across labels, and enabling efficient GPU execution via custom CUDA kernels that leverage modern accelerator primitives. As an alternative to auxiliary objectives, we exploit the long-tailed structure of XMC by decomposing the output layer into a small dense head over frequent labels and a group-shared sparse tail over the remainder, providing an informative gradient pathway while preserving the memory benefits of sparsity. Through kernel-level microbenchmarking, we show that group-shared fixed fan-in translates arithmetic reductions into practical wall-clock gains, achieving up to $4.4\times$ speedup in the forward pass and up to $25\times$ speedup in backward passes over standard fixed fan-in sparsity, while operating within a few percent of a FLOPs-matched dense bottleneck. Across large-scale XMC benchmarks, our approach matches or improves precision@k over prior sparse baselines, while narrowing the performance gap to dense.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes HASTE, a hardware-aware approach for extreme multi-label classification (XMC) over large output spaces. It introduces group-shared fixed fan-in sparsity, where semantically related labels share identical sparse input patterns (while retaining independent weights), combined with a decomposition of the output layer into a small dense head over frequent labels and a group-shared sparse tail. Custom CUDA kernels are used to translate the sparsity into wall-clock speedups (up to 4.4× forward, 25× backward over standard fixed fan-in), while claiming to operate within a few percent of a FLOPs-matched dense model and to match or exceed prior sparse baselines on precision@k across XMC benchmarks.

Significance. If the empirical claims hold under rigorous controls, the work would demonstrate a practical route from arithmetic sparsity to hardware-efficient execution in large-output models without auxiliary losses, addressing a key bottleneck in XMC and related domains. The explicit use of long-tailed structure and custom kernels for feature reuse is a concrete strength.

major comments (2)
  1. [Abstract] The central claim that group-shared fixed fan-in 'encourages related labels to share feature subsets' and yields the reported precision@k gains rests on the untested assumption that semantically coherent grouping outperforms random or frequency-based alternatives. No ablation or quantitative comparison of grouping strategies is described in the provided text, leaving open whether the inductive bias or simply the reduced index overhead drives the results.
  2. [Abstract] The abstract states that the method operates 'within a few percent of a FLOPs-matched dense bottleneck' and achieves the cited speedups, but provides no details on how the dense baseline is constructed, whether error bars or multiple runs are reported, or how the long-tailed split threshold is chosen. These omissions make it impossible to assess whether the arithmetic-to-wall-clock translation is robust or sensitive to post-hoc choices.
minor comments (2)
  1. [Abstract] The abstract mentions 'large-scale XMC benchmarks' but does not name the specific datasets or prior sparse baselines used for comparison.
  2. [Abstract] Kernel-level microbenchmarking is referenced but no table or figure numbers are given for the 4.4× / 25× figures or the 'few percent' gap to dense.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] The central claim that group-shared fixed fan-in 'encourages related labels to share feature subsets' and yields the reported precision@k gains rests on the untested assumption that semantically coherent grouping outperforms random or frequency-based alternatives. No ablation or quantitative comparison of grouping strategies is described in the provided text, leaving open whether the inductive bias or simply the reduced index overhead drives the results.

    Authors: We agree that an explicit ablation comparing semantic grouping against random and frequency-based alternatives would strengthen the evidence for the inductive bias. The manuscript motivates the grouping via label co-occurrence and embedding similarity to exploit semantic relatedness in XMC, but does not quantify its advantage over alternatives. In the revision we will add a controlled ablation on representative benchmarks reporting precision@k and wall-clock time for semantic, random, and frequency-based groupings under matched sparsity. revision: yes

  2. Referee: [Abstract] The abstract states that the method operates 'within a few percent of a FLOPs-matched dense bottleneck' and achieves the cited speedups, but provides no details on how the dense baseline is constructed, whether error bars or multiple runs are reported, or how the long-tailed split threshold is chosen. These omissions make it impossible to assess whether the arithmetic-to-wall-clock translation is robust or sensitive to post-hoc choices.

    Authors: The full experimental section details the FLOPs-matched dense baseline (hidden dimension adjusted to equalize total FLOPs), reports mean and standard deviation over five random seeds, and selects the long-tailed threshold by cumulative label frequency to retain the top 1% of labels in the dense head. These elements are not summarized in the abstract. We will revise the abstract to briefly note the baseline construction, multi-run reporting, and threshold selection criterion while preserving length constraints. revision: partial

Circularity Check

0 steps flagged

No circularity; claims rest on kernel implementation and empirical validation rather than self-referential fits or definitions.

full rationale

The abstract and description introduce group-shared fixed fan-in sparsity as a design choice with an inductive bias, validated by custom CUDA kernels, microbenchmarking showing 4.4×/25× speedups, and XMC benchmark results matching or improving precision@k. No equations, fitted parameters renamed as predictions, or self-citation chains are present that reduce any central claim to its own inputs by construction. The long-tailed decomposition and grouping are presented as architectural decisions, not derived quantities. This is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; the method rests on the domain assumption that label semantics permit useful grouping and that long-tailed frequency structure is exploitable, but no explicit free parameters or invented entities are stated.

pith-pipeline@v0.9.1-grok · 5809 in / 1187 out tokens · 27330 ms · 2026-06-28T17:32:32.515514+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

40 extracted references · 4 canonical work pages · 1 internal anchor

  1. [1]

    Wulf, Wm. A. and McKee, Sally A. , title =. 1995 , issue_date =. doi:10.1145/216585.216588 , journal =

  2. [2]

    Machine Learning , volume=

    Bonsai: diverse and shallow trees for extreme multi-label classification , author=. Machine Learning , volume=. 2020 , publisher=

  3. [3]

    Machine Learning , volume=

    Data scarcity, robustness and extreme multi-label classification , author=. Machine Learning , volume=. 2019 , publisher=

  4. [4]

    Advances in Neural Information Processing Systems , volume=

    Sign-in to the lottery: Reparameterizing sparse training , author=. Advances in Neural Information Processing Systems , volume=

  5. [5]

    Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining , pages=

    Extreme multi-label loss functions for recommendation, tagging, ranking & other missing label applications , author=. Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining , pages=

  6. [6]

    and Dahiya, K

    Bhatia, K. and Dahiya, K. and Jain, H. and Kar, P. and Mittal, A. and Prabhu, Y. and Varma, M. , title =

  7. [7]

    Proceedings of the ACM on Web Conference 2025 , pages=

    Unidec: Unified dual encoder and classifier training for extreme multi-label classification , author=. Proceedings of the ACM on Web Conference 2025 , pages=

  8. [8]

    The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

    Navigating Extremes: Dynamic Sparsity in Large Output Spaces , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

  9. [9]

    Joint European Conference on Machine Learning and Knowledge Discovery in Databases , pages=

    Towards memory-efficient training for extremely large output spaces--learning with 670k labels on a single commodity gpu , author=. Joint European Conference on Machine Learning and Knowledge Discovery in Databases , pages=. 2023 , organization=

  10. [10]

    Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval , pages=

    Inceptionxml: A lightweight framework with synchronized negative sampling for short text extreme classification , author=. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval , pages=

  11. [11]

    Proceedings of the 2018 World Wide Web Conference , pages=

    Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising , author=. Proceedings of the 2018 World Wide Web Conference , pages=

  12. [12]

    Proceedings of the tenth ACM international conference on web search and data mining , pages=

    Dismec: Distributed sparse machines for extreme multi-label classification , author=. Proceedings of the tenth ACM international conference on web search and data mining , pages=

  13. [13]

    Proceedings of Machine Learning and Systems , volume=

    Renee: End-to-end training of extreme classification models , author=. Proceedings of Machine Learning and Systems , volume=

  14. [14]

    Forty-second International Conference on Machine Learning , year =

    ELMO: Efficiency via Low-precision and Peak Memory Optimization in Large Output Spaces , author=. Forty-second International Conference on Machine Learning , year =

  15. [15]

    Advances in neural information processing systems , volume=

    Attentionxml: Label tree-based attention-aware deep model for high-performance extreme multi-label text classification , author=. Advances in neural information processing systems , volume=

  16. [16]

    Proceedings of the AAAI conference on artificial intelligence , volume=

    Lightxml: Transformer with dynamic negative sampling for high-performance extreme multi-label text classification , author=. Proceedings of the AAAI conference on artificial intelligence , volume=

  17. [17]

    International conference on machine learning , pages=

    Siamesexml: Siamese networks meet extreme classifiers with 100m labels , author=. International conference on machine learning , pages=. 2021 , organization=

  18. [18]

    Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining , pages=

    Taming pretrained transformers for extreme multi-label text classification , author=. Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining , pages=

  19. [19]

    Advances in Neural Information Processing Systems , volume=

    Fast multi-resolution transformer fine-tuning for extreme multi-label text classification , author=. Advances in Neural Information Processing Systems , volume=

  20. [20]

    Advances in neural information processing systems , volume=

    Cascadexml: Rethinking transformers for end-to-end multi-resolution training in extreme multi-label classification , author=. Advances in neural information processing systems , volume=

  21. [21]

    Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining , pages=

    Ngame: Negative mining-aware mini-batching for extreme classification , author=. Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining , pages=

  22. [22]

    Machine Learning , volume=

    Meta-classifier free negative sampling for extreme multilabel classification , author=. Machine Learning , volume=. 2024 , publisher=

  23. [23]

    Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining , pages=

    Deep encoders with auxiliary parameters for extreme classification , author=. Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining , pages=

  24. [24]

    Proceedings of the Web Conference 2021 , pages=

    ECLARE: Extreme classification with label graph correlations , author=. Proceedings of the Web Conference 2021 , pages=

  25. [25]

    International Conference on Machine Learning , pages=

    Dense for the price of sparse: Improved performance of sparsely initialized networks via a subspace offset , author=. International Conference on Machine Learning , pages=. 2021 , organization=

  26. [26]

    Advances in Neural Information Processing Systems , volume=

    Generalized test utilities for long-tail performance in extreme multi-label classification , author=. Advances in Neural Information Processing Systems , volume=

  27. [27]

    ICLR , year=

    Dynamic Sparse Training with Structured Sparsity , author=. ICLR , year=

  28. [28]

    arXiv preprint arXiv:2507.03117 , year=

    BLaST: High Performance Inference and Pretraining using BLock Sparse Transformers , author=. arXiv preprint arXiv:2507.03117 , year=

  29. [29]

    Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis , pages=

    Venom: A vectorized n: M format for unleashing the power of sparse tensor cores , author=. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis , pages=

  30. [30]

    International Conference on Learning Representations , year =

    Learning N: M Fine-grained Structured Sparse Neural Networks From Scratch , author=. International Conference on Learning Representations , year =

  31. [31]

    Advances in Neural Information Processing Systems , volume=

    S-ste: Continuous pruning function for efficient 2: 4 sparse pre-training , author=. Advances in Neural Information Processing Systems , volume=

  32. [32]

    Proceedings of Machine Learning and Systems , volume=

    Efficient gpu kernels for n: M-sparse weights in deep learning , author=. Proceedings of Machine Learning and Systems , volume=

  33. [33]

    International conference on machine learning , pages=

    Rigging the lottery: Making all tickets winners , author=. International conference on machine learning , pages=. 2020 , organization=

  34. [34]

    Nature communications , volume=

    Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science , author=. Nature communications , volume=. 2018 , publisher=

  35. [35]

    The State of Sparsity in Deep Neural Networks

    The state of sparsity in deep neural networks , author=. arXiv preprint arXiv:1902.09574 , year=

  36. [36]

    Communications of the ACM , volume=

    The hardware lottery , author=. Communications of the ACM , volume=. 2021 , publisher=

  37. [37]

    Learning N: M Fine-grained Structured Sparse Neural Networks From Scratch , author=

  38. [38]

    arXiv preprint arXiv:2402.00025 , year=

    Accelerating a Triton Fused Kernel for W4A16 Quantized Inference with SplitK work decomposition , author=. arXiv preprint arXiv:2402.00025 , year=

  39. [39]

    Proceedings of the ACM on Programming Languages , volume=

    SPLAT: A framework for optimised GPU code-generation for SParse reguLar ATtention , author=. Proceedings of the ACM on Programming Languages , volume=. 2025 , publisher=

  40. [40]

    Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V

    How Well Calibrated are Extreme Multi-label Classifiers? An Empirical Analysis , author=. Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 1 , pages=