pith. machine review for the scientific record.

arxiv: 2604.26039 · v1 · submitted 2026-04-28 · 💻 cs.LG · cs.AI · cs.DC


RaMP: Runtime-Aware Megakernel Polymorphism for Mixture-of-Experts


Pith reviewed 2026-05-07 16:32 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.DC
keywords mixture-of-experts · kernel configuration · runtime dispatch · performance modeling · polymorphic kernels · inference serving · CUDA optimization · megakernel

The pith

A four-parameter wave cost model selects near-optimal kernel configurations for Mixture-of-Experts inference from the runtime expert histogram.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Mixture-of-Experts inference loses 10-70% of kernel throughput when dispatch chooses configurations from batch size alone, ignoring the distribution of routed experts. RaMP supplies a routing-aware framework whose performance-region analysis derives optimization rules from hardware constants alone. The framework then applies a four-parameter wave cost model, fitted once in 10-24 minutes of profiling, to pick the fastest configuration at runtime from the expert histogram. Because the model depends only on CTA grid geometry, the same selection logic works across kernels without source changes.
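
A minimal sketch of what this selection could look like, in NumPy. The linear cost form, the feature choice, and every name here are illustrative assumptions, not RaMP's published implementation:

```python
import numpy as np

NUM_SMS = 132  # SM count of the target GPU (H100 in the paper's figures)

def wave_features(hist, bm):
    """Grid geometry induced by an expert histogram and tile height bm.
    Each expert's token count is padded up to a multiple of bm, so skewed
    routing inflates the CTA count; CTAs then execute in waves of NUM_SMS."""
    ctas = int(np.ceil(np.asarray(hist) / bm).sum())
    waves = int(np.ceil(ctas / NUM_SMS))
    tail = ctas - (waves - 1) * NUM_SMS  # CTAs left for the final, partial wave
    return np.array([1.0, float(waves), float(ctas), float(tail)])

def predicted_cost(hist, bm, coeffs):
    # coeffs holds the four fitted parameters; a linear form is assumed here.
    return wave_features(hist, bm) @ coeffs

def select_config(hist, configs, fitted):
    """Return the tile height with the lowest predicted execution time."""
    return min(configs, key=lambda bm: predicted_cost(hist, bm, fitted[bm]))
```

At serving time the histogram itself is one bincount over the router's expert assignments, e.g. `np.bincount(expert_ids, minlength=num_experts)`; per the Figure 5 caption below, the production version fuses this whole path into a single Triton kernel costing under 2.5 µs per MoE layer.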

Core claim

The paper establishes that a performance-region analysis derived solely from hardware constants correctly predicts when each optimization helps on all eight tested architectures, including three previously unseen. From this foundation, a four-parameter wave cost model selects the fastest polymorphic configuration from the runtime expert histogram, achieving 0.93% mean regret versus exhaustive search after brief one-time profiling. When combined with a CuTe DSL megakernel that exposes 134-268 configurations, the method produces 1.22x kernel speedup over static dispatch and 1.30x end-to-end speedup in vLLM serving.

What carries the argument

The four-parameter wave cost model that estimates kernel execution time from CTA grid geometry and the runtime expert routing histogram to choose among polymorphic configurations.

If this is right

  • Static batch-size-only dispatch leaves 10-70% of attainable kernel throughput unrealized in MoE serving.
  • RaMP delivers 1.22x kernel speedup over static dispatch, with end-to-end serving speedups of 1.30x over Triton, 1.41x over DeepGEMM, and 1.13x over FlashInfer CUTLASS.
  • The same selection logic transfers to unmodified kernels such as Alpha-MoE, producing a 1.14x improvement.
  • Hardware-constant predictions hold for all eight evaluated architectures without per-architecture retuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Histogram-driven selection could extend to other sparse workloads whose optimal kernels also vary with activation patterns.
  • Compiler integration might reduce the one-time profiling step to near-zero for new model variants.
  • Online histogram collection could support per-request adaptation when serving mixes of models on shared hardware.

Load-bearing premise

That a four-parameter model based only on CTA grid geometry can accurately rank kernel configurations across different expert routing distributions.
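
A direct way to probe that premise, sketched under the same assumed cost form as above: measure how often the model's predicted-best configuration coincides with the measured-best one across routing distributions of varying skew.

```python
def top1_agreement(hists, configs, fitted, measure):
    """Fraction of histograms where the predicted argmin config is also the
    measured-fastest config. `measure(hist, bm)` would time the real kernel;
    `predicted_cost` and `fitted` come from the dispatch sketch above."""
    hits = sum(
        min(configs, key=lambda bm: predicted_cost(hist, bm, fitted[bm]))
        == min(configs, key=lambda bm: measure(hist, bm))
        for hist in hists
    )
    return hits / len(hists)
```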

What would settle it

Recording a mean regret substantially above 0.93% when the fitted wave cost model is applied to a new MoE architecture or GPU not used during the initial 10-24 minute profiling.
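
Concretely, the regret statistic in question can be computed like this (an illustrative sketch; the paper's exact averaging protocol is not given in this summary):

```python
def mean_regret(choices, measured):
    """measured[point][bm]: measured kernel time per config at an operating
    point; choices[point]: the config the cost model selected there. Regret
    is the relative slowdown versus the per-point exhaustive optimum."""
    regrets = []
    for point, times in measured.items():
        best = min(times.values())
        regrets.append((times[choices[point]] - best) / best)
    return sum(regrets) / len(regrets)  # 0.0093 would match the reported 0.93%
```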

Figures

Figures reproduced from arXiv:2604.26039 by Debajyoti Datta, Vyom Sharma.

Figure 1: CTA tile allocation under static vs. routing-aware dispatch.
Figure 3: Wave utilization ω vs. routing balancedness β for OLMoE (E=64, bm=16). Skewed routing (lower β) fragments the CTA grid into partially filled waves, reducing SM occupancy by 15–30% for small batches. Shaded bands show ±1σ over 300 routing samples. The red region marks the typical operating range. (Assumed working definitions of β and ω are sketched after the figure list.)
Figure 2: Distribution of routing balancedness β measured across real inference workloads. Both models concentrate near β ≈ 0.5, far from the uniform β = 1.0 point where static dispatch is tuned (red dashed line). The “mismatch” arrow highlights the gap between the actual and the assumed operating regime.
Figure 5: System overview. Offline: enumerate valid configs, JIT-compile and profile each at 25 operating points, then fit OLS cost coefficients. Online: a single fused Triton kernel performs expert bincount, cost evaluation, and argmin; the result selects the pre-compiled CuTe DSL kernel binary. Total online overhead: <2.5 µs amortized per MoE layer.
Figure 6: Split-K fills idle SMs at sub-wave operating points.
Figure 7: Regime classification of 8 MoE models.
Figure 8: (a) Cost-model predicted time for four bm values on OLMoE vs. batch size S. At small S, bm=8 achieves the lowest cost (minimal padding); at large S, bm=64 wins (fewer waves). (b) Wave staircase for bm=16: time jumps discretely at wave boundaries (SM=132).
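
The figure captions lean on two quantities, routing balancedness β and wave utilization ω, whose exact definitions do not survive into this page. A minimal sketch under assumed definitions (β as mean-over-max expert load, ω as the filled fraction of launched CTA slots):

```python
import numpy as np

def balancedness(hist):
    # Assumed definition: mean expert load over max expert load. Uniform
    # routing gives beta = 1.0; skewed routing drives beta toward 0.
    hist = np.asarray(hist, dtype=float)
    return float(hist.mean() / hist.max())

def wave_utilization(hist, bm, num_sms=132):
    # Assumed definition: fraction of CTA slots carrying real work. Padding
    # each expert to a multiple of bm and partially filled final waves both
    # waste slots, which is the fragmentation Figure 3 depicts.
    ctas = np.ceil(np.asarray(hist) / bm).sum()
    waves = np.ceil(ctas / num_sms)
    return float(ctas / (waves * num_sms))
```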
Original abstract

The optimal kernel configuration for Mixture-of-Experts (MoE) inference depends on both batch size and the expert routing distribution, yet production systems dispatch from batch size alone, leaving 10-70% of kernel throughput unrealized. We present RaMP, a routing-aware dispatch framework. A performance-region analysis derives, from hardware constants alone, when each optimization helps, correctly predicting all 8 tested architectures, including 3 unseen. A four-parameter wave cost model selects the fastest configuration from the runtime expert histogram, achieving 0.93% mean regret versus exhaustive search, fitted from just 10-24 minutes of one-time profiling per model. Because the model depends only on CTA grid geometry, it is kernel-agnostic: applied to Alpha-MoE, it delivers 1.14x with no source modification. Paired with a co-designed CuTe DSL kernel exposing 134-268 polymorphic configurations, RaMP delivers 1.22x kernel speedup over static dispatch and 1.30x end-to-end speedup in vLLM serving over Triton, 1.41x over DeepGEMM, and 1.13x over FlashInfer CUTLASS.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, simulated authors' rebuttal, circularity audit, and axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces RaMP, a routing-aware dispatch framework for Mixture-of-Experts inference. It features a performance-region analysis derived from hardware constants alone that predicts when each optimization helps and correctly forecasts behavior across all 8 tested architectures (including 3 unseen). A four-parameter wave cost model, fitted via 10-24 minutes of one-time profiling per model, selects the fastest configuration from the runtime expert histogram and achieves 0.93% mean regret versus exhaustive search. The model depends only on CTA grid geometry, making it kernel-agnostic; when paired with a co-designed CuTe DSL kernel exposing 134-268 polymorphic configurations, RaMP yields 1.22x kernel speedup over static dispatch and 1.30x end-to-end speedup in vLLM serving (1.41x over DeepGEMM, 1.13x over FlashInfer CUTLASS).

Significance. If the performance-region analysis and low-regret selection hold, the work could meaningfully advance efficient MoE serving by closing the 10-70% throughput gap left by batch-size-only dispatch. The kernel-agnostic property, low profiling overhead, and demonstrated speedups on multiple architectures (including application to Alpha-MoE with no source changes) are strengths that would support practical adoption in production inference systems.

major comments (3)
  1. [Performance-region analysis] The central claim that the analysis derives purely from hardware constants and correctly predicts optimization benefits on all 8 architectures (including 3 unseen) is load-bearing for both the kernel-agnostic property and the reported speedups, yet no derivation steps, explicit equations, or list of constants appear in the manuscript. This leaves the generalization risk unaddressed.
  2. [Wave cost model] Four-parameter wave cost model: the model is fitted directly to profiling data collected on the target hardware, creating a circularity burden for the 0.93% mean regret claim; the fitting/validation procedure (including how the four parameters were chosen and whether cross-hardware testing was performed) must be detailed to confirm it is not post-hoc tuning.
  3. [Experimental results] Empirical evaluation: the abstract and results report concrete speedups (1.22x kernel, 1.30x end-to-end) and low regret without error bars, number of runs, or statistical significance tests; the tables or figures presenting these numbers should include variance to allow assessment of robustness.
minor comments (2)
  1. The abstract would be clearer if it briefly defined 'megakernel polymorphism' and 'CTA grid geometry' on first use.
  2. [Wave cost model] Notation for the wave cost model parameters is introduced without an accompanying equation or table listing their values across architectures.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications and commit to revisions that strengthen the presentation of our methods and results.

Point-by-point responses
  1. Referee: [Performance-region analysis] The central claim that the analysis derives purely from hardware constants and correctly predicts optimization benefits on all 8 architectures (including 3 unseen) is load-bearing for both the kernel-agnostic property and the reported speedups, yet no derivation steps, explicit equations, or list of constants appear in the manuscript. This leaves the generalization risk unaddressed.

    Authors: We agree that the manuscript lacks sufficient detail on the derivation. The performance-region analysis is constructed from a roofline comparison of each configuration's arithmetic intensity against hardware constants (peak FP16 throughput, memory bandwidth, L2 cache size, and CTA occupancy limits) obtained from vendor specifications. Regions are delineated by the balance point at which behavior flips between memory-bound and compute-bound. We will add a new subsection (or appendix) containing the explicit equations, the full list of constants for all eight architectures, and the step-by-step prediction procedure that was validated on the three unseen architectures. (A toy balance-point computation in this style appears after these responses.) revision: yes

  2. Referee: [Wave cost model] Four-parameter wave cost model: the model is fitted directly to profiling data collected on the target hardware, creating a circularity burden for the 0.93% mean regret claim; the fitting/validation procedure (including how the four parameters were chosen and whether cross-hardware testing was performed) must be detailed to confirm it is not post-hoc tuning.

    Authors: The four parameters map directly to observable quantities (wave launch overhead, per-CTA compute time, per-CTA memory time, and synchronization cost) and are fitted once via least squares on a modest set of representative histograms collected in 10-24 minutes. The 0.93% regret is computed against exhaustive search on identical hardware and workload distribution, which is the correct baseline for a runtime selector. We will expand the manuscript with the exact fitting procedure, the rationale for the four-parameter form, the cross-validation protocol (held-out histograms), and results of applying the model structure across architectures (with per-hardware refitting of coefficients). (A minimal least-squares sketch of this fitting step also follows these responses.) revision: yes

  3. Referee: [Experimental results] Empirical evaluation: the abstract and results report concrete speedups (1.22x kernel, 1.30x end-to-end) and low regret without error bars, number of runs, or statistical significance tests; the tables or figures presenting these numbers should include variance to allow assessment of robustness.

    Authors: We concur that variance information improves assessment of robustness. All reported speedups and regret figures are means over at least five independent runs per configuration. We will update the relevant tables and figures to display standard deviations as error bars, state the number of repetitions explicitly, and add a brief discussion of statistical significance (e.g., paired t-tests for key comparisons). revision: yes
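
To illustrate the style of analysis the first response describes, the roofline balance point [17] really does fall out of two vendor constants; approximate H100 SXM numbers are used below purely for illustration.

```python
def balance_point(peak_flops, mem_bandwidth):
    """Arithmetic intensity (FLOP/byte) at which a kernel crosses from
    memory-bound to compute-bound on the classic roofline model."""
    return peak_flops / mem_bandwidth

# Approximate H100 SXM vendor specs: ~989 TFLOP/s dense FP16 tensor-core
# throughput and ~3.35 TB/s HBM3 bandwidth.
i_star = balance_point(989e12, 3.35e12)
print(f"compute-bound above ~{i_star:.0f} FLOP/byte")  # ~295
```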
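And the fitting step from the second response, as a minimal least-squares sketch reusing the assumed `wave_features` design matrix from the dispatch example near the top of this page:

```python
import numpy as np

def fit_wave_model(profiled, bm):
    """profiled: (hist, measured_time) pairs for tile height bm, e.g. the
    25 operating points the Figure 5 caption mentions. Returns the four
    OLS coefficients consumed by predicted_cost."""
    X = np.stack([wave_features(hist, bm) for hist, _ in profiled])
    y = np.array([t for _, t in profiled])
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs
```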

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

Full rationale

The abstract explicitly describes the four-parameter wave cost model as fitted from one-time profiling data and the performance-region analysis as derived from hardware constants alone, with the empirical results (0.93% regret, correct predictions on 8 architectures including 3 unseen) presented as outcomes rather than as definitions. The provided text contains no self-referential definitions and no predictions that reduce to their inputs by construction. The kernel-agnostic claim and the speedups rest on these stated derivations without evidence of load-bearing self-citation or ansatz smuggling, so the derivation chain stands on its own and remains checkable against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on a fitted four-parameter cost model and the assumption that performance regions are derivable from hardware constants alone; no new physical entities are postulated.

free parameters (1)
  • four wave cost model parameters
    Fitted from 10-24 minutes of one-time profiling per model to achieve 0.93% mean regret
axioms (1)
  • domain assumption: Performance regions for kernel optimizations can be derived from hardware constants alone and correctly predict behavior on unseen architectures
    Invoked to claim the analysis works for all 8 tested architectures including 3 unseen

pith-pipeline@v0.9.0 · 5515 in / 1498 out tokens · 67654 ms · 2026-05-07T16:32:37.920148+00:00 · methodology


Reference graph

Works this paper leans on

18 extracted references · 15 canonical work pages · 3 internal anchors

  1. [1]

    DeepSeek-V3 Technical Report

    DeepSeek-AI, “DeepSeek-V3 technical report,” arXiv preprint arXiv:2412.19437, 2024. https://doi.org/10.48550/arXiv.2412.19437

  2. [2]

    OLMoE: Open Mixture-of-Experts Language Models

    N. Muennighoff, L. Soldaini, D. Groeneveld, K. Lo, J. Morrison, S. Min, W. Shi, P. Walsh, O. Tafjord, N. Lambert, Y. Gu, S. Arora, A. Bhagia, D. Schwenk, D. Wadden, A. Wettig, B. Hui, T. Dettmers, D. Kiela, A. Farhadi, N. A. Smith, P. W. Koh, A. Singh, and H. Hajishirzi, “OLMoE: Open mixture-of-experts language models,” arXiv preprint arXiv:2409.02060, 2024.

  3. [3]

    Qwen3 Technical Report

    Qwen Team, “Qwen3 technical report,” arXiv preprint arXiv:2505.09388, 2025. https://doi.org/10.48550/arXiv.2505.09388

  4. [4]

    Efficient Memory Management for Large Language Model Serving with PagedAttention

    W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica, “Efficient memory management for large language model serving with PagedAttention,” in Proceedings of the 29th Symposium on Operating Systems Principles (SOSP), pp. 611–626, 2023. https://doi.org/10.1145/3600006.3613165

  5. [5]

    Alpha-MoE: Fused Mixture-of-Experts Kernel

    Aleph Alpha, “Alpha-MoE: Fused mixture-of-experts kernel.” https://github.com/Aleph-Alpha/Alpha-MoE, 2025

  6. [6]

    DeepGEMM: Clean and Efficient FP8 GEMM Kernels

    DeepSeek-AI, “DeepGEMM: Clean and efficient fp8 gemm kernels.” https://github.com/deepseek-ai/DeepGEMM, 2025

  7. [7]

    FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving

    Z. Ye, L. Chen, R. Lai, W. Lin, Y. Zhang, S. Wang, T. Chen, B. Kasikci, V. Grover, A. Krishnamurthy, and L. Ceze, “FlashInfer: Efficient and customizable attention engine for LLM inference serving,” in Proceedings of Machine Learning and Systems (MLSys), 2025. https://doi.org/10.48550/arXiv.2501.01005

  8. [8]

    SonicMoE: Accelerating MoE with IO and Tile-Aware Optimizations

    W. Guo, M. Mishra, X. Cheng, I. Stoica, and T. Dao, “SonicMoE: Accelerating MoE with IO and tile-aware optimizations,” arXiv preprint arXiv:2512.14080, 2025. https://doi.org/10.48550/arXiv.2512.14080

  9. [9]

    CUTLASS: CUDA Templates for Linear Algebra Subroutines

    NVIDIA, “CUTLASS: CUDA templates for linear algebra subroutines.” https://github.com/NVIDIA/cutlass, 2024

  10. [10]

    MegaBlocks: Efficient Sparse Training with Mixture-of-Experts

    T. Gale, D. Narayanan, C. Young, and M. Zaharia, “MegaBlocks: Efficient sparse training with mixture-of-experts,” in Proceedings of Machine Learning and Systems (MLSys), 2023. https://doi.org/10.48550/arXiv.2211.15841

  11. [11]

    Tutel: Adaptive Mixture-of-Experts at Scale

    C. Hwang, W. Cui, Y. Xiong, Z. Yang, Z. Liu, H. Hu, Z. Wang, R. Salas, J. Jose, P. Ram, H. Chau, P. Cheng, F. Yang, M. Yang, and Y. Xiong, “Tutel: Adaptive mixture-of-experts at scale,” in Proceedings of Machine Learning and Systems (MLSys), 2023. https://doi.org/10.48550/arXiv.2206.03382

  12. [12]

    FasterMoE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-trained Models

    J. He, J. Zhai, T. Antunes, H. Wang, F. Luo, S. Shi, and Q. Li, “FasterMoE: Modeling and optimizing training of large-scale dynamic pre-trained models,” in Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pp. 120–134, 2022. https://doi.org/10.1145/3503221.3508418

  13. [13]

    Scattered Mixture-of-Experts Implementation

    S. Tan, Y. Shen, R. Panda, and A. Courville, “Scattered mixture-of-experts implementation,” arXiv preprint arXiv:2403.08245, 2024. https://doi.org/10.48550/arXiv.2403.08245

  14. [14]

    SGLang: Efficient Execution of Structured Language Model Programs

    L. Zheng, L. Yin, Z. Xie, C. Sun, J. Huang, C. H. Yu, S. Cao, C. Kozyrakis, I. Stoica, J. E. Gonzalez, C. Barrett, and Y. Sheng, “SGLang: Efficient execution of structured language model programs,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 37, 2024. https://doi.org/10.48550/arXiv.2312.07104

  15. [15]

    Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations

    P. Tillet, H. T. Kung, and D. Cox, “Triton: An intermediate language and compiler for tiled neural network computations,” in Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (MAPL), pp. 10–19, 2019. https://doi.org/10.1145/3315508.3329973

  16. [16]

    Ansor: Generating High-Performance Tensor Programs for Deep Learning

    L. Zheng, C. Jia, M. Sun, Z. Wu, C. H. Yu, A. Haj-Ali, Y. Wang, J. Yang, D. Zhuo, K. Sen, J. E. Gonzalez, R. Bodik, and I. Stoica, “Ansor: Generating high-performance tensor programs for deep learning,” in 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pp. 863–879, 2020. https://doi.org/10.48550/arXiv.2006.06762

  17. [17]

    Roofline: An Insightful Visual Performance Model for Multicore Architectures

    S. Williams, A. Waterman, and D. Patterson, “Roofline: An insightful visual performance model for multicore architectures,” Communications of the ACM, vol. 52, no. 4, pp. 65–76, 2009. https://doi.org/10.1145/1498765.1498785

  18. [18]

    Quantifying Performance Bottlenecks of Stencil Computations Using the Execution-Cache-Memory Model

    H. Stengel, J. Treibig, G. Hager, and G. Wellein, “Quantifying performance bottlenecks of stencil computations using the Execution-Cache-Memory model,” in Proceedings of the 29th ACM International Conference on Supercomputing (ICS), pp. 207–216, ACM, 2015. https://doi.org/10.1145/2751205.2751240