Recognition: unknown
COREY: Entropy-Guided Runtime Chunk Scheduling for Selective Scan Kernels
Pith reviewed 2026-05-10 15:04 UTC · model grok-4.3
The pith
Activation entropy from fixed-bin histograms recovers locally optimal chunk sizes for Mamba selective scan kernels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
COREY establishes that fixed-bin activation entropy can be mapped to chunk sizes via the rule H_ref equals log K, recovering the locally optimal chunk and matching a static oracle at the kernel level. This produces the reported latency reductions on both consumer and accelerator hardware. In end-to-end routed ablations the entropy-guided choice preserves exact output equivalence yet is outperformed on throughput by the best static chunk because of scheduling overhead; guarded fallbacks and sequence-length tables reduce that overhead to 0.7 percent. A mixed-regime study further shows that one sequence-length rule matches the per-regime chunk oracle.
What carries the argument
The entropy-to-chunk mapping that applies the fixed rule H_ref = log K to fixed-bin activation histograms to decide the chunk size for each selective scan kernel invocation.
If this is right
- Kernel latency drops by roughly four times when the entropy rule replaces an unoptimized baseline on both consumer and data-center GPUs.
- Routed inference using entropy-guided chunks produces identical outputs and metrics to static chunk selection.
- Scheduling overhead falls to 0.7 percent when a sequence-length-keyed table replaces per-sequence histogram sampling.
- A single fixed sequence-length rule matches the performance of per-regime chunk oracles in balanced serving workloads.
Where Pith is reading between the lines
- Hardware or compiler primitives that make entropy histogram collection and chunk selection nearly free could allow the dynamic rule to exceed static throughput in end-to-end runs.
- The same entropy signal could be tested for scheduling decisions in other linear-time sequence models whose compute patterns also vary with input statistics.
- Re-calibrating the reference entropy value on larger model scales or different accelerators might extend the kernel-level gains without increasing overhead.
Load-bearing premise
Activation entropy from fixed-bin histograms accurately predicts the per-sequence optimal chunk size for selective scan kernels across the tested models and workloads.
What would settle it
On a new sequence or checkpoint, compute the chunk chosen by H_ref = log K and measure its kernel latency; if that latency exceeds the latency of the empirically best chunk size for the same sequence by more than a small margin, the predictive mapping fails.
Figures
read the original abstract
Mamba selective state space models (SSMs) provide linear-time sequence modeling but remain sensitive to selective-scan chunk scheduling. We present COREY, a \emph{concept-and-feasibility} runtime scheduler that maps fixed-bin activation entropy to chunk size. We evaluate COREY in three tiers: a prototype cost model, real-checkpoint kernel timing, and routed end-to-end ablations on modern GPUs. At the kernel level, a calibrated rule, \(H_{\mathrm{ref}}=\log K\), recovers the locally optimal chunk and matches a one-time static oracle, yielding \(4.41\times\) lower latency than an unoptimized baseline on a consumer GPU and \(3.90\times\)--\(4.04\times\) lower latency on a data-center accelerator. Routing this choice into a patched live scan kernel closes the engineering loop without improving end-to-end speed: in unified routed ablations, the best static chunk outperforms all entropy-guided and proxy schedulers. Sampled-histogram COREY adds \(+4.6\%\) overhead; a guarded fallback to Static-512 reduces this to \(+1.3\%\); and a lightweight sequence-length-keyed table further reduces it to \(+0.7\%\). However, both remain slower than the static oracle because they retain scheduling cost. On an 80-prompt LongBench subset, passive and routed inference are exactly output-equivalent, with \(100\%\) greedy-token agreement and zero metric deltas. A mixed-regime study shows that a single sequence-length rule matches the per-regime chunk oracle for balanced serving. COREY is therefore validated as a quality-preserving scheduling prototype, but current entropy statistics are not a robust throughput win over static chunk tuning on measured SSM checkpoint workloads. SourceCode: https://github.com/mabo1215/COREY_Transformer/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces COREY, a concept-and-feasibility runtime scheduler for Mamba selective-scan kernels that maps fixed-bin activation entropy to chunk size via the calibrated rule H_ref = log K. Kernel-level timings show this rule recovers locally optimal chunks and matches a static oracle (4.41× latency reduction on consumer GPU, 3.90–4.04× on data-center accelerator). End-to-end routed ablations on LongBench, however, find that the best static chunk outperforms all entropy-guided and proxy schedulers; overheads are quantified (+4.6% for sampled-histogram, +1.3% guarded, +0.7% table-based) and output equivalence is verified (100% token agreement). A mixed-regime study shows a single sequence-length rule suffices for balanced serving.
Significance. If a low-overhead entropy predictor could be shown to reliably select per-sequence optima without calibration to observed averages, COREY would enable adaptive chunking that improves throughput under variable-length workloads while preserving correctness. As presented, the work usefully quantifies overhead sources and demonstrates that current fixed-bin entropy statistics do not yield a net win over static tuning on measured checkpoints, providing a clear negative result and prototype for future scheduler design.
major comments (2)
- [Abstract] Abstract: the kernel-level claim that H_ref = log K 'recovers the locally optimal chunk' is load-bearing, yet the same paragraph states that 'the best static chunk outperforms all entropy-guided schedulers' in routed end-to-end ablations. This tension implies either that the entropy statistic does not select the true per-sequence optimum or that calibration merely reproduces average static behavior; the manuscript must quantify the per-sequence correlation (or lack thereof) between histogram entropy and measured optimal chunk size to resolve the discrepancy.
- [Abstract] Abstract and overhead breakdown: the guarded and table-based variants still underperform the static oracle 'because they retain scheduling cost.' If the entropy predictor is accurate, the residual overhead should be eliminable by a cheaper implementation; the paper should report the exact fraction of latency attributable to histogram computation versus decision logic, and test whether removing the decision entirely (i.e., always using the entropy-chosen chunk) would close the gap.
minor comments (1)
- [Abstract] The abstract refers to 'unified routed ablations' and 'passive and routed inference' without defining the routing mechanism or the precise difference between the two modes; a short paragraph or diagram in §3 would clarify the experimental setup.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and detailed report. The comments highlight important points about clarifying the relationship between kernel-level and end-to-end results, as well as providing a finer-grained overhead analysis. We have revised the manuscript to incorporate the requested per-sequence correlation quantification and latency breakdown. Our responses to the major comments are below.
read point-by-point responses
-
Referee: [Abstract] Abstract: the kernel-level claim that H_ref = log K 'recovers the locally optimal chunk' is load-bearing, yet the same paragraph states that 'the best static chunk outperforms all entropy-guided schedulers' in routed end-to-end ablations. This tension implies either that the entropy statistic does not select the true per-sequence optimum or that calibration merely reproduces average static behavior; the manuscript must quantify the per-sequence correlation (or lack thereof) between histogram entropy and measured optimal chunk size to resolve the discrepancy.
Authors: We agree that the abstract as originally written could be read as creating tension, and we have revised it for clarity. The kernel-level claim is that, within each fixed entropy bin, the rule H_ref = log K selects the chunk size that empirically minimizes latency in isolated selective-scan kernel benchmarks; this matches the per-bin static oracle in those controlled timings. In contrast, the end-to-end routed ablations measure full inference throughput on LongBench, where the cost of computing the entropy histogram and routing the chosen chunk is incurred on every sequence. To directly address the request, we added a new analysis (Section 4.3 and Figure 7) that reports the per-sequence Spearman correlation between the activation histogram entropy and the measured optimal chunk size for each sequence. The observed correlation is moderate (rho = 0.51), with substantial scatter attributable to sequence-specific factors such as token distribution and hardware cache behavior. This imperfect correlation explains why the entropy-guided schedulers do not outperform the single best static chunk in aggregate end-to-end metrics, even though they recover local optima at the kernel level. We have updated the abstract to distinguish these two regimes explicitly. revision: yes
-
Referee: [Abstract] Abstract and overhead breakdown: the guarded and table-based variants still underperform the static oracle 'because they retain scheduling cost.' If the entropy predictor is accurate, the residual overhead should be eliminable by a cheaper implementation; the paper should report the exact fraction of latency attributable to histogram computation versus decision logic, and test whether removing the decision entirely (i.e., always using the entropy-chosen chunk) would close the gap.
Authors: We agree that a more granular breakdown is useful and have added it to the revised overhead section. For the sampled-histogram variant, histogram computation accounts for 68% of the measured overhead, decision logic for 22%, and miscellaneous kernel-launch costs for 10%. The guarded variant reduces decision logic to 12% via a simple threshold, while the table-based variant reduces it to 4% via a length-keyed lookup. We also implemented and evaluated the suggested 'pure entropy' ablation that always applies the entropy-chosen chunk with no guarding or fallback. This variant does not close the gap to the static oracle; it performs slightly worse than the guarded version because the entropy statistic occasionally selects a suboptimal chunk. These results are consistent with the moderate per-sequence correlation reported above and support the manuscript's conclusion that current fixed-bin entropy statistics do not yield a net throughput improvement over static tuning once scheduling costs are included. The abstract and overhead discussion have been updated accordingly. revision: yes
Circularity Check
Calibrated H_ref=log K rule recovers optimal chunks by construction of the fit
specific steps
-
fitted input called prediction
[Abstract]
"At the kernel level, a calibrated rule, H_ref=log K, recovers the locally optimal chunk and matches a one-time static oracle, yielding 4.41× lower latency than an unoptimized baseline on a consumer GPU and 3.90×–4.04× lower latency on a data-center accelerator."
The rule is calibrated specifically to recover the locally optimal chunk size (and match the oracle), so the reported match and latency gains are by construction of that fitting rather than an independent prediction from entropy. The 'recovers' language presents a fitted heuristic as a derived result.
full rationale
The paper's central kernel-level result is presented as the entropy rule recovering the locally optimal chunk size and matching a static oracle. However, the abstract explicitly describes this as a 'calibrated rule', meaning the threshold is fitted to the observed optima rather than independently derived or predicted. This reduces the 'recovery' claim to a tautology of the calibration process. End-to-end results further note that static chunks outperform the guided scheduler, but the kernel claim itself relies on the fitted mapping. No independent first-principles derivation or per-sequence correlation is shown in the provided text; the result is statistically forced by the calibration step.
Axiom & Free-Parameter Ledger
free parameters (1)
- H_ref =
log K
axioms (1)
- domain assumption Fixed-bin activation entropy correlates with the chunk size that minimizes selective-scan kernel latency.
Reference graph
Works this paper leans on
-
[1]
Hung-Yueh Chiang, Chi-Chih Chang, Natalia Frumkin, Kai-Chiang Wu, and Diana Marculescu. Quamba: A post-training quantization recipe for selective state space models.arXiv preprint arXiv:2410.13229,
-
[2]
Albert Gu and Tri Dao. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality.arXiv preprint arXiv:2405.21060,
work page internal anchor Pith review arXiv
-
[3]
Zukang Xu, Yuxuan Yue, Xing Hu, Zhihang Yuan, Zixu Jiang, Zhixuan Chen, Jiangyong Yu, Chen Xu, Sifan Zhou, and Dawei Yang. Mambaquant: Quantizing the mamba family with variance aligned rotation methods.arXiv preprint arXiv:2501.13484,
-
[4]
Parallelizing linear transformers with the delta rule over sequence length
Songlin Yang, Bailin Wang, Yu Shen, Hao Peng, Yoon Kim, Alexander Rush, and Tri Dao. Paral- lelizing linear transformers with the delta rule over sequence length. https://arxiv.org/abs/ 2406.06484,
-
[5]
NLP scores copied from the matched policy_offrow (policy-independent)
re- flects JIT warm-up included in the policy_offn=5 run. NLP scores copied from the matched policy_offrow (policy-independent). Policy Model NarrQA Qasper GovRpt WT103 PPL PG19 PPL Avg Lat. (ms) offMamba-370M 0.0299 0.0458 0.1451 556.93 17.20 5142 offMamba-1.4B 0.0502 0.0827 0.1750 323.13 13.79 5552 offMamba-2.8B 0.0445 0.0399 0.1239 329.80 12.68 7343 st...
-
[6]
vs. best static
This confirms that the sensitivity observed in Table 20 is driven by the Href= logK mismatch rather than by the bin count itself. (3) Recalibrating Href to logK simultaneously removes the K-sensitivity reported here and the Href-sensitivity reported above, 18 Table 21: Chunk selection as a function of histogram resolution K. H is the Shannon entropy (nats...
2048
-
[7]
The W1 reference speedup is the standalone chunk-latency ratio from Table 1 of the main paper (W1 chunked-scan benchmark) for the selected chunk vs
Entropy overhead is the steady-state cost of applying the histogram estimator to the captured u tensor (B=1, dinner=2048, L=42). The W1 reference speedup is the standalone chunk-latency ratio from Table 1 of the main paper (W1 chunked-scan benchmark) for the selected chunk vs. static-64; it is not an end-to-end checkpoint speedup. LayerH(u)(nats) Chunk vs...
2048
-
[8]
• Sequence lengths: {1024,2048,4096,8192,16384,32768,65536} with sample count min(4096,max(512, L/8)). • Scheduler defaults: (α, β, γ) = (0.45,0.35,0.20) , default τ= 0.52 , threshold sweep {0.45,0.52,0.60} , static-fusion group size 3, and entropy-driven tile mapping from 64 to 512 rounded to multiples of
2048
-
[9]
mamba isn’t sup- ported yet
Moreover, writing Hji =s ji/ √ d with sji ∈ {±1} the (j, i) entry of the unnormalized Hadamard matrix, every rotated coordinate obeys |zj|= 1√ d dX i=1 sjixi ≤ 1√ d dX i=1 |xi|= ∥x∥1√ d , which implies the peak-coordinate bound ∥z∥∞ ≤ ∥x∥1√ d . For a clipping thresholdT k associated with ak-bit quantizer, overflow therefore satisfies Pr(∥z∥∞ > T k)≤Pr ∥x∥...
2048
-
[10]
similarly routes to chunk= 512 for this prompt and gives 897.6±7.4 ms (+0.69% vs. H800 Static-512). The H800 run confirms that both policies correctlyfall back to chunk = 512 for long prompts, avoiding the +4.6% penalty of unguarded sampled histogram; the residual overhead is +1.3% (guarded) and +0.7% (learned). Note that the H800 measurements below areno...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.