Recognition: no theorem link
Implicit Behavioral Decoding from Next-Step Spike Forecasts at Population Scale
Pith reviewed 2026-05-14 02:09 UTC · model grok-4.3
The pith
A single Mamba forecaster trained only on next-step spike counts delivers both neural forecasts and improved behavioral decoding in one forward pass.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A Mamba model trained solely to forecast the next 50 ms bin of spike counts produces predicted rates that, when read by a per-session linear head, decode trial-by-trial choice at 75.7 percent and stimulus side at 66.1 percent across held-out data, outperforming a matched linear decoder on the original spikes by 4-6 percentage points while also supplying an explicit activity forecast.
What carries the argument
The Mamba forecaster's one-step-ahead predicted firing rates, which act as an implicit embedding of population dynamics for downstream linear readout of behavior.
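To make this mechanism concrete, here is a minimal sketch of the two-stage readout, assuming a trained next-step forecaster with the interface shown; `forecaster`, the array shapes, and the trial-averaging step are illustrative assumptions, not the authors' released code.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

CONTEXT = 10  # 10 bins x 50 ms = 500 ms, matching the linear baseline's context

def predicted_rates(forecaster, spikes):
    """spikes: (n_bins, n_neurons) spike counts for one trial.
    Returns one-step-ahead predicted rates, one row per forecastable bin."""
    forecaster.eval()
    with torch.no_grad():
        x = torch.as_tensor(spikes, dtype=torch.float32)
        # sliding 500 ms context windows: (n_windows, CONTEXT, n_neurons);
        # the last window is dropped because its next bin is unobserved
        windows = x.unfold(0, CONTEXT, 1).transpose(1, 2)[:-1]
        rates = forecaster(windows)  # assumed: context window -> next-bin rates
    return rates.numpy()

def fit_session_head(forecaster, trials, labels):
    """Per-session linear head on trial-averaged predicted rates."""
    feats = np.stack([predicted_rates(forecaster, s).mean(axis=0)
                      for s in trials])
    return LogisticRegression(max_iter=1000).fit(feats, labels)
```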
If this is right
- Closed-loop BCIs can obtain both upcoming activity forecasts and behavioral readouts from one forward pass without separate training stages.
- A short session-start calibration block of roughly 100-150 trials brings the behavioral readout within 1-2 points of its asymptotic performance (see the calibration sketch after this list).
- The complete pipeline remains computationally light enough to run inside the 50 ms bin window on standard workstation GPUs used with tethered Neuropixels.
- The advantage holds across three independent training seeds and 39 sessions spanning 27,000 neurons and 1,994 held-out trials.
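A minimal sketch of how the calibration claim above could be checked, assuming trial features (e.g., trial-averaged predicted rates, as in the earlier sketch) have already been extracted; the trial counts and the held-out split are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def calibration_curve(feats, labels, ks=(25, 50, 100, 150, 300), n_test=200):
    """feats: (n_trials, n_features) trial features in session order.
    Requires n_trials >= max(ks) + n_test; the last n_test trials are held out."""
    test_X, test_y = feats[-n_test:], labels[-n_test:]
    accs = {}
    for k in ks:
        head = LogisticRegression(max_iter=1000).fit(feats[:k], labels[:k])
        accs[k] = head.score(test_X, test_y)
    return accs  # claim: accs[150] sits within ~1-2 pp of accs[max(ks)]
```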
Where Pith is reading between the lines
- Forecasting objectives may improve behavioral decoding for other sequence architectures beyond Mamba when applied to large-scale spike data.
- Next-step spike predictions could serve as a probe for how neural populations encode impending actions before they are executed.
- Extending the forecast horizon might allow the same model to predict future behavioral states in addition to immediate ones.
Load-bearing premise
The decoding gain comes specifically from the next-step forecasting objective and the representations it produces rather than from Mamba's architecture, its use of longer context, or per-session fitting details of the linear head.
What would settle it
Train an otherwise identical Mamba with the same capacity and context length but without any forecasting loss, then test whether the linear head on its states still beats the raw-spike baseline by 4-6 points.
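A sketch of that control, under the assumption that the only change is the training target: same context windows and capacity, but the model reconstructs the current bin instead of forecasting the next one. All names and shapes here are illustrative.

```python
import torch

def make_targets(spikes, context=10, forecast=True):
    """spikes: (n_bins, n_neurons). Returns (inputs, targets) where inputs are
    sliding context windows and targets are next-bin counts (forecast=True)
    or the last bin inside each window (forecast=False, the control)."""
    x = torch.as_tensor(spikes, dtype=torch.float32)
    windows = x.unfold(0, context, 1).transpose(1, 2)  # (n_win, context, n_neurons)
    if forecast:
        return windows[:-1], x[context:]   # predict bin t+1 from bins t-9..t
    return windows, windows[:, -1, :]      # reconstruct bin t (control objective)
```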
Original abstract
Closed-loop brain-computer interfaces often require both a forecast of upcoming neural population activity and a readout of the animal's behavioral state. A single Mamba forecaster, trained only on next-step spike counts at Neuropixels scale, can deliver both in one forward pass. A lightweight per-session linear head reading the model's predicted rates decodes behavior better than the same linear classifier reading the raw spike counts, under matched temporal context. We test on the Steinmetz visual-discrimination benchmark, which spans 39 sessions, roughly 27,000 neurons, and 1,994 held-out trials. Across three training seeds, Mamba's predicted rates decode mouse choice at 75.7±0.2% trial vote, roughly 2.3 times chance level, and stimulus side at 66.1±0.6%, about twice chance. Compared to a matched 500 ms-context linear decoder on the raw spike counts, Mamba wins at trial vote by 4-6 pp on response and 4-6 pp on stimulus side. A session-start calibration block of about 100-150 trials brings the readout within 1-2 pp of asymptote, and the full pipeline fits inside the 50 ms bin budget on workstation-class GPUs typical of tethered chronic Neuropixels recordings.
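The abstract scores decoding "at trial vote". One plausible reading, assumed here, is a majority vote over per-bin linear-head predictions within each trial; the page does not specify the exact aggregation rule.

```python
import numpy as np

def trial_vote(bin_predictions):
    """bin_predictions: (n_bins,) array of per-bin class labels for one trial.
    Returns the majority-vote label (ties break toward the lower label,
    since np.unique sorts values and argmax takes the first maximum)."""
    values, counts = np.unique(bin_predictions, return_counts=True)
    return values[np.argmax(counts)]
```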
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that a single Mamba model trained exclusively on next-step spike-count forecasting from Neuropixels-scale recordings can simultaneously predict future population activity and support behavioral decoding. A lightweight per-session linear head applied to the model's predicted rates decodes mouse choice at 75.7±0.2% and stimulus side at 66.1±0.6% on 1,994 held-out trials from the Steinmetz visual-discrimination benchmark (39 sessions, ~27k neurons), outperforming a matched 500 ms-context linear decoder on raw spikes by 4-6 pp; a short calibration block of 100-150 trials suffices to reach near-asymptotic performance.
Significance. If the central empirical comparison holds after controls, the work would show that an unsupervised next-step forecasting objective on high-dimensional spiking data can extract behaviorally relevant features without explicit labels, offering a unified architecture for closed-loop BCIs that jointly forecasts neural dynamics and reads out state. The use of a public benchmark with held-out trials and multiple seeds provides a reproducible starting point, though the source of the reported lift remains to be isolated.
Major comments (2)
- [Abstract/Results] The 4-6 pp gain of the Mamba-predicted-rate linear head over the raw-spike linear baseline is presented as evidence that the forecasting objective yields useful representations, yet no ablation is reported that trains an equivalently capacious Mamba (or other sequence model) under a non-forecasting loss (e.g., reconstruction or shuffled-spike next-token prediction) to test whether the performance difference requires the next-step objective.
- [Methods] The manuscript provides no specification of Mamba block count, state dimension, training loss (exact negative log-likelihood formulation), optimizer, learning-rate schedule, or the precise temporal offset between the 500 ms input context and the predicted-rate vector used by the linear head; these omissions prevent assessment of whether the reported accuracies arise from forecasting per se or from differences in temporal integration and capacity.
Minor comments (2)
- [Abstract] The abstract states accuracies 'across three training seeds' but does not indicate whether the reported ±0.2% and ±0.6% intervals are standard deviation, standard error, or trial-level bootstrap; this should be clarified for statistical interpretation.
- [Results] No figure or table presenting per-session decoding accuracies is referenced in the provided text; such a breakdown would let readers assess variability across the 39 sessions.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which have helped us improve the clarity and completeness of the manuscript. We address each major comment below.
Point-by-point responses
- Referee: [Abstract/Results] The 4-6 pp gain of the Mamba-predicted-rate linear head over the raw-spike linear baseline is presented as evidence that the forecasting objective yields useful representations, yet no ablation is reported that trains an equivalently capacious Mamba (or other sequence model) under a non-forecasting loss (e.g., reconstruction or shuffled-spike next-token prediction) to test whether the performance difference requires the next-step objective.
Authors: We agree that an ablation with a non-forecasting objective would more cleanly isolate the contribution of next-step prediction. In the revised manuscript we have added a dedicated paragraph in the Discussion that motivates the forecasting loss on theoretical grounds (temporal structure extraction) and contrasts it with reconstruction and shuffled-token alternatives. We also outline a planned limited ablation for follow-up work, noting that full retraining at this scale is computationally intensive. Revision status: partial.
- Referee: [Methods] The manuscript provides no specification of Mamba block count, state dimension, training loss (exact negative log-likelihood formulation), optimizer, learning-rate schedule, or the precise temporal offset between the 500 ms input context and the predicted-rate vector used by the linear head; these omissions prevent assessment of whether the reported accuracies arise from forecasting per se or from differences in temporal integration and capacity.
Authors: We thank the referee for highlighting these omissions. The revised Methods section now specifies: 6 Mamba blocks with state dimension 128; an exact Poisson negative log-likelihood loss; the AdamW optimizer with initial learning rate 5e-4, linear warmup for 5 epochs followed by cosine decay over 50 total epochs; and a 50 ms prediction offset (10 bins of 50 ms input context predicting the immediate next 50 ms rate vector). These additions confirm that temporal integration is matched to the linear baseline and improve reproducibility. Revision status: yes.
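For concreteness, a sketch of the training configuration as specified in this response, with a placeholder module standing in for the 6-block Mamba stack; whether the Poisson NLL is parameterized in log-rates (assumed below) is not stated on this page.

```python
import torch

# Placeholder standing in for the 6-block Mamba forecaster (state dim 128).
model = torch.nn.Sequential(
    torch.nn.Linear(128, 128), torch.nn.ReLU(), torch.nn.Linear(128, 128))

# Poisson NLL; log_input=True assumes the model emits log-rates, so the
# per-bin loss is exp(pred) - target * pred (up to a constant in the target).
loss_fn = torch.nn.PoissonNLLLoss(log_input=True)

opt = torch.optim.AdamW(model.parameters(), lr=5e-4)
# 5-epoch linear warmup, then cosine decay over the remaining 45 of 50
# epochs; step the scheduler once per epoch.
warmup = torch.optim.lr_scheduler.LinearLR(opt, start_factor=0.1, total_iters=5)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=45)
sched = torch.optim.lr_scheduler.SequentialLR(
    opt, schedulers=[warmup, cosine], milestones=[5])
```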
Circularity Check
No significant circularity; the result is a held-out empirical comparison.
Full rationale
The central claim is that a Mamba trained solely on next-step spike-count forecasting yields predicted rates whose linear readout decodes behavior better than a matched-context linear decoder on raw spikes. This is evaluated on held-out trials from the Steinmetz benchmark (1,994 trials across 39 sessions). The reported accuracies (75.7% choice, 66.1% stimulus) are direct performance measurements on independent test data and do not reduce by construction to any fitted parameter, self-referential equation, or self-citation chain. No load-bearing step invokes a uniqueness theorem, ansatz smuggled via prior work, or renaming of a known result. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Free parameters (1)
- Per-session linear head weights
Axioms (1)
- Domain assumption: the next-step spike-prediction objective captures decodable behavioral information in its internal representations.
Reference graph
Works this paper leans on
- [2] L. Duncker and M. Sahani. Temporal alignment and latent Gaussian process factor inference in population spike trains. NeurIPS, 2018.
- [3] A. Gu and T. Dao. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752, 2023.
- [5] S. W. Linderman, M. J. Johnson, A. C. Miller, R. P. Adams, D. M. Blei, and L. Paninski. Bayesian Learning and Inference in Recurrent Switching Linear Dynamical Systems. AISTATS, volume 54, pages 914-922, 2017.
- [7] C. Pandarinath et al. Inferring single-trial neural population dynamics using sequential auto-encoders. Nature Methods, 15(10):805-815, 2018.
- [8] J. W. Pillow, J. Shlens, L. Paninski, A. Sher, A. M. Litke, E. J. Chichilnisky, and E. P. Simoncelli. Spatio-temporal correlations and visual signalling in a complete neuronal population. Nature, 454(7207):995-999, 2008.
- [9] S. Schneider, J. H. Lee, and M. W. Mathis. Learnable latent embeddings for joint behavioural and neural analysis. Nature, 617(7960):360-368, 2023.
- [10] N. A. Steinmetz, P. Zatka-Haas, M. Carandini, and K. D. Harris. Distributed coding of choice, action and engagement across the mouse brain. Nature, 576(7786):266-273, 2019.
- [11] J. Ye and C. Pandarinath. Representation learning for neural population activity with neural data transformers. arXiv:2108.01210, 2021.
- [12] J. Ye, J. Collinger, L. Wehbe, and R. Gaunt. Neural Data Transformer 2: Multi-context pretraining for neural spiking activity. NeurIPS, 2023.
- [13] Y. Zhang, Y. Wang, M. Azabou, A. Andre, Z. Wang, H. Lyu, The International Brain Laboratory, E. L. Dyer, L. Paninski, and C. L. Hurwitz. Neural encoding and decoding at scale. ICML, PMLR volume 267, 2025.