Recognition: no theorem link
Implicit Behavioral Decoding from Next-Step Spike Forecasts at Population Scale
Pith reviewed 2026-05-14 02:09 UTC · model grok-4.3
The pith
A single Mamba forecaster trained only on next-step spike counts delivers both neural forecasts and improved behavioral decoding in one forward pass.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A Mamba model trained solely to forecast the next 50 ms bin of spike counts produces predicted rates that, when read by a per-session linear head, decode trial-by-trial choice at 75.7 percent and stimulus side at 66.1 percent across held-out data, outperforming a matched linear decoder on the original spikes by 4-6 percentage points while also supplying an explicit activity forecast.
What carries the argument
The Mamba forecaster's one-step-ahead predicted firing rates, which act as an implicit embedding of population dynamics for downstream linear readout of behavior.
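To make this mechanism concrete, here is a minimal sketch of the two-stage readout, assuming a trained next-step forecaster with the interface shown; `forecaster`, the array shapes, and the trial-averaging step are illustrative assumptions, not the authors' released code.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

CONTEXT = 10  # 10 bins x 50 ms = 500 ms, matching the linear baseline's context

def predicted_rates(forecaster, spikes):
    """spikes: (n_bins, n_neurons) spike counts for one trial.
    Returns one-step-ahead predicted rates, one row per forecastable bin."""
    forecaster.eval()
    with torch.no_grad():
        x = torch.as_tensor(spikes, dtype=torch.float32)
        # sliding 500 ms context windows: (n_windows, CONTEXT, n_neurons);
        # the last window is dropped because its next bin is unobserved
        windows = x.unfold(0, CONTEXT, 1).transpose(1, 2)[:-1]
        rates = forecaster(windows)  # assumed: context window -> next-bin rates
    return rates.numpy()

def fit_session_head(forecaster, trials, labels):
    """Per-session linear head on trial-averaged predicted rates."""
    feats = np.stack([predicted_rates(forecaster, s).mean(axis=0)
                      for s in trials])
    return LogisticRegression(max_iter=1000).fit(feats, labels)
```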
If this is right
- Closed-loop BCIs can obtain both upcoming activity forecasts and behavioral readouts from one forward pass without separate training stages.
- A short session-start calibration block of roughly 100-150 trials brings the behavioral readout within 1-2 points of its asymptotic performance (see the calibration sketch after this list).
- The complete pipeline remains computationally light enough to run inside the 50 ms bin window on standard workstation GPUs used with tethered Neuropixels.
- The advantage holds across three independent training seeds and 39 sessions spanning 27,000 neurons and 1,994 held-out trials.
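A minimal sketch of how the calibration claim above could be checked, assuming trial features (e.g., trial-averaged predicted rates, as in the earlier sketch) have already been extracted; the trial counts and the held-out split are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def calibration_curve(feats, labels, ks=(25, 50, 100, 150, 300), n_test=200):
    """feats: (n_trials, n_features) trial features in session order.
    Requires n_trials >= max(ks) + n_test; the last n_test trials are held out."""
    test_X, test_y = feats[-n_test:], labels[-n_test:]
    accs = {}
    for k in ks:
        head = LogisticRegression(max_iter=1000).fit(feats[:k], labels[:k])
        accs[k] = head.score(test_X, test_y)
    return accs  # claim: accs[150] sits within ~1-2 pp of accs[max(ks)]
```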
Where Pith is reading between the lines
- Forecasting objectives may improve behavioral decoding for other sequence architectures beyond Mamba when applied to large-scale spike data.
- Next-step spike predictions could serve as a probe for how neural populations encode impending actions before they are executed.
- Extending the forecast horizon might allow the same model to predict future behavioral states in addition to immediate ones.
Load-bearing premise
The decoding gain comes specifically from the next-step forecasting objective and the representations it produces rather than from Mamba's architecture, its use of longer context, or per-session fitting details of the linear head.
What would settle it
Train an otherwise identical Mamba with the same capacity and context length but without any forecasting loss, then test whether the linear head on its states still beats the raw-spike baseline by 4-6 points.
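A sketch of that control, under the assumption that the only change is the training target: same context windows and capacity, but the model reconstructs the current bin instead of forecasting the next one. All names and shapes here are illustrative.

```python
import torch

def make_targets(spikes, context=10, forecast=True):
    """spikes: (n_bins, n_neurons). Returns (inputs, targets) where inputs are
    sliding context windows and targets are next-bin counts (forecast=True)
    or the last bin inside each window (forecast=False, the control)."""
    x = torch.as_tensor(spikes, dtype=torch.float32)
    windows = x.unfold(0, context, 1).transpose(1, 2)  # (n_win, context, n_neurons)
    if forecast:
        return windows[:-1], x[context:]   # predict bin t+1 from bins t-9..t
    return windows, windows[:, -1, :]      # reconstruct bin t (control objective)
```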
Original abstract
Closed-loop brain-computer interfaces often require both a forecast of upcoming neural population activity and a readout of the animal's behavioral state. A single Mamba forecaster, trained only on next-step spike counts at Neuropixels scale, can deliver both in one forward pass. A lightweight per-session linear head reading the model's predicted rates decodes behavior better than the same linear classifier reading the raw spike counts, under matched temporal context. We test on the Steinmetz visual-discrimination benchmark, which spans 39 sessions, roughly 27,000 neurons, and 1,994 held-out trials. Across three training seeds, Mamba's predicted rates decode mouse choice at 75.7±0.2% trial vote, roughly 2.3 times chance level, and stimulus side at 66.1±0.6%, about twice chance. Compared to a matched 500 ms-context linear decoder on the raw spike counts, Mamba wins at trial vote by 4-6 pp on response and 4-6 pp on stimulus side. A session-start calibration block of about 100-150 trials brings the readout within 1-2 pp of asymptote, and the full pipeline fits inside the 50 ms bin budget on workstation-class GPUs typical of tethered chronic Neuropixels recordings.
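The abstract scores decoding "at trial vote". One plausible reading, assumed here, is a majority vote over per-bin linear-head predictions within each trial; the page does not specify the exact aggregation rule.

```python
import numpy as np

def trial_vote(bin_predictions):
    """bin_predictions: (n_bins,) array of per-bin class labels for one trial.
    Returns the majority-vote label (ties break toward the lower label,
    since np.unique sorts values and argmax takes the first maximum)."""
    values, counts = np.unique(bin_predictions, return_counts=True)
    return values[np.argmax(counts)]
```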
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that a single Mamba model trained exclusively on next-step spike-count forecasting from Neuropixels-scale recordings can simultaneously predict future population activity and support behavioral decoding. A lightweight per-session linear head applied to the model's predicted rates decodes mouse choice at 75.7±0.2% and stimulus side at 66.1±0.6% on 1,994 held-out trials from the Steinmetz visual-discrimination benchmark (39 sessions, ~27k neurons), outperforming a matched 500 ms-context linear decoder on raw spikes by 4-6 pp; a short calibration block of 100-150 trials suffices to reach near-asymptotic performance.
Significance. If the central empirical comparison holds after controls, the work would show that an unsupervised next-step forecasting objective on high-dimensional spiking data can extract behaviorally relevant features without explicit labels, offering a unified architecture for closed-loop BCIs that jointly forecasts neural dynamics and reads out state. The use of a public benchmark with held-out trials and multiple seeds provides a reproducible starting point, though the source of the reported lift remains to be isolated.
Major comments (2)
- [Abstract/Results] The 4-6 pp gain of the Mamba-predicted-rate linear head over the raw-spike linear baseline is presented as evidence that the forecasting objective yields useful representations, yet no ablation is reported that trains an equivalently capacious Mamba (or other sequence model) under a non-forecasting loss (e.g., reconstruction or shuffled-spike next-token prediction) to test whether the performance difference requires the next-step objective.
- [Methods] The manuscript provides no specification of Mamba block count, state dimension, training loss (exact negative log-likelihood formulation), optimizer, learning-rate schedule, or the precise temporal offset between the 500 ms input context and the predicted-rate vector used by the linear head; these omissions prevent assessment of whether the reported accuracies arise from forecasting per se or from differences in temporal integration and capacity.
Minor comments (2)
- [Abstract] The abstract states accuracies 'across three training seeds' but does not indicate whether the reported ±0.2% and ±0.6% intervals are standard deviation, standard error, or trial-level bootstrap; this should be clarified for statistical interpretation.
- [Results] No figure or table presenting per-session decoding accuracies is referenced in the provided text; such a breakdown would let readers assess variability across the 39 sessions.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which have helped us improve the clarity and completeness of the manuscript. We address each major comment below.
Point-by-point responses
- Referee: [Abstract/Results] The 4-6 pp gain of the Mamba-predicted-rate linear head over the raw-spike linear baseline is presented as evidence that the forecasting objective yields useful representations, yet no ablation is reported that trains an equivalently capacious Mamba (or other sequence model) under a non-forecasting loss (e.g., reconstruction or shuffled-spike next-token prediction) to test whether the performance difference requires the next-step objective.
Authors: We agree that an ablation with a non-forecasting objective would more cleanly isolate the contribution of next-step prediction. In the revised manuscript we have added a dedicated paragraph in the Discussion that motivates the forecasting loss on theoretical grounds (temporal structure extraction) and contrasts it with reconstruction and shuffled-token alternatives. We also outline a planned limited ablation for follow-up work, noting that full retraining at this scale is computationally intensive. Revision status: partial.
- Referee: [Methods] The manuscript provides no specification of Mamba block count, state dimension, training loss (exact negative log-likelihood formulation), optimizer, learning-rate schedule, or the precise temporal offset between the 500 ms input context and the predicted-rate vector used by the linear head; these omissions prevent assessment of whether the reported accuracies arise from forecasting per se or from differences in temporal integration and capacity.
Authors: We thank the referee for highlighting these omissions. The revised Methods section now specifies: 6 Mamba blocks with state dimension 128; an exact Poisson negative log-likelihood loss; the AdamW optimizer with initial learning rate 5e-4, linear warmup for 5 epochs followed by cosine decay over 50 total epochs; and a 50 ms prediction offset (10 bins of 50 ms input context predicting the immediate next 50 ms rate vector). These additions confirm that temporal integration is matched to the linear baseline and improve reproducibility. Revision status: yes.
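For concreteness, a sketch of the training configuration as specified in this response, with a placeholder module standing in for the 6-block Mamba stack; whether the Poisson NLL is parameterized in log-rates (assumed below) is not stated on this page.

```python
import torch

# Placeholder standing in for the 6-block Mamba forecaster (state dim 128).
model = torch.nn.Sequential(
    torch.nn.Linear(128, 128), torch.nn.ReLU(), torch.nn.Linear(128, 128))

# Poisson NLL; log_input=True assumes the model emits log-rates, so the
# per-bin loss is exp(pred) - target * pred (up to a constant in the target).
loss_fn = torch.nn.PoissonNLLLoss(log_input=True)

opt = torch.optim.AdamW(model.parameters(), lr=5e-4)
# 5-epoch linear warmup, then cosine decay over the remaining 45 of 50
# epochs; step the scheduler once per epoch.
warmup = torch.optim.lr_scheduler.LinearLR(opt, start_factor=0.1, total_iters=5)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=45)
sched = torch.optim.lr_scheduler.SequentialLR(
    opt, schedulers=[warmup, cosine], milestones=[5])
```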
Circularity Check
No significant circularity; the result is a held-out empirical comparison.
Full rationale
The central claim is that a Mamba trained solely on next-step spike-count forecasting yields predicted rates whose linear readout decodes behavior better than a matched-context linear decoder on raw spikes. This is evaluated on held-out trials from the Steinmetz benchmark (1,994 trials across 39 sessions). The reported accuracies (75.7% choice, 66.1% stimulus) are direct performance measurements on independent test data and do not reduce by construction to any fitted parameter, self-referential equation, or self-citation chain. No load-bearing step invokes a uniqueness theorem, ansatz smuggled via prior work, or renaming of a known result. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Free parameters (1)
- Per-session linear head weights
Axioms (1)
- Domain assumption: the next-step spike-prediction objective captures decodable behavioral information in its internal representations.
Reference graph
Works this paper leans on
- [2] L. Duncker and M. Sahani. Temporal alignment and latent Gaussian process factor inference in population spike trains. NeurIPS, 2018.
- [3] A. Gu and T. Dao. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752, 2023.
- [5] S. W. Linderman, M. J. Johnson, A. C. Miller, R. P. Adams, D. M. Blei, and L. Paninski. Bayesian Learning and Inference in Recurrent Switching Linear Dynamical Systems. AISTATS, volume 54, pages 914-922, 2017.
- [7] C. Pandarinath et al. Inferring single-trial neural population dynamics using sequential auto-encoders. Nature Methods, 15(10):805-815, 2018.
- [8] J. W. Pillow, J. Shlens, L. Paninski, A. Sher, A. M. Litke, E. J. Chichilnisky, and E. P. Simoncelli. Spatio-temporal correlations and visual signalling in a complete neuronal population. Nature, 454(7207):995-999, 2008.
- [9] S. Schneider, J. H. Lee, and M. W. Mathis. Learnable latent embeddings for joint behavioural and neural analysis. Nature, 617(7960):360-368, 2023.
- [10] N. A. Steinmetz, P. Zatka-Haas, M. Carandini, and K. D. Harris. Distributed coding of choice, action and engagement across the mouse brain. Nature, 576(7786):266-273, 2019.
- [11] J. Ye and C. Pandarinath. Representation learning for neural population activity with neural data transformers. arXiv:2108.01210, 2021.
- [12] J. Ye, J. Collinger, L. Wehbe, and R. Gaunt. Neural Data Transformer 2: Multi-context pretraining for neural spiking activity. NeurIPS, 2023.
- [13] Y. Zhang, Y. Wang, M. Azabou, A. Andre, Z. Wang, H. Lyu, The International Brain Laboratory, E. L. Dyer, L. Paninski, and C. L. Hurwitz. Neural encoding and decoding at scale. ICML, PMLR volume 267, 2025.