pith. machine review for the scientific record.

arxiv: 2604.04958 · v2 · submitted 2026-04-03 · 🧬 q-bio.QM · cs.AI · q-bio.NC

Recognition: 2 theorem links · Lean Theorem

Self-Supervised Foundation Model for Calcium-imaging Population Dynamics

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 18:27 UTC · model grok-4.3

classification 🧬 q-bio.QM · cs.AI · q-bio.NC
keywords calcium imaging · self-supervised learning · foundation model · neural population dynamics · forecasting · behavior decoding · transformer

The pith

A self-supervised model pretrained on calcium traces forecasts neural population dynamics better than specialized baselines and adapts to decode behavior.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CalM, a self-supervised foundation model trained only on neuronal calcium traces. It uses a tokenizer to convert single-neuron traces into a shared discrete vocabulary and a dual-axis autoregressive transformer to capture dependencies across neurons and time. After pretraining on large multi-animal data, CalM outperforms strong specialized baselines on forecasting future population activity. Adding a simple task head lets it decode animal behavior more accurately than models trained directly with supervision. Linear probes on the learned representations also expose interpretable functional structures in the neural data.
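The dual-axis design can be read as attention applied alternately along the temporal axis (within each neuron, causally, for autoregression) and the neural axis (across neurons at each time step). The sketch below is a minimal single-head illustration of that factorization, not the paper's implementation: learned Q/K/V projections, multi-head structure, and the NQ token embeddings are all omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, causal=False):
    # x: (seq, d). Single-head attention with identity Q/K/V projections,
    # kept minimal for illustration.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    if causal:
        # mask future positions so step t only attends to steps <= t
        scores = np.where(np.tri(len(x), dtype=bool), scores, -np.inf)
    return softmax(scores) @ x

def dual_axis_block(h):
    # h: (n_neurons, n_time, d). Causal attention along time for each
    # neuron, then full attention across neurons at each time step.
    h = np.stack([self_attention(h[i], causal=True) for i in range(h.shape[0])])
    h = np.stack([self_attention(h[:, t]) for t in range(h.shape[1])], axis=1)
    return h

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 8, 16))  # 5 neurons, 8 time steps, embedding dim 16
out = dual_axis_block(h)
```

The causal mask on the temporal axis is what makes next-token prediction well defined; the neural axis stays unmasked because neuron order carries no autoregressive meaning.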

Core claim

CalM is a self-supervised foundation model for calcium-imaging population dynamics trained solely on neuronal calcium traces. Its pretraining framework consists of a high-performance tokenizer that maps single-neuron traces into a shared discrete vocabulary and a dual-axis autoregressive transformer that models dependencies along both the neural and temporal axes. On the neural population dynamics forecasting task, CalM outperforms strong specialized baselines after pretraining. With a task-specific head, CalM further adapts to the behavior decoding task and achieves superior results compared with supervised decoding models. Linear analyses of CalM representations reveal interpretable functional structures beyond predictive accuracy.

What carries the argument

A tokenizer mapping single-neuron calcium traces to a shared discrete vocabulary, paired with a dual-axis autoregressive transformer capturing neural and temporal dependencies.
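As an illustration of the tokenizer half of this machinery, a nearest-neighbor vector quantizer maps short trace windows to indices in a shared codebook. This is a generic VQ sketch, not the paper's NQ network, which learns its codebook during training; the shapes here (16 codes, window length 8) are assumptions.

```python
import numpy as np

def quantize(windows, codebook):
    # windows: (n, d) short single-neuron trace segments; codebook: (K, d).
    # Each segment maps to the index of its nearest code -- the shared
    # discrete vocabulary the pith describes. The paper's NQ network learns
    # its codebook; here the codebook is simply given.
    d2 = ((windows[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

rng = np.random.default_rng(1)
codebook = rng.normal(size=(16, 8))  # K = 16 codes, window length 8 (illustrative)
tokens = quantize(rng.normal(size=(100, 8)), codebook)
```

Because every neuron's windows pass through the same codebook, recordings from different animals and sessions end up in one discrete vocabulary, which is what lets a single transformer be pretrained across datasets.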

If this is right

  • Pretrained CalM outperforms specialized baselines on neural population dynamics forecasting.
  • Adding a task-specific head lets CalM decode behavior more accurately than fully supervised models.
  • Linear analyses of the representations uncover interpretable functional structures in the neural population.
  • The approach supports scalable pretraining for multiple functional neural analysis tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Pretrained representations could reduce the amount of labeled data needed for new neuroscience experiments.
  • The same backbone might adapt to other recording modalities such as electrophysiology with limited retraining.
  • Broad pretraining across animals could capture shared dynamical motifs that generalize across sessions or individuals.
  • Foundation-style models may eventually serve as starting points for analyzing many types of population recordings.

Load-bearing premise

The self-supervised pretraining with the tokenizer and dual-axis transformer learns representations that transfer effectively to multiple downstream tasks without requiring extensive task-specific architectural changes or data curation.

What would settle it

If a held-out multi-animal calcium dataset shows that pretrained CalM no longer outperforms baselines on forecasting accuracy or behavior decoding after adding the task head, the claimed transfer benefit would be falsified.
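The falsification test above presupposes a concrete forecasting metric. Figure 1 mentions mean correlation, so a hedged sketch of the comparison might score each model by the mean per-neuron Pearson correlation between its forecast and the held-out ground truth (the exact metric used in the paper is not specified in this extract).

```python
import numpy as np

def mean_neuron_correlation(pred, true):
    # pred, true: (n_neurons, n_time) forecast vs. held-out ground truth.
    # Returns the mean per-neuron Pearson correlation.
    rs = []
    for p, t in zip(pred, true):
        p = p - p.mean()
        t = t - t.mean()
        denom = np.sqrt((p * p).sum() * (t * t).sum())
        rs.append((p * t).sum() / denom if denom > 0 else 0.0)
    return float(np.mean(rs))
```

Under this metric the claim fails if, on a genuinely held-out multi-animal dataset, `mean_neuron_correlation` for pretrained CalM no longer exceeds that of the specialized baselines.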

Figures

Figures reproduced from arXiv: 2604.04958 by Qichen Qian, Xinhong Xu, Yimeng Zhang, Yuanlong Zhang.

Figure 1
Figure 1: NQ network and its performance. (A) Details of the NQ network. (B) Performance of the NQ network on held-in and held-out datasets. We train the NQ network only on the training sets of the held-in datasets (burgundy), and apply the trained model to all the other datasets for evaluation and to generate tokenized datasets (pink). The numbers show the mean correlation for each bar. (C) Example neural traces from raw data … view at source ↗
Figure 2
Figure 2: DAT network and CalM framework. We tokenize the traces and train the DAT model in an autoregressive manner. The total pretraining objective is L_total = L_r + λ_c L_c + λ_ent L_ent + λ_orth L_orth + λ_AR L_AR (Eq. 9). With the trained NQ model, trial-wise neural recordings are tokenized into discrete sequences with compressed temporal resolution, Z ∈ {1, 2, ..., K}^(N×T_d) … view at source ↗
Figure 3
Figure 3: Performance evaluation of CalM on the neural population dynamics forecasting task. view at source ↗
Figure 4
Figure 4: Performance evaluation of CalM on behavior decoding. view at source ↗
Figure 5
Figure 5: Linear analysis for the CalM framework. (A) PCA visualization shows that neurons with strong tuning to cue or choice are well separated in an unsupervised manner. (B) LDA analysis of all the neural embeddings shows that cue- and choice-encoding form orthogonal gradient structures. (C) Low-dimensional dynamics of forecasting results from CalM correlate with ground truth more closely than POCO. … view at source ↗
Figure 6
Figure 6: Confusion matrices for classification using CalM session embedding. view at source ↗
Figure 7
Figure 7: Shuffle analysis of the LDA structure shown in Figure … view at source ↗
Figure 8
Figure 8: Linear analysis of held-out dataset for the CalM framework. view at source ↗
read the original abstract

Recent work suggests that large-scale, multi-animal modeling can significantly improve neural recording analysis. However, for functional calcium traces, existing approaches remain task-specific, limiting transfer across common neuroscience objectives. To address this challenge, we propose \textbf{CalM}, a self-supervised neural foundation model trained solely on neuronal calcium traces and adaptable to multiple downstream tasks, including forecasting and decoding. Our key contribution is a pretraining framework, composed of a high-performance tokenizer mapping single-neuron traces into a shared discrete vocabulary, and a dual-axis autoregressive transformer modeling dependencies along both the neural and the temporal axis. We evaluate CalM on a large-scale, multi-animal, multi-session dataset. On the neural population dynamics forecasting task, CalM outperforms strong specialized baselines after pretraining. With a task-specific head, CalM further adapts to the behavior decoding task and achieves superior results compared with supervised decoding models. Moreover, linear analyses of CalM representations reveal interpretable functional structures beyond predictive accuracy. Taken together, we propose a novel and effective self-supervised pretraining paradigm for foundation models based on calcium traces, paving the way for scalable pretraining and broad applications in functional neural analysis. Code will be released soon.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes CalM, a self-supervised foundation model for calcium-imaging population dynamics. It consists of a tokenizer that maps single-neuron calcium traces to a shared discrete vocabulary and a dual-axis autoregressive transformer that models dependencies along both neural and temporal axes. Pretrained on a large multi-animal, multi-session dataset, CalM is claimed to outperform specialized baselines on neural population dynamics forecasting; with an added task-specific head it further achieves superior performance on behavior decoding relative to supervised models. Linear probes on the learned representations are said to reveal interpretable functional structures.

Significance. If the performance claims are substantiated with quantitative metrics, error bars, and statistical tests, the work would represent a meaningful step toward scalable, transferable representations for calcium-imaging data, potentially reducing reliance on task-specific architectures in functional neural analysis.

major comments (3)
  1. [Abstract and §4 (Results)] The central claims of outperformance on forecasting and decoding are stated without any numerical metrics, error bars, dataset sizes (number of neurons, sessions, animals), ablation results, or statistical tests, preventing evaluation of the reported gains over baselines.
  2. [§3.2 (Tokenizer) and §3.3 (Dual-axis transformer)] The discretization thresholds and vocabulary size are listed as free parameters, yet no sensitivity analysis or ablation is provided to show that the claimed transferability does not depend on these choices.
  3. [§4.2 (Behavior decoding)] The adaptation with a task-specific head is asserted to surpass supervised decoding models, but no details on the supervised baselines, training regimes, or cross-validation procedure are supplied, leaving the superiority claim unsupported.
minor comments (2)
  1. [§3.3] Notation for the dual-axis attention is introduced without an explicit equation; adding a compact formulation (e.g., Eq. (X)) would improve clarity.
  2. [Abstract] The manuscript states 'Code will be released soon' but provides no link or repository; a concrete availability statement is needed.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important areas for clarification. We address each major point below and commit to revisions that strengthen the manuscript without altering its core contributions.

read point-by-point responses
  1. Referee: [Abstract and §4 (Results)] The central claims of outperformance on forecasting and decoding are stated without any numerical metrics, error bars, dataset sizes (number of neurons, sessions, animals), ablation results, or statistical tests, preventing evaluation of the reported gains over baselines.

    Authors: We agree that the abstract and narrative in §4 are qualitative. The quantitative results—including specific performance metrics, error bars, dataset statistics (e.g., neuron counts, session and animal numbers), ablation tables, and statistical tests—are provided in the figures and tables of §4. In the revision we will insert the key numerical values and explicit cross-references into the main text of §4 and the abstract to make the gains immediately evaluable. revision: yes

  2. Referee: [§3.2 (Tokenizer) and §3.3 (Dual-axis transformer)] The discretization thresholds and vocabulary size are listed as free parameters, yet no sensitivity analysis or ablation is provided to show that the claimed transferability does not depend on these choices.

    Authors: The chosen thresholds and vocabulary size were determined via preliminary tuning for reconstruction fidelity and computational tractability. While the manuscript does not contain a dedicated sensitivity study, we recognize that explicit ablations would better support the transferability claim. We will add a new subsection (or appendix) reporting performance across a range of discretization thresholds and vocabulary sizes on the forecasting and decoding tasks. revision: yes

  3. Referee: [§4.2 (Behavior decoding)] The adaptation with a task-specific head is asserted to surpass supervised decoding models, but no details on the supervised baselines, training regimes, or cross-validation procedure are supplied, leaving the superiority claim unsupported.

    Authors: We regret the lack of these implementation details. The supervised baselines comprise standard models (linear regression, LSTM, and transformer variants) trained on identical data partitions and with the same cross-validation folds used for CalM. In the revision we will expand §4.2 with explicit descriptions of each baseline architecture, training hyperparameters, optimization settings, and the cross-validation protocol, together with the corresponding performance numbers. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents an empirical self-supervised pretraining framework (tokenizer + dual-axis autoregressive transformer) for calcium traces, with performance claims resting on direct comparisons to external specialized baselines on forecasting and decoding tasks. No load-bearing derivation reduces by construction to fitted parameters, self-citations, or self-definitional quantities. The pretraining objective and architecture choices are stated independently of the reported downstream gains, and linear analyses of representations are post-hoc interpretations rather than circular steps.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that calcium traces contain transferable structure across animals and sessions that self-supervised learning can extract. Standard machine-learning hyperparameters are present but not enumerated in the abstract.

free parameters (2)
  • Tokenizer vocabulary size and discretization thresholds
    Chosen to map continuous calcium traces into a shared discrete vocabulary; value not specified in abstract.
  • Transformer layer count, hidden dimension, and attention heads
    Model capacity hyperparameters tuned during pretraining; not reported in abstract.
axioms (1)
  • domain assumption: Calcium imaging traces from multiple animals and sessions share common underlying population dynamics that can be captured by self-supervised pretraining.
    Invoked to justify why a single pretrained model can adapt to forecasting and decoding without task-specific pretraining.

pith-pipeline@v0.9.0 · 5524 in / 1387 out tokens · 37241 ms · 2026-05-13T18:27:10.581242+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 8 internal anchors

  1. [1]

    Antoniades, A., Yu, Y., Canzano, J., Wang, W., and Smith, S. L. Neuroformer: Multimodal and multitask generative pretraining for brain data. arXiv preprint arXiv:2311.00136.

  2. [2]

    An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling

    Bai, S. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271.

  3. [3]

    Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

    Bengio, Y., Léonard, N., and Courville, A. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.

  4. [4]

    Language Models are Few-Shot Learners

    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

  5. [5]

    POCO: Scalable Neural Forecasting Through Population Conditioning

    Duan, Y., Chaudhry, H. T., Ahrens, M. B., Harvey, C. D., Perich, M. G., Deisseroth, K., and Rajan, K. POCO: Scalable neural forecasting through population conditioning. arXiv preprint arXiv:2506.14957.

  6. [6]

    Categorical Reparameterization with Gumbel-Softmax

    Jang, E., Gu, S., and Poole, B. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144.

  7. [7]

    Jolliffe, I. T. and Cadima, J. Principal component analysis: a review and recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374(2065):20150202.

  8. [8]

    Recurrent Switching Linear Dynamical Systems

    Linderman, S. W., Miller, A. C., Adams, R. P., Blei, D. M., Paninski, L., and Johnson, M. J. Recurrent switching linear dynamical systems. arXiv preprint arXiv:1610.08466.

  9. [9]

    iTransformer: Inverted Transformers Are Effective for Time Series Forecasting

    Liu, Y., Hu, T., Zhang, H., Wu, H., Wang, S., Ma, L., and Long, M. iTransformer: Inverted transformers are effective for time series forecasting. arXiv preprint arXiv:2310.06625.

  10. [10]

    A Time Series is Worth 64 Words: Long-term Forecasting with Transformers

    Nie, Y. A time series is worth 64 words: Long-term forecasting with transformers. arXiv preprint arXiv:2211.14730.

  11. [11]

    NeuralForecast: User Friendly State-of-the-Art Neural Forecasting Models

    Olivares, K. G., Challú, C., Garza, A., Canseco, M. M., and Dubrawski, A. NeuralForecast: User friendly state-of-the-art neural forecasting models. PyCon Salt Lake City, Utah, US 2022.

  12. [12]

    GLU Variants Improve Transformer

    Shazeer, N. GLU variants improve transformer. arXiv preprint arXiv:2002.05202.

  13. [13]

    OpenAI GPT-5 System Card

    Singh, A., Fry, A., Perelman, A., Tart, A., Ganesh, A., El-Kishky, A., McLaughlin, A., Low, A., Ostrow, A., Ananthram, A., et al. OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267.

  14. [14]

    LFADS: Latent Factor Analysis via Dynamical Systems

    Sussillo, D., Jozefowicz, R., Abbott, L., and Pandarinath, C. LFADS: Latent factor analysis via dynamical systems. arXiv preprint arXiv:1608.06315.

  15. [15]

    Gemini: A Family of Highly Capable Multimodal Models

    Team, G., Anil, R., Borgeaud, S., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., Millican, K., et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.

  16. [16]

    Inpainting the Neural Picture: Inferring Unrecorded Brain Area Dynamics from Multi-Animal Datasets

    Xia, J., Zhang, Y., Wang, S., Allen, G. I., Paninski, L., Hurwitz, C. L., and Miller, K. D. Inpainting the neural picture: Inferring unrecorded brain area dynamics from multi-animal datasets. arXiv preprint arXiv:2510.11924.

  17. [17]

    Representation Learning for Neural Population Activity with Neural Data Transformers

    Ye, J. and Pandarinath, C. Representation learning for neural population activity with neural data transformers. arXiv preprint arXiv:2108.01210.

  18. [18]

    Exploiting correlations across trials and behavioral sessions to improve neural decoding. Neuron, 2025a

    Zhang, Y., Lyu, H., Hurwitz, C., Wang, S., Findling, C., Wang, Y., Hubert, F., Pouget, A., Varol, E., and Paninski, L. Exploiting correlations across trials and behavioral sessions to improve neural decoding. Neuron, 2025a. Zhang, Y., Wang, Y., Azabou, M., Andre, A., Wang, Z., Lyu, H., Laboratory, T. I. B., Dyer, E., Paninski, L., and Hurwitz, C. Neura...

  19. [19]

    S_t ∼ Poisson(((tanh(r_t) + 1)/2) · dt · λ_max) (17); K(t) = exp(−t/τ_r) − exp(−t/τ_ca) (18). We generate three sessions using three different random seeds

    is added to the traces. S_t ∼ Poisson(((tanh(r_t) + 1)/2) · dt · λ_max) (17); K(t) = exp(−t/τ_r) − exp(−t/τ_ca) (18). We generate three sessions using three different random seeds. Each session consists of 400 trials, which are split into training, validation, and test sets with a ratio of 70:15:15. Each trial contains calcium traces from 200 neurons over 100 time step...

  20. [20]

    Neural activity is represented as a collection of univariate time series

    is implemented using NeuralForecast (Olivares et al., 2022). Neural activity is represented as a collection of univariate time series. PatchTST tokenizes the input sequence into overlapping temporal patches. In our implementation, the patch length is set to 8 time steps with a stride of

  21. [21]

    We set the model dimension 64 with 4 attention heads and 2 transformer layers

    is also implemented using the NeuralForecast framework and follows the same data representation and evaluation protocol as PatchTST. We set the model dimension to 64 with 4 attention heads and 2 transformer layers. Dropout is set to 0.1. The model is trained using the Adam optimizer with a learning rate of 10^-3. POCO: For POCO (Duan et al., 2025), we preproce...

  22. [22]

    Only the learning rate is adjusted to ensure effective training

    on the multi-session decoding task, we perform a broad hyperparameter search on a small dataset containing 9 sessions and apply the best hyperparameters to the full 189 pre-train dataset. Only the learning rate is adjusted to ensure effective training. For single-session decoding, we perform a grid search on model size, latent step, number of latents and drop...
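Internal anchor 19 describes the synthetic-data generator of Eqs. (17)–(18): Poisson spike counts with a squashed rate, convolved with a difference-of-exponentials calcium kernel. The sketch below implements those two formulas under assumed constants; dt, λ_max, and the time constants τ_r and τ_ca are not stated in this extract, so the values here are purely illustrative.

```python
import numpy as np

def simulate_calcium(rates, dt=0.1, lam_max=5.0, tau_r=0.5, tau_ca=0.05, seed=0):
    # Eq. (17): S_t ~ Poisson(((tanh(r_t) + 1) / 2) * dt * lam_max)
    # Eq. (18): K(t) = exp(-t / tau_r) - exp(-t / tau_ca)
    # All constants are illustrative assumptions, not the paper's settings.
    rng = np.random.default_rng(seed)
    lam = (np.tanh(rates) + 1) / 2 * dt * lam_max
    spikes = rng.poisson(lam)                      # (n_neurons, n_time) counts
    t = np.arange(0.0, 2.0, dt)                    # kernel support (assumed 2 s)
    kernel = np.exp(-t / tau_r) - np.exp(-t / tau_ca)
    # causal convolution of each spike train with the calcium kernel
    traces = np.array([np.convolve(s, kernel)[: rates.shape[1]] for s in spikes])
    return spikes, traces
```

With latent rates of shape (200, 100) per trial, this reproduces the anchor's description of trials containing calcium traces from 200 neurons over 100 time steps.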
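Internal anchor 20 describes PatchTST's input representation: each univariate series is cut into overlapping temporal patches of length 8. The stride value is truncated in the extract, so the stride below is a purely illustrative choice, not the paper's setting.

```python
import numpy as np

def patchify(series, patch_len=8, stride=4):
    # patch_len = 8 follows the extract; the stride is truncated there,
    # so stride = 4 is an assumed, illustrative value.
    n = 1 + (len(series) - patch_len) // stride
    return np.stack([series[i * stride : i * stride + patch_len] for i in range(n)])

patches = patchify(np.arange(100.0))  # 100-step series -> (n_patches, 8)
```

Each row is then embedded as one token, so a 100-step trace becomes a short patch sequence rather than 100 per-step tokens.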