pith · machine review for the scientific record

arxiv: 2605.04309 · v1 · submitted 2026-05-05 · 💻 cs.NE

Recognition: unknown

Interpreting V1 Population Activity via Image-Neural Latent Representation Alignment


Pith reviewed 2026-05-08 17:07 UTC · model grok-4.3

classification 💻 cs.NE
keywords: V1 · neural decoding · contrastive alignment · low-level features · calcium imaging · visual cortex · population activity · interpretable decoding

The pith

A dual-tower alignment model shows that V1 population activity supports decoding mainly through coarse low-level visual features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Dual-Tower Image-Neural Alignment (DINA) to connect visual images with neural responses from primary visual cortex. By training separate pathways for images and brain data to align their intermediate features, it decodes the viewed stimulus accurately from activity patterns. The results indicate that this success comes mostly from matching coarse shapes and textures distributed across the image, not from fine details or semantic categories. The analysis also shows that sparse subsets of strongly responsive neurons carry most of the alignment.

Core claim

DINA jointly trains a biologically motivated dual-tower architecture that aligns visual stimuli and corresponding V1 population responses in a shared latent space at the level of intermediate feature maps. This enables both accurate decoding and direct access to interpretable feature maps. Evaluated on two-photon calcium imaging data from mouse V1, it reveals that decoding performance is primarily supported by coarse, low-level visual structure from multiple spatially distributed image regions, captured by sparse subsets of strongly responsive neurons.

What carries the argument

Dual-tower architecture aligning image and neural responses at intermediate feature map levels through contrastive training.
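This feature-level alignment can be sketched with a symmetric InfoNCE objective over batches of matched image/neural feature maps. The tower outputs, feature dimensionality, and temperature below are illustrative assumptions; the paper's actual architecture and hyperparameters may differ.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def infonce_loss(img_feats, neu_feats, temperature=0.1):
    """Symmetric InfoNCE over a batch of flattened feature maps.

    img_feats, neu_feats: (batch, dim) arrays, the image-tower and
    neural-tower features for corresponding stimulus/response pairs.
    Matched pairs sit on the diagonal of the similarity matrix.
    """
    z_i = l2_normalize(img_feats)
    z_n = l2_normalize(neu_feats)
    logits = z_i @ z_n.T / temperature  # (batch, batch) cosine similarities

    def xent(lg):
        # cross-entropy with the diagonal as the correct class
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        idx = np.arange(len(lg))
        return -logp[idx, idx].mean()

    # average both retrieval directions: image->neural and neural->image
    return 0.5 * (xent(logits) + xent(logits.T))
```

A perfectly aligned batch (identical features in both towers) should score a lower loss than a mismatched one, which is the property the contrastive objective optimizes.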

Load-bearing premise

That the feature alignments learned between images and neural data correspond to the actual computational processes in V1 rather than being shaped by the specific training method or data used.

What would settle it

Observing whether decoding accuracy drops significantly when the model is restricted to only fine details or semantic categories, or if new V1 recordings fail to show similar alignment patterns with low-level structures.
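One concrete version of the stimulus-restriction control is to band-limit the stimuli, keeping only coarse or only fine spatial frequencies, and compare decoding under each regime. A minimal FFT-based filter, assuming square grayscale images and a radial cutoff in cycles per image (neither specified by the paper):

```python
import numpy as np

def frequency_filter(img, cutoff, keep="low"):
    """Keep only low or high spatial frequencies of a 2-D image.

    cutoff: radius in the shifted frequency plane below (keep="low")
    or above (keep="high") which frequencies pass.
    """
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    r = np.hypot(yy - h / 2, xx - w / 2)  # radial frequency per bin
    mask = r <= cutoff if keep == "low" else r > cutoff
    return np.real(np.fft.ifft2(np.fft.ifftshift(f * mask)))
```

Because the two masks partition the frequency plane, the low-pass and high-pass outputs sum back to the original image, which makes the split a clean control: any decoding accuracy lost under one band must be carried by the other.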

Figures

Figures reproduced from arXiv: 2605.04309 by Feixiang Zhou, He Zhao, Hongyi Qin, Xin Wang, Zhongli Wu, Zhuangzhi Gao.

Figure 1: Illustration of feature-level contrastive alignment. Visual stimuli are transformed into intermediate image feature maps, while neural activities reconstruct intermediate neural feature maps. Encoding and decoding can be viewed as aligning these heterogeneous representations in a shared feature space, raising the central question of how image-derived and neural-derived feature maps can be aligned.
Figure 2: Architecture of the DINA model. An image tower and a neural tower independently project visual stimuli and V1 population responses into feature maps that are matched in dimensionality and aligned using a contrastive objective.
Figure 3: Top-5 image retrieval results for representative neural queries from the test set.
Figure 4: Image-neural alignment reflects coarse, low-level visual structure. (a) Latent feature maps produced by the image and neural towers for corresponding stimuli. (b) Variance spectra of natural images and aligned feature maps. (c) Image retrieval accuracy under different stimulus conditions, including natural, whitened, and low-dimensional (8D) images.
Figure 5: Image-tower interpretability reveals distributed spatial contributions to aligned feature maps. (a) Pathway masking analysis showing the structural similarity (SSIM) between feature maps obtained after masking either the local or global pathway and the original feature map. (b) Gaussian-windowed occlusion analysis applied to a representative stimulus image to estimate spatial contributions, using smooth ke…
Figure 6.
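The Gaussian-windowed occlusion analysis of Figure 5(b) can be approximated as follows: blend each image location toward the image mean under a Gaussian window and measure how far the image-tower features move. `feature_fn` stands in for the trained image tower, and the window width and stride are illustrative choices, not values from the paper.

```python
import numpy as np

def occlusion_map(img, feature_fn, sigma=4.0, stride=8):
    """Estimate per-location contribution via Gaussian-windowed occlusion.

    img: (h, w) grayscale image.
    feature_fn: maps an image to a feature vector (stand-in image tower).
    Returns a coarse grid of feature-change magnitudes, one per occluded
    location; larger values mean the location matters more to the features.
    """
    h, w = img.shape
    base = feature_fn(img)
    ys = list(range(0, h, stride))
    xs = list(range(0, w, stride))
    out = np.zeros((len(ys), len(xs)))
    mean_val = img.mean()
    yy, xx = np.mgrid[:h, :w]
    for i, cy in enumerate(ys):
        for j, cx in enumerate(xs):
            win = np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2 * sigma ** 2))
            # smoothly replace the windowed region with the mean luminance
            occluded = img * (1 - win) + mean_val * win
            out[i, j] = np.linalg.norm(feature_fn(occluded) - base)
    return out
```

With a feature function that depends only on one image quadrant, occluding that quadrant perturbs the features more than occluding elsewhere, which is the sanity check one would expect this analysis to pass.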
Original abstract

Understanding the neural mechanisms underlying visual computation has long been a central challenge in neuroscience. Recent alignment based approaches have improved the accuracy of decoding visual stimuli from brain activity, yet they provide limited insight into the neural computations that give rise to these improvements. To address this gap, we propose Dual-Tower Image-Neural Alignment (DINA), an interpretable contrastive framework for analyzing population level visual computations in primary visual cortex (V1). DINA jointly trains a biologically motivated dual-tower architecture that aligns visual stimuli and corresponding V1 population responses in a shared latent space at the level of intermediate feature maps, enabling both accurate decoding and direct access to interpretable feature maps. Evaluated on large-scale two-photon calcium imaging data from mouse V1, DINA achieves accurate neural-based decoding while revealing that decoding performance is primarily supported by coarse, low-level visual structure, rather than semantic category information or fine-grained details. Further analysis reveals that alignable feature maps emerge from multiple spatially distributed image regions, capturing both shape and texture cues, and are predominantly reconstructed by sparse subsets of strongly responsive neurons and their functional interactions. Together, these results confirm that, beyond enabling accurate decoding, DINA provides a principled framework for probing the computational mechanisms underlying visual processing in V1.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces Dual-Tower Image-Neural Alignment (DINA), a contrastive framework using a biologically motivated dual-tower architecture to align visual stimuli and V1 population responses in a shared latent space at intermediate feature maps. Evaluated on large-scale two-photon calcium imaging data from mouse V1, DINA is reported to achieve accurate neural-based decoding while showing that performance is primarily supported by coarse, low-level visual structure rather than semantic category information or fine-grained details. Further analyses indicate that alignable feature maps arise from multiple spatially distributed image regions (capturing shape and texture) and are reconstructed by sparse subsets of strongly responsive neurons and their functional interactions.

Significance. If the central claims hold, the work would be significant as an interpretable alternative to black-box decoding methods, linking decoding accuracy directly to specific visual feature types in V1 and providing a framework for probing computational mechanisms. Strengths include the large-scale empirical evaluation on two-photon data and the emphasis on intermediate feature-map alignment, which could bridge ML models with biological insights more mechanistically than standard approaches.

major comments (1)
  1. [Methods (contrastive loss definition) and Results (feature map analysis)] The central claim that decoding performance is primarily supported by coarse low-level structure (rather than semantics or fine details) is load-bearing for the interpretation of V1 mechanisms. However, the abstract and methods provide no controls separating this from an artifact of the contrastive objective (e.g., no ablation comparing contrastive loss to supervised classification or reconstruction-based objectives to test whether semantic features can be aligned when the loss is modified to emphasize them). This directly affects whether the alignment reflects V1 computations or shared low-level image statistics induced by training.
minor comments (2)
  1. The description of the dual-tower architecture and how intermediate feature maps are selected for alignment could be expanded with a diagram or pseudocode for reproducibility.
  2. Notation for the shared latent space and the contrastive loss terms should be defined more explicitly upon first use to aid readers unfamiliar with contrastive frameworks.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed review of our manuscript. We address the single major comment below and have prepared revisions to strengthen the work.

Point-by-point responses
  1. Referee: [Methods (contrastive loss definition) and Results (feature map analysis)] The central claim that decoding performance is primarily supported by coarse low-level structure (rather than semantics or fine details) is load-bearing for the interpretation of V1 mechanisms. However, the abstract and methods provide no controls separating this from an artifact of the contrastive objective (e.g., no ablation comparing contrastive loss to supervised classification or reconstruction-based objectives to test whether semantic features can be aligned when the loss is modified to emphasize them). This directly affects whether the alignment reflects V1 computations or shared low-level image statistics induced by training.

    Authors: We agree that explicit loss-function ablations would provide stronger evidence that the observed preference for coarse, low-level structure is not an artifact of the contrastive objective alone. Our current feature-map analyses already show that alignable representations arise from spatially distributed image regions encoding shape and texture rather than fine details or semantic categories, consistent with known V1 properties. Nevertheless, to directly test whether alternative objectives can align semantic information, the revised manuscript will include new experiments that replace the contrastive loss with (i) a supervised classification objective using semantic labels and (ii) a reconstruction-based objective. We will report the resulting alignment quality, decoding accuracy, and feature-map characteristics under each regime. This addition will clarify the extent to which the low-level bias is objective-dependent versus reflective of V1 population statistics. revision: yes
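The proposed ablation swaps the training objective while keeping both towers fixed. Minimal sketches of the two alternative objectives the rebuttal names, with shapes and names chosen for illustration (the paper does not specify these forms):

```python
import numpy as np

def reconstruction_loss(neu_feats, img_feats):
    """Reconstruction objective: neural-tower features regress the
    image-tower features directly (mean squared error)."""
    return np.mean((neu_feats - img_feats) ** 2)

def classification_loss(logits, labels):
    """Supervised objective: softmax cross-entropy on semantic labels,
    with logits produced by a classification head on the neural tower."""
    lg = logits - logits.max(axis=1, keepdims=True)
    logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()
```

Training the same towers under each objective and comparing alignment quality, decoding accuracy, and feature-map spectra is what would separate an objective-induced low-frequency bias from a genuine property of the V1 population code.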

Circularity Check

0 steps flagged

No circularity: DINA is a new contrastive training procedure whose outputs are not definitionally equivalent to its inputs.

full rationale

The paper introduces Dual-Tower Image-Neural Alignment (DINA) as an independent contrastive training framework that aligns image stimuli and V1 responses at intermediate feature maps. The claimed result—that decoding relies primarily on coarse low-level structure—is presented as an empirical outcome obtained by inspecting the trained alignment and performing further analysis on the resulting feature maps and neuron subsets. No equations, self-citations, or fitted parameters are shown that reduce this conclusion to a restatement of the alignment objective itself. The derivation chain therefore consists of a novel architecture plus post-hoc inspection rather than any self-definitional, fitted-input-renamed-as-prediction, or self-citation-load-bearing step.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond the DINA framework itself; the central claim rests on the unstated assumption that contrastive alignment captures biologically meaningful computations.

pith-pipeline@v0.9.0 · 5532 in / 1161 out tokens · 26362 ms · 2026-05-08T17:07:50.696065+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

55 extracted references · 2 canonical work pages
