Can neurons speak? Semantic narration of vision at single-cell resolution

Arnau Marin-Llobet; Demba Ba; Na Li; Richard Hakim; Sara Matias; Venkatesh N. Murthy

arxiv: 2606.18667 · v1 · pith:IPYMHS7Wnew · submitted 2026-06-17 · 🧬 q-bio.NC · q-bio.QM

Arnau Marin-Llobet, Richard Hakim, Sara Matias, Venkatesh N. Murthy, Na Li, Demba Ba

Pith reviewed 2026-06-26 18:50 UTC · model grok-4.3

classification 🧬 q-bio.NC q-bio.QM

keywords neural decoding · visual cortex · natural language generation · single neurons · spike trains · CLIP embeddings · mouse visual system · cell-type contribution

The pith

NEURRATOR converts spiking activity from single neurons into natural-language descriptions of viewed scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method that takes spike trains from arbitrary numbers of neurons recorded simultaneously in mouse visual cortex and maps them into the embedding space of a frozen vision-language model. From there it generates and validates free-form English narrations of natural movies without any language-side training. A reader would care because this turns the opaque responses of individual cells into readable statements about what the animal is seeing, rather than leaving them as abstract numbers or black-box vectors. The approach works on whole populations, single cortical areas, local groups, or genetically tagged cell types, and it measures how well decoding improves as more neurons are included.

Core claim

A learned encoder maps spike trains from any chosen subset of recorded neurons into the patch-embedding space of a frozen CLIP model; a multimodal language model then produces a description of the visual stimulus and a sparse autoencoder validates it, all without training on the language side. Applied to Neuropixels data from mouse visual cortex during natural movie viewing, the same pipeline yields coherent narrations from thousands of neurons, from one region, from small local populations, or from molecularly defined inhibitory cell types.

What carries the argument

The learned encoder that projects arbitrary spike-train subsets into the frozen CLIP patch-embedding space.

If this is right

Decoding accuracy increases measurably with the number of neurons included and varies across cortical regions.
Individual neurons and genetically tagged cell types can each be described in plain language for their specific contribution to the scene.
Cell identity can be treated as a functional probe that reveals what part of the visual world a given neuron represents.
The same pipeline works on populations of any size or composition without retraining the language model.
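
The scaling claim in the first item can be probed with a plain subsampling loop: draw random neuron subsets of increasing size, score each, and average. The sketch below only illustrates that loop. `scaling_curve`, `toy_score`, the neuron IDs, and every number are hypothetical; `toy_score` is a stand-in for the real encoder-plus-narration scoring, which this page does not specify.

```python
import random


def scaling_curve(neuron_ids, sizes, score_fn, n_repeats=5, seed=0):
    """Mean score per subset size, averaged over random subsets of neurons."""
    rng = random.Random(seed)  # seeded so the subsets are reproducible
    curve = {}
    for k in sizes:
        scores = [score_fn(rng.sample(neuron_ids, k)) for _ in range(n_repeats)]
        curve[k] = sum(scores) / len(scores)
    return curve


def toy_score(subset):
    """Hypothetical saturating score; NOT the paper's decoding metric."""
    return 1.0 - 1.0 / (1.0 + 0.05 * len(subset))


neuron_ids = list(range(200))  # hypothetical unit IDs
curve = scaling_curve(neuron_ids, sizes=[10, 50, 100], score_fn=toy_score)
for k, score in curve.items():
    print(f"{k:4d} neurons -> mean score {score:.3f}")
```

In a real analysis, `score_fn` would run the trained encoder on the chosen subset and return a narration-quality score such as the SBERT similarity mentioned in the figure captions.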

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The method could be used to compare semantic content across different visual areas or across different behavioral states in the same animal.
If the narrations remain stable when subsets of neurons are removed, that would indicate which cells carry redundant versus unique semantic information.
The approach might be tested on data from other sensory modalities to see whether the same encoder-to-language route applies outside vision.

Load-bearing premise

The encoder's embeddings in CLIP space keep the semantic content of the original visual stimulus without systematic distortion that would break the downstream language generation.

What would settle it

Present the same neurons with a set of short, controlled clips that differ only in one semantic feature (such as the presence or absence of a moving object) and test whether the generated narrations reliably mention that feature when the neurons are active and omit it when they are not.
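
The test described above reduces to counting how often a narration mentions the feature when it is present (hits) and when it is absent (false alarms). The following is a minimal sketch of that bookkeeping, assuming narrations are available as strings and each clip has a known presence label. The function names, keyword list, and example narrations are all hypothetical, and whole-word keyword matching is a simplification of whatever semantic scoring a real study would use.

```python
import re


def mentions(text, keywords):
    """True if any keyword appears as a whole word in the narration."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return any(k in words for k in keywords)


def mention_rates(narrations, feature_present, keywords):
    """Return (hit_rate, false_alarm_rate) for keyword mentions."""
    hits = misses = false_alarms = correct_rejections = 0
    for text, present in zip(narrations, feature_present):
        said = mentions(text, keywords)
        if present and said:
            hits += 1
        elif present:
            misses += 1
        elif said:
            false_alarms += 1
        else:
            correct_rejections += 1
    n_present = hits + misses
    n_absent = false_alarms + correct_rejections
    hit_rate = hits / n_present if n_present else float("nan")
    fa_rate = false_alarms / n_absent if n_absent else float("nan")
    return hit_rate, fa_rate


# Hypothetical narrations and labels (True = a moving object is in the clip).
narrations = [
    "A car drives along the road.",
    "An empty street at dusk.",
    "A vehicle passes a building.",
    "Shadows fall across the pavement.",
    "A car is parked near trees.",
    "Trees beside a quiet road.",
]
feature_present = [True, False, True, True, False, False]

hit_rate, fa_rate = mention_rates(narrations, feature_present, ("car", "vehicle"))
print(f"hit rate {hit_rate:.2f}, false-alarm rate {fa_rate:.2f}")
```

A high hit rate paired with a low false-alarm rate would support the premise; similar rates in both conditions would suggest the narrations do not track the feature.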

Figures

Figures reproduced from arXiv: 2606.18667 by Arnau Marin-Llobet, Demba Ba, Na Li, Richard Hakim, Sara Matias, Venkatesh N. Murthy.

**Figure 1.** NEURRATOR: language-aligned readout from spiking activity. (A) A learned neural encoder maps spike trains into the joint CLIP embedding space shared with the frozen CLIP image encoder; a frozen LLaVA then decodes the predicted embedding into a free-form description of the viewed scene. (B) Representative example decodings on held-out test frames from a natural movie. Image-to-text captions (gray) are produ…

**Figure 2.** Overview of the Neurrator framework. A trainable Neurrator Encoder processes Neuropixel recordings from a mouse viewing a visual stimulus, mapping spike trains to visual patch embeddings via multi-scale Conv1D layers, transformer encoders, and learned patch queries with cross-attention. A PatchInjector hooks into the frozen LLaVA model at runtime, replacing the output of its vision tower with the predicted…

**Figure 3.** Held-out narration quality. Left: semantic accuracy over time for contiguous-middle and front-only holdouts. Right: SBERT cosine vs random-sentence floor; *** p < 0.001. Across both regimes, decoded narrations on held-out frames remain semantically aligned with the visual content. We use Sentence-BERT (SBERT) as a metric to measure its semantic similarity (SBERT cosine: a sentence-level semantic similarity…

**Figure 4.** Decoding scales with neuron count across visual areas and animals. SBERT similarity between decoded narrations and ground-truth captions vs. number of neurons, by region. 0.170 ± 0.085 versus a 0.062 ± 0.073 random floor (∆ = +0.108, p < 0.001). Inspection of the example narrations (…

**Figure 5.** Cross-narration SBERT similarity by brain region and cell type (NM1 test frames, shared cosine scale 0–0.75). Region pools (left) collapse onto a single visual cluster; only hippocampus drops near the shuffled-GT floor. Cell-type pools (right) show PV–SST clustering with VIP separated. Region pooling collapses; cell-type pooling separates. We next asked whether different subpopulations produce semantical…

**Figure 6.** Per-frame narration excerpts on NM1. Two test frames where PV and SST describe the visible cars while VIP foregrounds lighting and shadow of the same scene. Optotagged cell types produce semantically distinct narrations. Switching the population label from anatomy to genetic identity inverts the picture seen at the regional level. To compare what the three cell-type pools say about the same movie, we compu…

**Figure 7.** Cell-type-specific decoding on NM1. (A) Per-frame SBERT similarity to BLIP-2 captions. (B,C) Time-resolved cosine of decoded narrations to “darkness or shadows” (B) and “a car or vehicle” (C); lines: smoothed per-cell-type means, dots: individual frames. What does each cell-type actually see? Narrations hint at differences but reveal nothing about the underlying visual content driving them. To recover reco…

**Figure 8.** Cell-type-unique SAE features map onto interpretable visual concepts. Left: z-scored mean activation of five “unique-by-magnitude” SAE features across cell types. Right: top ImageNet1k images per feature. … them apart. We need to check that this tail is a real property of each population and not an artifact of the particular test bins we happened to evaluate on. The standard tool for this is a bootstrap: we…
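
Several captions above compare a decoded-narration similarity against a random or shuffled floor, and the last one names the bootstrap as the tool for checking that a difference is not an artifact of the test bins. A generic paired percentile bootstrap on the mean difference can be sketched as below. This is not the paper's procedure, which is truncated in the caption; the function name, the resampling scheme, and all scores are hypothetical.

```python
import random
from statistics import mean


def bootstrap_mean_diff(decoded, floor, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(decoded) - mean(floor).

    Frames are resampled with replacement, keeping each frame's decoded
    score and floor score paired.
    """
    if len(decoded) != len(floor):
        raise ValueError("decoded and floor must have the same length")
    rng = random.Random(seed)
    n = len(decoded)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(mean(decoded[i] for i in idx) - mean(floor[i] for i in idx))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return mean(decoded) - mean(floor), lo, hi


# Hypothetical per-frame similarity scores; not values from the paper.
decoded_scores = [0.21, 0.10, 0.25, 0.17, 0.30, 0.08, 0.19, 0.14]
floor_scores = [0.05, 0.02, 0.11, 0.06, 0.09, 0.01, 0.07, 0.04]

point, lo, hi = bootstrap_mean_diff(decoded_scores, floor_scores)
print(f"mean difference {point:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

An interval that excludes zero indicates the gap over the floor is stable under resampling of the evaluated frames. It does not by itself rule out dependence between neighboring frames of the same movie.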

read the original abstract

Identifying what individual neurons encode in higher-order visual cortex is an open problem. Responses resist intuitive parameterization, and the deep-network embeddings used in their place are black boxes. Here, we introduce NEURRATOR, a framework that decodes spiking activity into free-form natural-language narration of the viewed scene at single-neuron resolution. A learned encoder maps spike trains from arbitrary subsets of simultaneously recorded neurons into the patch-embedding space of a frozen CLIP, from which a multimodal language model and sparse autoencoder generate and validate a description with no language-side training. Applied to Neuropixels recordings of mouse visual cortex during natural-movie viewing, NEURRATOR narrates from thousands of neurons, single cortical regions, local populations, or molecularly defined cell types. We use this property to (i) quantify how decoding fidelity scales with population size and cortical region, and (ii) "neurrate", in plain language, what individual neurons and genetically tagged inhibitory cell types contribute to visual representation. This recasts cell identity from a classification target into a functional probe of the visual system, providing a new unit of biological insight in neural systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

NEURRATOR routes spikes through CLIP to produce language narration at the single-cell level, but the abstract contains no numbers or examples with which to check that it works.

The main thing to know is that this paper introduces NEURRATOR, which encodes spike trains from arbitrary groups of neurons into the embedding space of a frozen CLIP model and then generates natural-language descriptions of the visual stimulus. The authors apply it to Neuropixels recordings from mouse visual cortex during natural movies and use the outputs to describe what single neurons or genetically tagged inhibitory types appear to represent.

What is actually new is the pipeline that maps neural data directly into CLIP patch space for zero-shot text generation with no language-side training. The work shows the method can run on thousands of neurons, on single cortical regions, on local populations, or on molecularly defined cell types, and it reports how decoding fidelity changes with population size and area.

The application to real data and the scaling measurements are concrete steps forward. Treating cell identity as a functional probe rather than a classification label is a reasonable reframing, and the frozen-CLIP choice keeps the language side from fitting to the neural recordings.

The soft spot is straightforward: the abstract supplies no quantitative results. There are no accuracy metrics, no narration examples, no ablation results on the encoder, and no description of how the spike-to-CLIP mapping is trained or how output fidelity is scored. The central assumption that the embeddings preserve semantic content without systematic distortion therefore has no reported evidence either way. The stress-test note correctly flags that no load-bearing technical objection can be raised from the given text alone, but that also means the claim rests on unshown work.

This is for readers in systems neuroscience who want to explore language-model bridges for neural interpretability. It could merit a serious referee review if the full manuscript contains the missing validation and methods details; the idea is clear enough that referees could evaluate the technical choices directly.

Referee Report

2 major / 0 minor

Summary. The paper introduces NEURRATOR, a framework that decodes spiking activity from arbitrary subsets of neurons recorded in mouse visual cortex during natural movie viewing into free-form natural-language narrations of the viewed scene at single-neuron resolution. A learned encoder maps spike trains into the patch-embedding space of a frozen CLIP model; a multimodal language model together with a sparse autoencoder then generates and validates descriptions with no language-side training. The method is applied to Neuropixels data to quantify how decoding fidelity scales with population size and cortical region and to narrate the functional contributions of individual neurons and molecularly-defined inhibitory cell types.

Significance. If the central mapping from spikes to semantically faithful CLIP embeddings can be shown to hold with appropriate controls, the work would recast single-cell identity as a functional probe and supply a new unit of insight into visual representation. The explicit use of a frozen external model and absence of language-side training on the neural data constitute a clear strength against circularity.

major comments (2)

[Abstract] The central claim that NEURRATOR produces accurate single-cell narrations is unsupported by any reported quantitative validation metrics, error bars, ablation studies, held-out test performance, or description of the encoder training procedure and narration-fidelity scoring; without these, the data-to-claim link cannot be evaluated.
[Abstract] The statement that the encoder 'preserves the semantic content of the visual stimulus without systematic distortion' is presented as a premise rather than a result; no alignment metric, reconstruction fidelity, or control experiment is described that would test this load-bearing assumption.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and for identifying areas where the abstract could more explicitly link claims to supporting evidence. We address each major comment below, clarifying where the manuscript already provides the requested details and indicating revisions we are prepared to make.

Referee: [Abstract] The central claim that NEURRATOR produces accurate single-cell narrations is unsupported by any reported quantitative validation metrics, error bars, ablation studies, held-out test performance, or description of the encoder training procedure and narration-fidelity scoring; without these, the data-to-claim link cannot be evaluated.

Authors: The abstract is a concise summary; the full manuscript reports quantitative validation in the Results and Methods sections. Decoding fidelity is quantified as a function of population size and cortical region, with error bars derived from multiple held-out test splits and bootstrap resampling. Ablation studies compare the learned encoder against linear baselines and untrained mappings. The encoder training procedure (including loss, optimizer, and regularization) is detailed in Methods, and narration fidelity is scored via sparse autoencoder reconstruction error plus consistency with the multimodal LM on held-out stimuli. We will revise the abstract to include one or two key quantitative anchors (e.g., fidelity scaling) while remaining within length limits.

revision: partial

Referee: [Abstract] The statement that the encoder 'preserves the semantic content of the visual stimulus without systematic distortion' is presented as a premise rather than a result; no alignment metric, reconstruction fidelity, or control experiment is described that would test this load-bearing assumption.

Authors: The manuscript presents preservation of semantic content as an empirical outcome of the encoder's training objective (mapping spikes to frozen CLIP patch embeddings). Alignment is measured by cosine similarity and retrieval accuracy between encoded spike embeddings and ground-truth CLIP embeddings on held-out movies. Reconstruction fidelity is quantified by the sparse autoencoder's ability to recover the original CLIP patches, and control experiments include shuffled spike trains and random encoders to rule out systematic distortion. We will rephrase the abstract to frame this explicitly as a validated result rather than an assumption and will add a brief parenthetical reference to the alignment metric.

revision: yes
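
The alignment check named in the simulated rebuttal (cosine similarity plus retrieval accuracy, with a shuffled control) can be illustrated on toy vectors. The sketch below assumes predicted and ground-truth embeddings are available as equal-length lists of floats, one pair per frame. The three-dimensional vectors and function names are hypothetical, and the control breaks the pairing with a fixed rotation rather than a random shuffle so the result is deterministic.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top1_retrieval_accuracy(predicted, targets):
    """Fraction of frames whose predicted embedding is closest to its own target."""
    correct = 0
    for i, p in enumerate(predicted):
        sims = [cosine(p, t) for t in targets]
        best = max(range(len(targets)), key=sims.__getitem__)
        correct += best == i
    return correct / len(predicted)


# Hypothetical ground-truth embeddings and noisy predictions for four frames.
targets = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
predicted = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.9], [0.7, 0.6, 0.1]]

# Control: rotate predictions by one frame so no prediction meets its own target.
mismatched = predicted[1:] + predicted[:1]

acc = top1_retrieval_accuracy(predicted, targets)
acc_control = top1_retrieval_accuracy(mismatched, targets)
print(f"matched top-1 accuracy {acc:.2f}, control {acc_control:.2f}")
```

With random permutations instead of a rotation, the control accuracy would average about 1/N for N frames, which is the chance level a real alignment score has to beat.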

Circularity Check

0 steps flagged

No significant circularity identified

The paper's core pipeline maps spike trains via a learned encoder into the embedding space of a frozen external CLIP model, then uses an off-the-shelf multimodal language model and sparse autoencoder for narration generation, with an explicit statement that there is no language-side training. No equations, training procedures, or self-citations are described that would reduce the reported narrations or fidelity metrics to quantities fitted from the same neural data used for evaluation. The derivation chain therefore remains self-contained against external benchmarks (CLIP, language models) and does not exhibit self-definitional, fitted-input, or self-citation load-bearing reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the method description mentions a learned encoder and frozen CLIP but supplies no equations or fitting details.


discussion (0)

Reference graph

Works this paper leans on

61 extracted references · 4 linked inside Pith

[1]

Neurons in the retina: organization, inhibition and excitation problems

Stephen W Kuffler. Neurons in the retina: organization, inhibition and excitation problems. InCold Spring Harbor Symposia on Quantitative Biology, volume 17, pages 281–292. Cold Spring Harbor Laboratory Press, 1952

1952
[2]

Reconstruction of natural scenes from ensemble responses in the lateral geniculate nucleus.Journal of Neuroscience, 19(18):8036–8042, 1999

Garrett B Stanley, Fei F Li, and Yang Dan. Reconstruction of natural scenes from ensemble responses in the lateral geniculate nucleus.Journal of Neuroscience, 19(18):8036–8042, 1999

1999
[3]

The connections of the middle temporal visual area (mt) and their relationship to a cortical hierarchy in the macaque monkey.Journal of Neuroscience, 3(12):2563–2586, 1983

John H Maunsell and David C van Essen. The connections of the middle temporal visual area (mt) and their relationship to a cortical hierarchy in the macaque monkey.Journal of Neuroscience, 3(12):2563–2586, 1983

1983
[4]

Distributed hierarchical processing in the primate cerebral cortex.Cerebral cortex (New York, NY: 1991), 1(1):1–47, 1991

Daniel J Felleman and David C Van Essen. Distributed hierarchical processing in the primate cerebral cortex.Cerebral cortex (New York, NY: 1991), 1(1):1–47, 1991

1991
[5]

Inferotemporal cortex and vision.Progress in physiological psychology, 5:77–123, 1973

Charles G Gross. Inferotemporal cortex and vision.Progress in physiological psychology, 5:77–123, 1973

1973
[6]

Neuronal selectivities to complex object features in the ventral visual pathway of the macaque cerebral cortex.Journal of neurophysiology, 71(3):856–867, 1994

Eucaly Kobatake and Keiji Tanaka. Neuronal selectivities to complex object features in the ventral visual pathway of the macaque cerebral cortex.Journal of neurophysiology, 71(3):856–867, 1994

1994
[7]

Comparing face patch systems in macaques and humans.Proceedings of the National Academy of Sciences, 105(49):19514–19519, 2008

Doris Y Tsao, Sebastian Moeller, and Winrich A Freiwald. Comparing face patch systems in macaques and humans.Proceedings of the National Academy of Sciences, 105(49):19514–19519, 2008

2008
[8]

face cells

Kasper Vinken, Jacob S Prince, Talia Konkle, and Margaret S Livingstone. The neural code for “face cells” is not face-specific.Science advances, 9(35):eadg1736, 2023

2023
[9]

Performance-optimized hierarchical models predict neural responses in higher visual cortex.Proceedings of the national academy of sciences, 111(23):8619–8624, 2014

Daniel LK Yamins, Ha Hong, Charles F Cadieu, Ethan A Solomon, Darren Seibert, and James J DiCarlo. Performance-optimized hierarchical models predict neural responses in higher visual cortex.Proceedings of the national academy of sciences, 111(23):8619–8624, 2014

2014
[10]

Using goal-driven deep learning models to understand sensory cortex.Nature neuroscience, 19(3):356–365, 2016

Daniel LK Yamins and James J DiCarlo. Using goal-driven deep learning models to understand sensory cortex.Nature neuroscience, 19(3):356–365, 2016

2016
[11]

Deep supervised, but not unsupervised, models may explain it cortical representation.PLoS computational biology, 10(11):e1003915, 2014

Seyed-Mahdi Khaligh-Razavi and Nikolaus Kriegeskorte. Deep supervised, but not unsupervised, models may explain it cortical representation.PLoS computational biology, 10(11):e1003915, 2014

2014
[12]

Brain-score: Which artificial neural network for object recognition is most brain-like?BioRxiv, page 407007, 2018

Martin Schrimpf, Jonas Kubilius, Ha Hong, Najib J Majaj, Rishi Rajalingham, Elias B Issa, Kohitij Kar, Pouya Bashivan, Jonathan Prescott-Roy, Franziska Geiger, et al. Brain-score: Which artificial neural network for object recognition is most brain-like?BioRxiv, page 407007, 2018

2018
[13]

The neural architecture of language: Integra- tive modeling converges on predictive processing.Proceedings of the National Academy of Sciences, 118(45):e2105646118, 2021

Martin Schrimpf, Idan Asher Blank, Greta Tuckute, Carina Kauf, Eghbal A Hosseini, Nancy Kan- wisher, Joshua B Tenenbaum, and Evelina Fedorenko. The neural architecture of language: Integra- tive modeling converges on predictive processing.Proceedings of the National Academy of Sciences, 118(45):e2105646118, 2021. 10

2021
[14]

Neural population control via deep image synthesis

Pouya Bashivan, Kohitij Kar, and James J DiCarlo. Neural population control via deep image synthesis. Science, 364(6439):eaav9436, 2019

2019
[15]

Evolving images for visual neurons using a deep generative network reveals coding principles and neuronal preferences.Cell, 177(4):999–1009, 2019

Carlos R Ponce, Will Xiao, Peter F Schade, Till S Hartmann, Gabriel Kreiman, and Margaret S Livingstone. Evolving images for visual neurons using a deep generative network reveals coding principles and neuronal preferences.Cell, 177(4):999–1009, 2019

2019
[16]

Deep convolutional models improve predictions of macaque v1 responses to natural images.PLoS computational biology, 15(4):e1006897, 2019

Santiago A Cadena, George H Denfield, Edgar Y Walker, Leon A Gatys, Andreas S Tolias, Matthias Bethge, and Alexander S Ecker. Deep convolutional models improve predictions of macaque v1 responses to natural images.PLoS computational biology, 15(4):e1006897, 2019

2019
[17]

A task-optimized neural network replicates human auditory behavior, predicts brain responses, and reveals a cortical processing hierarchy.Neuron, 98(3):630–644, 2018

Alexander JE Kell, Daniel LK Yamins, Erica N Shook, Sam V Norman-Haignere, and Josh H McDermott. A task-optimized neural network replicates human auditory behavior, predicts brain responses, and reveals a cortical processing hierarchy.Neuron, 98(3):630–644, 2018

2018
[18]

Dimensionality reduction for large-scale neural recordings.Nature neuroscience, 17(11):1500–1509, 2014

John P Cunningham and Byron M Yu. Dimensionality reduction for large-scale neural recordings.Nature neuroscience, 17(11):1500–1509, 2014

2014
[19]

Interpreting encoding and decoding models.Current opinion in neurobiology, 55:167–179, 2019

Nikolaus Kriegeskorte and Pamela K Douglas. Interpreting encoding and decoding models.Current opinion in neurobiology, 55:167–179, 2019

2019
[20]

Learnable latent embeddings for joint behavioural and neural analysis.Nature, 617(7960):360–368, 2023

Steffen Schneider, Jin Hwa Lee, and Mackenzie Weygandt Mathis. Learnable latent embeddings for joint behavioural and neural analysis.Nature, 617(7960):360–368, 2023

2023
[21]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

2021
[22]

Scaling up visual and vision-language representation learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. InInternational conference on machine learning, pages 4904–4916. PMLR, 2021

2021
[23]

Sigmoid loss for language image pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023

2023
[24]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023

2023
[25]

Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736, 2022

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736, 2022

2022
[26]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024

2024
[27]

Steering clip’s vision transformer with sparse autoencoders.arXiv preprint arXiv:2504.08729, 2025

Sonia Joseph, Praneet Suresh, Ethan Goldfarb, Lorenz Hufe, Yossi Gandelsman, Robert Graham, Danilo Bzdok, Wojciech Samek, and Blake Aaron Richards. Steering clip’s vision transformer with sparse autoencoders.arXiv preprint arXiv:2504.08729, 2025

arXiv 2025
[28]

Towards monosemanticity: Decomposing language models with dictionary learning.Transformer Circuits Thread,

Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Ch...
[29]

https://transformer-circuits.pub/2023/monosemantic-features/index.html

2023
[30]

Archetypal sae: Adaptive and stable dictionary learning for concept extraction in large vision models.arXiv preprint arXiv:2502.12892, 2025

Thomas Fel, Ekdeep Singh Lubana, Jacob S Prince, Matthew Kowal, Victor Boutin, Isabel Papadimitriou, Binxu Wang, Martin Wattenberg, Demba Ba, and Talia Konkle. Archetypal sae: Adaptive and stable dictionary learning for concept extraction in large vision models.arXiv preprint arXiv:2502.12892, 2025

arXiv 2025
[31]

Survey of spiking in the mouse visual system reveals functional hierarchy.Nature, 592(7852):86–92, 2021

Joshua H Siegle, Xiaoxuan Jia, Séverine Durand, Sam Gale, Corbett Bennett, Nile Graddis, Greggory Heller, Tamina K Ramirez, Hannah Choi, Jennifer A Luviano, et al. Survey of spiking in the mouse visual system reveals functional hierarchy.Nature, 592(7852):86–92, 2021. 11

2021
[32]

Neuropixels 2.0: A miniaturized high- density probe for stable, long-term brain recordings.Science, 372(6539):eabf4588, 2021

Nicholas A Steinmetz, Cagatay Aydin, Anna Lebedeva, Michael Okun, Marius Pachitariu, Marius Bauza, Maxime Beau, Jai Bhagat, Claudia Böhm, Martijn Broux, et al. Neuropixels 2.0: A miniaturized high- density probe for stable, long-term brain recordings.Science, 372(6539):eabf4588, 2021

2021
[33]

Brainclip: Bridging brain and visual-linguistic representation via clip for generic natural visual stimulus decoding.arXiv preprint arXiv:2302.12971, 2023

Yulong Liu, Yongqiang Ma, Wei Zhou, Guibo Zhu, and Nanning Zheng. Brainclip: Bridging brain and visual-linguistic representation via clip for generic natural visual stimulus decoding.arXiv preprint arXiv:2302.12971, 2023

arXiv 2023
[34]

A large-scale examination of inductive biases shaping high-level visual representation in brains and machines.Nature communications, 15(1):9383, 2024

Colin Conwell, Jacob S Prince, Kendrick N Kay, George A Alvarez, and Talia Konkle. A large-scale examination of inductive biases shaping high-level visual representation in brains and machines.Nature communications, 15(1):9383, 2024

2024
[35]

High-level visual representations in the human brain are aligned with large language models

Adrien Doerig, Tim C Kietzmann, Emily Allen, Yihan Wu, Thomas Naselaris, Kendrick Kay, and Ian Charest. High-level visual representations in the human brain are aligned with large language models. Nature Machine Intelligence, 7(8):1220–1234, 2025

2025
[36]

Natural speech reveals the semantic maps that tile human cerebral cortex.Nature, 532(7600):453–458, 2016

Alexander G Huth, Wendy A De Heer, Thomas L Griffiths, Frédéric E Theunissen, and Jack L Gallant. Natural speech reveals the semantic maps that tile human cerebral cortex.Nature, 532(7600):453–458, 2016

2016
[37]

Evidence of a predictive coding hierarchy in the human brain listening to speech.Nature human behaviour, 7(3):430–441, 2023

Charlotte Caucheteux, Alexandre Gramfort, and Jean-Rémi King. Evidence of a predictive coding hierarchy in the human brain listening to speech.Nature human behaviour, 7(3):430–441, 2023

2023
[38]

Brain encoding models based on multi- modal transformers can transfer across language and vision.Advances in neural information processing systems, 36:29654–29666, 2023

Jerry Tang, Meng Du, Vy V o, Vasudev Lal, and Alexander Huth. Brain encoding models based on multi- modal transformers can transfer across language and vision.Advances in neural information processing systems, 36:29654–29666, 2023

2023
[39]

High-resolution image reconstruction with latent diffusion models from human brain activity

Yu Takagi and Shinji Nishimoto. High-resolution image reconstruction with latent diffusion models from human brain activity. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14453–14463, 2023

2023
[40]

Mindeye2: Shared-subject models enable fmri-to-image with 1 hour of data.arXiv preprint arXiv:2403.11207, 2024

Paul S Scotti, Mihir Tripathy, Cesar Kadir Torrico Villanueva, Reese Kneeland, Tong Chen, Ashutosh Narang, Charan Santhirasegaran, Jonathan Xu, Thomas Naselaris, Kenneth A Norman, et al. Mindeye2: Shared-subject models enable fmri-to-image with 1 hour of data.arXiv preprint arXiv:2403.11207, 2024

arXiv 2024
[41]

Multiscale voxel based decoding for enhanced natural image reconstruction from brain activity

Mali Halac, Murat Isik, Hasan Ayaz, and Anup Das. Multiscale voxel based decoding for enhanced natural image reconstruction from brain activity. In2022 International Joint Conference on Neural Networks (IJCNN), pages 1–7. IEEE, 2022

2022
[42]

Semantic reconstruction of continuous language from non-invasive brain recordings.Nature Neuroscience, pages 1–9, 2023

Jerry Tang, Amanda LeBel, Shailee Jain, and Alexander G Huth. Semantic reconstruction of continuous language from non-invasive brain recordings.Nature Neuroscience, pages 1–9, 2023

2023
[43]

Decoding speech from non-invasive brain recordings.arXiv preprint arXiv:2208.12266, 2022

Alexandre Défossez, Charlotte Caucheteux, Jérémy Rapin, Ori Kabeli, and Jean-Rémi King. Decoding speech from non-invasive brain recordings.arXiv preprint arXiv:2208.12266, 2022

arXiv 2022
[44]

Tarr, and Leila Wehbe

Andrew Luo, Margaret Marie Henderson, Michael J. Tarr, and Leila Wehbe. BrainSCUBA: Fine-grained natural language captions of visual cortex selectivity. InThe Twelfth International Conference on Learning Representations, 2024

[45] Elana Simon and James Zou. InterPLM: Discovering interpretable features in protein language models via sparse autoencoders. Nature Methods, 22(10):2107–2117, 2025.

[46] Onkar Gujral, Mihir Bafna, Eric Alm, and Bonnie Berger. Sparse autoencoders uncover biologically interpretable features in protein language model representations. Proceedings of the National Academy of Sciences, 122(34):e2506316122, 2025.

[47] Laurence Freeman, Philip Shamash, Vinam Arora, Caswell Barry, Tiago Branco, and Eva Dyer. Beyond black boxes: Enhancing interpretability of transformers trained on neural data. arXiv preprint arXiv:2506.14014, 2025.

[48] Arnau Marin-Llobet, Arnau Manasanch, Leonardo Dalla Porta, Melody Torao-Angosto, and Maria V Sanchez-Vives. Neural models for detection and classification of brain states and transitions. Communications Biology, 8(1):599, 2025.

[49] Eric Kenji Lee, Asım Emre Gül, Greggory Heller, Anna Lakunina, Santiago Jaramillo, Pawel F Przytycki, and Chandramouli Chandrasekaran. PhysMAP: Interpretable in vivo neuronal cell type identification using multi-modal analysis of electrophysiological data. bioRxiv, pages 2024–02, 2024.

[50] Maxime Beau, David J Herzfeld, Francisco Naveros, Marie E Hemelt, Federico D’Agostino, Marlies Oostland, Alvaro Sánchez-López, Young Yoon Chung, Michael Maibach, Stephen Kyranakis, et al. A deep learning strategy to identify cell types across species from high-density extracellular recordings. Cell, 188(8):2218–2234, 2025.

[51] Han Yu, Hanrui Lyu, Ethan Yixun Xu, Charlie Windolf, Eric Kenji Lee, Fan Yang, Andrew M Shelton, Shawn Olsen, Sahar Minavi, Olivier Winter, et al. In vivo cell-type and brain region classification via multimodal contrastive learning. bioRxiv, pages 2024–11, 2025.

[52] Arnau Marin-Llobet, Zuwan Lin, Jongmin Baek, Almir Aljovic, Xinhe Zhang, Ariel J Lee, Wenbo Wang, Jaeyong Lee, Hao Shen, Yichun He, et al. An AI agent for cell-type specific brain computer interfaces. bioRxiv, 2025.

[53] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

[54] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv, abs/1908.10084, 2019.

[55] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[56] Yu Fu, Jason M Tucciarone, J Sebastian Espinosa, Nengyin Sheng, Daniel P Darcy, Roger A Nicoll, Z Josh Huang, and Michael P Stryker. A cortical circuit for gain control by behavioral state. Cell, 156(6):1139–1152, 2014.

[57] Hyun-Jae Pi, Balázs Hangya, Duda Kvitsiani, Joshua I Sanders, Z Josh Huang, and Adam Kepecs. Cortical interneurons that specialize in disinhibitory control. Nature, 503(7477):521–524, 2013.

[58] Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600, 2023.

[59] Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093, 2024.

[60] Yaara Yeshurun and Noam Sobel. An odor is not worth a thousand words: from multidimensional odors to unidimensional odor objects. Annual Review of Psychology, 61(1):219–241, 2010.

[61] Chuan Qin, Constantin Venhoff, Sonia Joseph, Fanyi Xiao, and Stefan Scherer. Sparse CLIP: Co-optimizing interpretability and performance in contrastive learning. arXiv, abs/2601.20075, 2026.

A Appendix

This appendix collects supporting material referenced from the main text. Section A.2 provides full architectural and training-configuration details for ...
