Brain Score Tracks Shared Properties of Languages: Evidence from Many Natural Languages and Structured Sequences
Pith reviewed 2026-05-10 11:00 UTC · model grok-4.3
The pith
Models trained on genomes, Python code, or nested parentheses achieve Brain Scores close to those of natural-language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Language models trained on diverse natural languages from many families yield similar Brain Scores, while models trained on the human genome, Python code, and nested parentheses also produce Brain Scores close to those of natural-language models. Brain Score therefore measures models' ability to extract common structural properties present across structured sequences rather than features unique to human language.
What carries the argument
Brain Score, defined as the accuracy with which language-model hidden states predict fMRI voxel responses recorded while humans read natural sentences.
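The pipeline that definition implies can be sketched in a few lines. Everything below is illustrative, not the paper's implementation: the shapes, the synthetic data, the ordinary least-squares mapping (the actual framework uses cross-validated regularized regression), and the Pearson-correlation summary are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: 200 sentence presentations, 64-dim hidden states, 50 voxels.
n_sent, n_dim, n_vox = 200, 64, 50
activations = rng.normal(size=(n_sent, n_dim))        # LM hidden states per sentence
true_map = rng.normal(size=(n_dim, n_vox))            # synthetic ground-truth mapping
voxels = activations @ true_map + rng.normal(scale=2.0, size=(n_sent, n_vox))

# Fit a linear map from activations to voxel responses on a train split,
# then predict the held-out sentences.
train, test = slice(0, 150), slice(150, None)
beta, *_ = np.linalg.lstsq(activations[train], voxels[train], rcond=None)
pred = activations[test] @ beta

def mean_voxel_correlation(y_true, y_pred):
    """Brain-Score-style summary: mean per-voxel Pearson r on held-out data."""
    yt = y_true - y_true.mean(axis=0)
    yp = y_pred - y_pred.mean(axis=0)
    r = (yt * yp).sum(axis=0) / np.sqrt((yt**2).sum(axis=0) * (yp**2).sum(axis=0))
    return r.mean()

score = mean_voxel_correlation(voxels[test], pred)
```

The point of the sketch is that the score rewards any representation from which voxel responses are linearly decodable, which is exactly why training data other than natural language can in principle do well on it.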
Load-bearing premise
That Brain Score still measures similarity to human language processing when the model has never seen natural language in its training data.
What would settle it
Finding that models trained on unstructured random sequences produce Brain Scores as high as those from genome, code, or natural-language training would show the metric does not track shared structural properties.
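The contrast between hierarchical and unstructured training data can be made concrete with toy generators. The functions, lengths, and depth limit below are illustrative choices, not the paper's actual data construction.

```python
import random

random.seed(0)

def nested_parens(length=40, depth_limit=6):
    """Balanced-bracket sequence: hierarchical structure, no lexical content."""
    out, depth = [], 0
    while len(out) < length:
        # Open when forced (depth 0) or by a fair coin below the depth limit.
        if depth == 0 or (depth < depth_limit and random.random() < 0.5):
            out.append("(")
            depth += 1
        else:
            out.append(")")
            depth -= 1
    out.extend(")" * depth)  # close whatever remains open
    return "".join(out)

def random_sequence(vocab="()", length=40):
    """Unstructured control: i.i.d. tokens, no hierarchical dependencies."""
    return "".join(random.choice(vocab) for _ in range(length))

structured = nested_parens()
control = random_sequence()
```

The two streams share a vocabulary and length statistics but differ only in whether long-range nesting constraints hold, which is the property the proposed control would isolate.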
Original abstract
Recent breakthroughs in language models (LMs) using neural networks have raised the question: how similar are these models' processing to human language processing? Results using a framework called Brain Score (BS) -- predicting fMRI activations during reading from LM activations -- have been used to argue for a high degree of similarity. To understand this similarity, we conduct experiments by training LMs on various types of input data and evaluate them on BS. We find that models trained on various natural languages from many different language families have very similar BS performance. LMs trained on other structured data -- the human genome, Python, and pure hierarchical structure (nested parentheses) -- also perform reasonably well and close to natural languages in some cases. These findings suggest that BS can highlight language models' ability to extract common structure across natural languages, but that the metric may not be sensitive enough to allow us to infer human-like processing from a high BS score alone.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates the specificity of Brain Score (BS) by training language models on a variety of natural languages from different families and on non-linguistic structured sequences including the human genome, Python code, and nested parentheses. It reports that BS performance is similar across natural languages and reasonably close for the structured data models, suggesting that BS measures shared structural properties rather than uniquely human language processing.
Significance. If the quantitative results support the qualitative claims, this paper makes a valuable contribution by providing evidence that high Brain Scores can be achieved without training on natural language, thereby limiting the inferences one can draw about human-like processing from BS alone. The use of diverse natural languages and multiple types of structured data is a strength, as is the focus on falsifying the specificity of the metric. This could prompt reevaluation of BS as a benchmark in the field.
major comments (3)
- [Results] The claim that LMs trained on genome, Python, and nested parentheses 'perform reasonably well and close to natural languages in some cases' is central but presented without specific numerical BS values, standard errors, or p-values comparing to natural language baselines (see also the abstract). This makes it impossible to assess the magnitude and reliability of the 'close' performance.
- [Methods] There is no mention of controlling for model size, training data volume, or number of parameters across the different training regimes. Since BS involves linear regression from LM activations, differences in model capacity could confound the comparisons between natural language and structured data models.
- [Discussion] The interpretation that BS 'may not be sensitive enough to allow us to infer human-like processing' rests on the assumption that the fMRI prediction task remains valid for models without natural language exposure; however, no analysis is provided on the quality of the linear mappings or voxel selection for these models.
minor comments (2)
- [Abstract] The abstract uses vague terms like 'reasonably well' and 'in some cases' without quantifying what that means; consider adding a brief mention of the range of BS values observed.
- [Introduction] Ensure that the definition of Brain Score is clearly restated with reference to the original BS paper for readers unfamiliar with the framework.
Simulated Author's Rebuttal
Thank you for the opportunity to respond to the referee's report. We appreciate the constructive feedback and the recognition of the paper's potential contribution. Below we provide point-by-point responses to the major comments, outlining the revisions we will make to address them.
Point-by-point responses
-
Referee: [Results] The claim that LMs trained on genome, Python, and nested parentheses 'perform reasonably well and close to natural languages in some cases' is central but presented without specific numerical BS values, standard errors, or p-values comparing to natural language baselines (see also the abstract). This makes it impossible to assess the magnitude and reliability of the 'close' performance.
Authors: We agree that the presentation would benefit from more precise quantitative details. In the revised manuscript, we will include a table or expanded figure caption with the exact Brain Score values for models trained on genome, Python, and nested parentheses, along with standard errors and p-values from statistical tests against the natural language baselines. This will enable readers to evaluate the degree of similarity more rigorously. revision: yes
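A statistical comparison of the kind promised here could take the form of a simple permutation test on per-subject Brain Scores. The scores below are invented for illustration; only the procedure (label shuffling, two-sided p-value) is the point.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical per-subject Brain Scores for two training regimes.
bs_natural = np.array([0.38, 0.41, 0.35, 0.40, 0.37, 0.39, 0.36, 0.42])
bs_parens  = np.array([0.33, 0.36, 0.31, 0.35, 0.34, 0.32, 0.30, 0.37])

observed = bs_natural.mean() - bs_parens.mean()

# Permutation test: shuffle condition labels, recompute the mean difference.
pooled = np.concatenate([bs_natural, bs_parens])
n = len(bs_natural)
n_perm = 10_000
null = np.empty(n_perm)
for i in range(n_perm):
    perm = rng.permutation(pooled)
    null[i] = perm[:n].mean() - perm[n:].mean()

# Two-sided p-value: fraction of shuffles at least as extreme as observed.
p_value = (np.abs(null) >= abs(observed)).mean()
```

Reporting the observed difference alongside such a p-value (or a standard error over subjects) would let readers judge how "close" the structured-data models really are.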
-
Referee: [Methods] There is no mention of controlling for model size, training data volume, or number of parameters across the different training regimes. Since BS involves linear regression from LM activations, differences in model capacity could confound the comparisons between natural language and structured data models.
Authors: We recognize this potential issue. Our models were based on comparable transformer architectures, but parameter counts and data volumes were not strictly matched across all conditions owing to the unique characteristics of each dataset. We will update the Methods section to detail the specific model configurations, including parameter numbers and training data sizes. We will also discuss the implications of any mismatches and consider adding a control experiment if possible in future work, though for the current revision we will at minimum provide transparency on these factors. revision: partial
-
Referee: [Discussion] The interpretation that BS 'may not be sensitive enough to allow us to infer human-like processing' rests on the assumption that the fMRI prediction task remains valid for models without natural language exposure; however, no analysis is provided on the quality of the linear mappings or voxel selection for these models.
Authors: The Brain Score framework applies the identical prediction pipeline, including linear regression and voxel selection (retaining voxels with significant cross-validated prediction), to all models irrespective of their training data. This ensures the task validity by construction. To address the referee's point, we will incorporate additional analyses in the revised version, such as reporting average R^2 values of the linear mappings and the number of selected voxels for the non-natural-language models, to confirm that the prediction quality is not substantially degraded. revision: yes
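The voxel-selection diagnostic proposed above can be sketched as follows. This is a minimal stand-in, not the paper's pipeline: shapes are made up, the mapping is plain least squares, and a fixed R² threshold replaces the cross-validated significance test the authors describe.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: 120 trials, 16-dim features; the first 15 of 30 voxels
# carry signal, the remaining 15 are pure noise.
n_trials, n_dim, n_vox = 120, 16, 30
X = rng.normal(size=(n_trials, n_dim))
W = np.zeros((n_dim, n_vox))
W[:, :15] = rng.normal(size=(n_dim, 15))
Y = X @ W + rng.normal(size=(n_trials, n_vox))

def cv_r2(X, Y, n_folds=4):
    """Per-voxel cross-validated R^2 from a least-squares linear mapping."""
    n = X.shape[0]
    folds = np.array_split(np.arange(n), n_folds)
    ss_res = np.zeros(Y.shape[1])
    ss_tot = np.zeros(Y.shape[1])
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        beta, *_ = np.linalg.lstsq(X[train_idx], Y[train_idx], rcond=None)
        resid = Y[test_idx] - X[test_idx] @ beta
        ss_res += (resid**2).sum(axis=0)
        ss_tot += ((Y[test_idx] - Y[train_idx].mean(axis=0))**2).sum(axis=0)
    return 1.0 - ss_res / ss_tot

scores = cv_r2(X, Y)
# Retain voxels whose held-out prediction clears a threshold (a stand-in for
# the significance criterion used in the actual pipeline).
selected = np.flatnonzero(scores > 0.1)
```

Reporting these per-voxel scores and selection counts for the genome-, code-, and parentheses-trained models would directly address the concern about mapping quality.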
Circularity Check
No significant circularity identified
full rationale
The paper reports empirical results from training LMs on natural languages, the human genome, Python code, and nested parentheses, then directly comparing their Brain Scores on fMRI prediction during language reading. No equations, parameter fitting, derivations, or self-citations are described that reduce any claim to its inputs by construction. All central findings are presented as measured outcomes of distinct training regimes evaluated against an independent external benchmark, so no claim holds merely by construction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[3]
Douglas Biber. 1988. Variation across speech and writing. Cambridge University Press, Cambridge, England
-
[4]
Douglas Biber. 1993. https://aclanthology.org/J93-2001/ Using Register-Diversified Corpora for General Language Studies. Computational Linguistics, 19(2):219--241
-
[5]
Zhenguang Cai, Xufeng Duan, David Haslett, Shuqi Wang, and Martin Pickering. 2024. https://doi.org/10.18653/v1/2024.cmcl-1.4 Do large language models resemble humans in language use? In Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, pages 37--56, Bangkok, Thailand. Association for Computational Linguistics
-
[6]
Cheng-Han Chiang and Hung-yi Lee. 2022. https://doi.org/10.1609/aaai.v36i10.21295 On the Transferability of Pre-trained Language Models: A Study from Artificial Datasets. Proceedings of the AAAI Conference on Artificial Intelligence, 36(10):10518--10525
-
[7]
Matthew S. Dryer and Martin Haspelmath, editors. 2013. https://doi.org/10.5281/zenodo.13950591 WALS Online (v2020.4). Zenodo
-
[8]
Manuel Faysse. 2023. https://huggingface.co/datasets/manu/project_gutenberg Project Gutenberg. Hugging Face Datasets
-
[9]
Ebrahim Feghhi, Nima Hadidi, Bryan Song, Idan A. Blank, and Jonathan C. Kao. 2024. https://doi.org/10.48550/arXiv.2406.01538 What Are Large Language Models Mapping to in the Brain? A Case Against Over-Reliance on Brain Scores. ArXiv:2406.01538 [cs]
-
[10]
Eghbal A. Hosseini, Martin Schrimpf, Yian Zhang, Samuel Bowman, Noga Zaslavsky, and Evelina Fedorenko. 2024. https://doi.org/10.1162/nol_a_00137 Artificial Neural Network Language Models Predict Human Brain Responses to Language Even After a Developmentally Realistic Amount of Training. Neurobiology of Language, 5(1):43--63
-
[11]
Michael Y. Hu, Jackson Petty, Chuan Shi, William Merrill, and Tal Linzen. 2025. https://doi.org/10.18653/v1/2025.acl-long.478 Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9691--9709
-
[12]
Liangze Jiang, Zachary Shinnick, Anton van den Hengel, Hemanth Saratchandran, and Damien Teney. 2026. https://doi.org/10.48550/arXiv.2601.21725 Procedural Pretraining: Warming Up Language Models with Abstract Data. ArXiv:2601.21725 [cs]
-
[13]
Carina Kauf, Greta Tuckute, Roger Levy, Jacob Andreas, and Evelina Fedorenko. 2024. https://doi.org/10.1162/nol_a_00116 Lexical-Semantic Content, Not Syntactic Structure, Is the Main Contributor to ANN-Brain Similarity of fMRI Responses in the Language Network. Neurobiology of Language, 5(1):7--42
-
[14]
Najoung Kim, Sebastian Schuster, and Shubham Toshniwal. 2024. https://doi.org/10.48550/arXiv.2405.21068 Code Pretraining Improves Entity Tracking Abilities of Language Models. ArXiv:2405.21068 [cs]
-
[15]
Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Carlos Muñoz Ferrandis, Yacine Jernite, Margaret Mitchell, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro von Werra, and Harm de Vries. 2022. The Stack: 3 TB of permissively licensed source code. Preprint
-
[16]
National Center for Biotechnology Information. 2022. https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.40/ Genome assembly GRCh38.p14. Accession No. GCF_000001405.40
-
[17]
Isabel Papadimitriou and Dan Jurafsky. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.554 Learning Music Helps You Read: Using Transfer to Study Linguistic Structure in Language Models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6829--6839, Online. Association for Computational Linguistics
-
[18]
Isabel Papadimitriou and Dan Jurafsky. 2023. https://doi.org/10.18653/v1/2023.findings-emnlp.563 Injecting structural hints: Using language models to study inductive biases in language learning. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 8402--8413, Singapore. Association for Computational Linguistics
-
[19]
Alexandre Pasquiou, Yair Lakretz, John T. Hale, Bertrand Thirion, and Christophe Pallier. 2022. https://proceedings.mlr.press/v162/pasquiou22a.html Neural Language Models are not Born Equal to Fit Brain Data, but Training Helps. In Proceedings of the 39th International Conference on Machine Learning, pages 17499--17516. PMLR
-
[20]
Francisco Pereira, Bin Lou, Brianna Pritchett, Samuel Ritter, Samuel J. Gershman, Nancy Kanwisher, Matthew Botvinick, and Evelina Fedorenko. 2018. https://doi.org/10.1038/s41467-018-03068-4 Toward a universal decoder of linguistic meaning from brain activation. Nature Communications, 9(1):963
-
[21]
Project Gutenberg. n.d. https://www.gutenberg.org/
-
[22]
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf Language Models are Unsupervised Multitask Learners. OpenAI Blog
-
[23]
Ryokan Ri and Yoshimasa Tsuruoka. 2022. https://doi.org/10.18653/v1/2022.acl-long.504 Pretraining with Artificial Language: Studying Transferable Knowledge in Language Models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7302--7315, Dublin, Ireland. Association for Computational Linguistics
-
[24]
Martin Schrimpf, Idan Asher Blank, Greta Tuckute, Carina Kauf, Eghbal A. Hosseini, Nancy Kanwisher, Joshua B. Tenenbaum, and Evelina Fedorenko. 2021. https://doi.org/10.1073/pnas.2105646118 The neural architecture of language: Integrative modeling converges on predictive processing. Proceedings of the National Academy of Sciences, 118(45):e2105646118
-
[25]
Martin Schrimpf, Jonas Kubilius, Ha Hong, Najib J. Majaj, Rishi Rajalingham, Elias B. Issa, Kohitij Kar, Pouya Bashivan, Jonathan Prescott-Roy, Franziska Geiger, Kailyn Schmidt, Daniel L. K. Yamins, and James J. DiCarlo. 2018. https://www.biorxiv.org/content/10.1101/407007v2 Brain-Score: Which artificial neural network for object recognition is most brain-like? bioRxiv
-
[26]
Martin Schrimpf, Jonas Kubilius, Michael J Lee, N Apurva Ratan Murty, Robert Ajemian, and James J DiCarlo. 2020. https://www.cell.com/neuron/fulltext/S0896-6273(20)30605-X Integrative benchmarking to advance neurally mechanistic models of human intelligence. Neuron
-
[27]
Wikimedia Foundation. 2023. https://huggingface.co/datasets/wikimedia/wikipedia Wikimedia Downloads
-
[28]
Ethan Gotlieb Wilcox, Richard Futrell, and Roger Levy. 2024. https://doi.org/10.1162/ling_a_00491 Using Computational Models to Test Syntactic Learnability . Linguistic Inquiry, 55(4):805--848