
arxiv: 2509.23108 · v2 · pith:SRZR5XUZ · submitted 2025-09-27 · 💻 cs.AI · cs.CL

Artificial Phantasia: Emergent Mental Imagery in Large Language Models

Pith reviewed 2026-05-21 21:37 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords: mental imagery · large language models · propositional representations · artificial phantasia · compositional transformations · emergent abilities · cognitive science · visual imagery

The pith

Large language models outperform humans on tasks that require imagining compositional letter and shape transformations, pointing to an emergent form of mental imagery that relies on language rather than pictures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether visual mental imagery can arise from language alone by extending a classic task that cognitive science has long held requires pictorial representations. Participants, including top LLMs and 100 human subjects, were asked to mentally apply sequences of letter and shape changes and report the final result. The strongest models scored significantly higher than people, with performance rising when models were allowed longer chains of reasoning steps. This pattern suggests the task can be solved through propositional operations on linguistic descriptions instead of internal pictures. The work therefore challenges the necessity of a pictorial format for mental imagery and labels the LLM capacity an artificial phantasia.

Core claim

The best LLMs achieved markedly higher accuracy than human participants on novel items that require mentally composing letter and shape transformations, with statistical significance at p < .0001. Accuracy improved when reasoning models were given more tokens for step-by-step linguistic manipulation. These results support the existence of an emergent, non-pictorial mental imagery capacity in LLMs that can be driven entirely by language.

What carries the argument

An extended version of a classic mental-imagery task in which subjects must imagine successive compositional transformations of letters and shapes and then identify the resulting figure.
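To make the "pure propositional imagery" hypothesis from the abstract concrete, here is a toy Python sketch of how an item in the style of the classic task ("Imagine a capital D. Rotate it 90 degrees counterclockwise. Put a J beneath it.") could be tracked as a set of symbolic propositions, with no picture ever constructed. It is not the authors' method or one of their items; the `Part`/`Scene` classes, the single-direction `facing` encoding, and the hand-written "umbrella" template are all hypothetical simplifications.

```python
from dataclasses import dataclass, field

# Where a part's curved side points after one 90-degree counterclockwise turn.
CCW = {"right": "up", "up": "left", "left": "down", "down": "right"}


@dataclass
class Part:
    letter: str
    facing: str  # hypothetical, deliberately crude encoding: direction of the curved side


@dataclass
class Scene:
    parts: dict = field(default_factory=dict)
    relations: set = field(default_factory=set)

    def add(self, name, part):
        self.parts[name] = part

    def rotate_ccw(self, name, quarter_turns=1):
        # Rotation is a symbolic relabelling of direction, not a pixel operation.
        part = self.parts[name]
        for _ in range(quarter_turns % 4):
            part.facing = CCW[part.facing]

    def place_below(self, name, part, anchor):
        self.add(name, part)
        self.relations.add((name, "below", anchor))


def describe(scene):
    """Flatten the scene into (subject, relation, object) propositions."""
    props = set(scene.relations)
    for name, part in scene.parts.items():
        props.add((name, "is", part.letter))
        props.add((name, "faces", part.facing))
    return props


def recognize(props):
    """Match propositions against one hand-written, hypothetical template."""
    names = {s for (s, r, _) in props if r == "is"}
    for a in names:
        for b in names:
            if ((a, "is", "D") in props and (a, "faces", "up") in props
                    and (b, "is", "J") in props and (b, "below", a) in props):
                return "umbrella"
    return "unknown"


# "Imagine a capital D. Rotate it 90 degrees counterclockwise. Put a J beneath it."
scene = Scene()
scene.add("d", Part("D", facing="right"))  # upright D: curved side faces right
scene.rotate_ccw("d")                       # curved side now faces up (a dome)
scene.place_below("j", Part("J", facing="left"), anchor="d")
print(recognize(describe(scene)))  # umbrella
```

The recognizer succeeds only because a template was written by hand for this one figure. Whether LLMs do anything like this, or lean on visio-linguistic priors instead, is the question the paper's three hypotheses are meant to separate.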

If this is right

  • Language alone can be sufficient for solving tasks previously assumed to need pictorial imagery.
  • Longer reasoning chains improve performance, showing a direct linguistic contribution to the imagery-like behavior.
  • LLMs may possess an emergent cognitive capacity that functions without internal pictures.
  • Traditional debates about whether mental imagery must be pictorial are reopened by the existence of this non-pictorial alternative.
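The reasoning-length effect in the second point can be summarized with a simple tabulation. The sketch below groups hypothetical (reasoning budget, correct) trial records by budget and checks whether accuracy rises with the budget. The budgets and outcomes are invented toy values, not the paper's data, and a real analysis would also need to model item and model effects (for example with a mixed-effects logistic regression).

```python
from collections import defaultdict


def accuracy_by_budget(records):
    """Map each reasoning budget to the fraction of correct trials at that budget."""
    counts = defaultdict(lambda: [0, 0])  # budget -> [n_correct, n_trials]
    for budget, correct in records:
        counts[budget][0] += int(correct)
        counts[budget][1] += 1
    return {b: c / n for b, (c, n) in sorted(counts.items())}


def is_nondecreasing(acc_by_budget):
    """True if accuracy never drops as the (sorted) budget grows."""
    vals = [acc_by_budget[b] for b in sorted(acc_by_budget)]
    return all(x <= y for x, y in zip(vals, vals[1:]))


# Toy (reasoning-token budget, correct?) records; illustrative only.
toy_records = [
    (1024, True), (1024, False), (1024, False),
    (4096, True), (4096, True), (4096, False),
    (16384, True), (16384, True), (16384, True),
]
acc = accuracy_by_budget(toy_records)
print(acc)
print(is_nondecreasing(acc))
```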

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If language-based operations can mimic visual imagery results, then some human imagery tasks might also be solved propositionally under certain conditions.
  • The finding invites experiments that isolate whether LLMs are truly simulating spatial relations or simply exploiting statistical patterns in textual descriptions of shapes.
  • Designers of future AI systems could deliberately train or prompt models to use extended linguistic chains for tasks that currently rely on vision modules.

Load-bearing premise

That the chosen transformation tasks cannot be solved by language-based reasoning and instead require pictorial mental representations.

What would settle it

A controlled version of the same tasks in which language-based shortcuts are removed or blocked and LLM performance drops to or below human levels.

Figures

Figures reproduced from arXiv: 2509.23108 by Jorge Morales, Morgan McCarty.

Figure 1: One of the instruction sets introduced in Finke et al. (1989). Here, subjects are meant to recognize from the resulting mental image that the final imagined object looks like an umbrella. The instructions have been rewritten to be clearer both for prompting LLMs and for human understandability.
Figure 2: One of our new instruction sets demonstrating the slightly increased cognitive complexity and more ambiguous canonical form ("balloons", "flower bouquet", or "ice cream", among others). Note the use of two letters in the first step, the abstract reference to existing symbols and scenes, and the final shape not being determinable until the final step.
Figure 3: Performance results in humans and LLMs. Data show the proportion of the maximum possible score for all tested models. Only GPT-5, o3, and o3-Pro significantly surpass the human baseline. Error bars indicate 95% confidence intervals.
From the surrounding text: "… results of the 48-novel item subset and Supplemental Table S4 for results of the 12 items following Finke et al.). The only other models to perform non-significantly different from th…"
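Figure 3 reports scores as a proportion of the maximum possible score with 95% confidence intervals. The caption does not say how the intervals were computed; one common choice is a percentile bootstrap over participants, sketched below. The raw scores and the maximum of 10 are toy values for illustration only.

```python
import random
from statistics import mean


def proportion_of_max(raw_scores, max_score):
    """Rescale raw scores to the 0-1 'proportion of maximum possible score' scale."""
    return [s / max_score for s in raw_scores]


def bootstrap_ci(values, n_boot=10_000, alpha=0.05, seed=0):
    """Mean and percentile-bootstrap (1 - alpha) confidence interval."""
    rng = random.Random(seed)
    n = len(values)
    # Resample participants with replacement and record each resample's mean.
    boot_means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_boot))
    lo = boot_means[int(alpha / 2 * n_boot)]
    hi = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return mean(values), lo, hi


# Toy scores for illustration only; not data from the paper.
toy_raw = [6, 8, 4, 7, 5, 9, 6, 3]
m, lo, hi = bootstrap_ci(proportion_of_max(toy_raw, max_score=10))
print(f"mean={m:.3f}, 95% CI=[{lo:.3f}, {hi:.3f}]")
```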
Original abstract

Can visual imagery be driven solely by language? This idea goes against cognitive science's traditional view that visual mental imagery is only possible through pictorial representations. Large Language Models (LLMs) provide nascent evidence not only that visual mental imagery via propositional representations is possible, but that it can be more robust than human imagination. We created dozens of novel items for an extension to a classic task which is argued to be solvable exclusively via pictorial representations (i.e., language alone would be insufficient). Subjects were asked to imagine a series of compositional letter and shape transformations and identify the resultant "image". We found that the best LLMs performed significantly better than humans (n = 100 human participants, p < .0001), indicating the existence of an artificial phantasia, or emergent "visual" mental imagery that may not be pictorial. Furthermore, we tested reasoning models with variable reasoning-token allocation and found that models perform best with longer reasoning chains, demonstrating a linguistic impact on the task: language alone may be sufficient. We examined three emergent imagery hypotheses: pure propositional imagery, propositional imagery with visio-linguistic priors, or pictorial visual imagery (classical visual imagery). Our study not only presents evidence for a previously unreported emergent cognitive capacity of LLMs, but also reignites debate on the requirement for a pictorial format in mental imagery.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that LLMs exhibit an emergent capacity for 'visual' mental imagery, termed artificial phantasia, which may operate via propositional rather than pictorial representations. This is supported by superior LLM performance over 100 human participants (p < .0001) on novel compositional letter/shape transformation tasks argued to require pictorial representations, with further evidence that longer reasoning chains improve results, challenging the necessity of pictorial formats in mental imagery.

Significance. If substantiated, the result would provide empirical grounds to revisit the long-standing assumption in cognitive science that visual mental imagery necessitates pictorial representations, while highlighting LLMs' capacity for robust task performance through language alone. The inclusion of variable reasoning-token tests and explicit hypothesis examination (propositional, visio-linguistic, or pictorial) adds concrete data, though the interpretation depends on validating the task premise.

major comments (3)
  1. [Abstract and Methods] The interpretation of LLM superiority as evidence for non-pictorial phantasia rests on the premise (Abstract) that the classic task extended by the novel items 'is argued to be solvable exclusively via pictorial representations (i.e., language alone would be insufficient)'. No control experiments, verbal-strategy probes, or analyses demonstrating that propositional decomposition cannot reliably solve the items are reported, leaving open the possibility that humans underperform for reasons unrelated to imagery format (e.g., working memory or instruction compliance).
  2. [Results] The central performance claim (LLMs significantly better than humans, p < .0001) lacks reported details on exact task items, prompting protocols, controls for stochastic LLM output, or full statistical reporting (e.g., effect sizes, per-item breakdowns), which are required to evaluate whether the result is robust or sensitive to implementation choices.
  3. [Discussion] The three emergent imagery hypotheses (pure propositional, propositional with visio-linguistic priors, pictorial) are examined but without specific ablations or tests that would distinguish them; superior performance plus reasoning-length effects alone do not yet adjudicate between propositional sufficiency and any imagery format.
minor comments (2)
  1. [Introduction] Define 'artificial phantasia' with a concise operational contrast to classical pictorial imagery in the introduction to prevent terminological overlap.
  2. [Figures] Ensure every performance figure reports error bars or confidence intervals (Figure 3 already shows 95% confidence intervals) and that legends clearly distinguish human from LLM conditions.
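Major comment 2 asks for effect sizes alongside the p-value. A minimal sketch, assuming per-run or per-participant scores are available as plain lists: Cohen's d with a pooled standard deviation, plus a two-sided Monte Carlo permutation test on the difference in means. The paper's actual test is not stated here, the numbers are toy values, and treating repeated runs of a single model as independent samples is itself an assumption a referee might question.

```python
import random
from statistics import mean, variance


def cohens_d(a, b):
    """Standardized mean difference using a pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5


def permutation_p(a, b, n_perm=10_000, seed=0):
    """Two-sided Monte Carlo permutation test on the difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign group labels
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # The +1 terms count the observed labelling and keep p strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Toy values for illustration only; not the paper's data.
model_scores = [0.90, 0.85, 0.95, 0.88, 0.92]
human_scores = [0.60, 0.55, 0.70, 0.50, 0.65, 0.58]
d = cohens_d(model_scores, human_scores)
p = permutation_p(model_scores, human_scores)
print(f"d={d:.2f}, p={p:.4f}")
```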

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has identified important areas for clarification and strengthening in our manuscript. We respond to each major comment below and indicate the revisions we will make.

Point-by-point responses
  1. Referee: [Abstract and Methods] The interpretation of LLM superiority as evidence for non-pictorial phantasia rests on the premise (Abstract) that the classic task extended by the novel items 'is argued to be solvable exclusively via pictorial representations (i.e., language alone would be insufficient)'. No control experiments, verbal-strategy probes, or analyses demonstrating that propositional decomposition cannot reliably solve the items are reported, leaving open the possibility that humans underperform for reasons unrelated to imagery format (e.g., working memory or instruction compliance).

    Authors: We accept this critique as valid. The task premise draws on established cognitive science literature on mental imagery transformations, in particular Finke, Pinker, and Farah (1989) on reinterpreting visual patterns in mental imagery, but we did not include dedicated verbal-strategy probes or controls to isolate propositional solvability. In revision we will expand the Methods and Discussion sections with explicit references to the supporting literature, add a dedicated limitations paragraph addressing alternative explanations such as working memory load and instruction compliance, and note that future work could incorporate verbal probes. This constitutes a partial revision because new empirical controls cannot be added retroactively without additional data collection. revision: partial

  2. Referee: [Results] The central performance claim (LLMs significantly better than humans, p < .0001) lacks reported details on exact task items, prompting protocols, controls for stochastic LLM output, or full statistical reporting (e.g., effect sizes, per-item breakdowns), which are required to evaluate whether the result is robust or sensitive to implementation choices.

    Authors: We agree that greater transparency is required. The revised manuscript will include the complete set of task items and examples in the supplementary materials, a detailed account of prompting protocols (including system prompts and temperature settings), and explicit controls for stochasticity (multiple runs with fixed seeds where applicable). We will also report effect sizes, confidence intervals, and per-item accuracy breakdowns alongside the existing p-value to allow full assessment of robustness. revision: yes

  3. Referee: [Discussion] The three emergent imagery hypotheses (pure propositional, propositional with visio-linguistic priors, pictorial) are examined but without specific ablations or tests that would distinguish them; superior performance plus reasoning-length effects alone do not yet adjudicate between propositional sufficiency and any imagery format.

    Authors: We acknowledge that the current evidence, while suggestive, does not fully adjudicate among the three hypotheses. The observed benefit of longer reasoning chains supports propositional sufficiency but cannot rule out contributions from visio-linguistic priors or latent pictorial mechanisms. In the revised Discussion we will more explicitly delineate these limitations, clarify that our primary claim concerns the sufficiency of language-based processing, and outline targeted future experiments (e.g., visual-priming ablations and non-visual control tasks) that could distinguish the hypotheses. This will be a partial revision focused on interpretive framing rather than new empirical tests. revision: partial
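The rebuttal to comment 2 promises repeated runs and per-item breakdowns. One way to summarize such runs, sketched under the assumption that each item's outcomes are stored as a list of correct/incorrect flags: average within each item first (so every item carries equal weight), then across items, and flag items whose outcome changes from run to run. Item IDs and outcomes are hypothetical.

```python
from statistics import mean


def summarize_runs(runs):
    """runs maps item_id -> list of booleans (one per repeated run).

    Returns per-item accuracy, the item-weighted overall accuracy,
    and the items whose outcome varied across runs.
    """
    per_item = {item: mean(map(float, outcomes)) for item, outcomes in runs.items()}
    overall = mean(per_item.values())
    unstable = sorted(item for item, acc in per_item.items() if 0.0 < acc < 1.0)
    return per_item, overall, unstable


# Hypothetical item IDs and outcomes from three runs each.
toy_runs = {
    "item01": [True, True, True],
    "item02": [True, False, True],
    "item03": [False, False, False],
}
per_item, overall, unstable = summarize_runs(toy_runs)
print(per_item, round(overall, 3), unstable)
```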

Circularity Check

0 steps flagged

No circularity; central claim rests on direct empirical comparison

Full rationale

The paper's derivation consists of creating novel task items, administering them to LLMs and 100 human participants, and reporting a statistically significant performance advantage for the best LLMs (p < .0001). This result is obtained from experimental data rather than any equation, fitted parameter, or self-citation that reduces the outcome to its own inputs by construction. The interpretation linking superior LLM performance to non-pictorial 'phantasia' draws on the classic task's prior literature and the observed benefit of longer reasoning chains, but introduces no self-definitional loop, fitted-input prediction, or load-bearing self-citation chain. The central comparison is therefore anchored in external data (the human baseline) rather than in the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the domain assumption that the selected tasks cannot be solved by propositional means alone and on the interpretive step that superior LLM performance demonstrates emergent mental imagery.

axioms (1)
  • domain assumption: The classic task is solvable exclusively via pictorial representations and language alone would be insufficient.
    Invoked in the abstract when describing why the task extension tests for non-pictorial imagery.
invented entities (1)
  • artificial phantasia (no independent evidence)
    purpose: Label for the observed emergent visual mental imagery capacity in LLMs.
    New term introduced to describe the phenomenon; no independent falsifiable prediction is provided.

pith-pipeline@v0.9.0 · 5765 in / 1179 out tokens · 72687 ms · 2026-05-21T21:37:14.046367+00:00 · methodology

Reference graph

Works this paper leans on

90 extracted references

[1] Wilma A. Bainbridge, Zoë Pounder, Alison F. Eardley, and Chris I. Baker. Quantifying aphantasia through drawing: Those without visual imagery show deficits in object but not spatial memory. Cortex, 135:159–172, 2021. doi:10.1016/j.cortex.2020.11.014

[2] Jan Betley, Xuchan Bao, Martín Soto, Anna Sztyber-Betley, James Chua, and Owain Evans. Tell me about yourself: LLMs are aware of their learned behaviors, 2025. URL https://arxiv.org/abs/2501.11120

[3] Eric J. Bigelow, John P. McCoy, and Tomer D. Ullman. Non-commitment in mental imagery. Cognition, 238:105498, 2023.

[4] Ned Block. The Border between Seeing and Thinking. Oxford University Press, 2023.

[5] Andrea Blomkvist. Aphantasia: In search of a theory. Mind & Language, 38(3):866–888, 2023. doi:10.1111/mila.12432

[6] Alessia Bocchi, Marika Carrieri, Stefania Lancia, Valentina Quaresima, and Laura Piccardi. The key of the maze: The role of mental imagery and cognitive flexibility in navigational planning. Neuroscience Letters, 651:146–150, 2017. doi:10.1016/j.neulet.2017.05.009

[7] Faeze Brahman, Sachin Kumar, Vidhisha Balachandran, Pradeep Dasigi, Valentina Pyatkin, Abhilasha Ravichander, Sarah Wiegreffe, Nouha Dziri, Khyathi Chandu, Jack Hessel, Yulia Tsvetkov, Noah A. Smith, Yejin Choi, and Hannaneh Hajishirzi. The art of saying no: Contextual noncompliance in language models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. P...

[8] Patrick Byrne, Suzanna Becker, and Neil Burgess. Remembering the past and imagining the future: a neural model of spatial memory and imagery. Psychological Review, 114(2):340, 2007.

[9] Zirui Chen and Michael F. Bonner. Universal dimensions of visual representation. Science Advances, 11(27):eadw7697, 2025. doi:10.1126/sciadv.adw7697

[10] Francois Chollet, Mike Knoop, Gregory Kamradt, and Bryan Landers. ARC Prize 2024: Technical report, 2025. URL https://arxiv.org/abs/2412.04604

[11] C.J. Dance, A. Ipser, and J. Simner. The prevalence of aphantasia (imagery weakness) in the general population. Consciousness and Cognition, 97:103243, 2022. doi:10.1016/j.concog.2021.103243

[12] Alexei J. Dawes, Rebecca Keogh, Sarah Robuck, and Joel Pearson. Memories with a blind mind: Remembering the past and imagining the future with aphantasia. Cognition, 227:105192, 2022. doi:10.1016/j.cognition.2022.105192

[13] Dorottya Demszky, Diyi Yang, David S. Yeager, Christopher J. Bryan, Margarett Clapper, Susannah Chandhok, Johannes C. Eichstaedt, Cameron Hecht, Jeremy Jamieson, Meghann Johnson, et al. Using large language models in psychology. Nature Reviews Psychology, 2(11):688–701, 2023.

[14] Nadine Dijkstra, Sander E. Bosch, and Marcel A. J. van Gerven. Shared neural mechanisms of visual perception and imagery. Trends in Cognitive Sciences, 23(5):423–434, 2019.

[15] Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Tianyu Liu, Baobao Chang, Xu Sun, Lei Li, and Zhifang Sui. A survey on in-context learning, 2024. URL https://arxiv.org/abs/2301.00234

[16] Martha J. Farah. Is visual imagery really visual? Overlooked evidence from neuropsychology. Psychological Review, 95(3):307, 1988.

[17] Jeanne Farrington. Seven plus or minus two. Performance Improvement Quarterly, 23(4):113–116, 2011. doi:10.1002/piq.20099

[18] Bill Faw. Conflicting intuitions may be based on differing abilities: Evidence from mental imaging research. Journal of Consciousness Studies, 16:45–68, 2009.

[19] Ronald Finke. Creative Imagery: Discoveries and Inventions in Visualization. Psychology Press, 1990.

[20] Ronald A. Finke, Steven Pinker, and Martha J. Farah. Reinterpreting visual patterns in mental imagery. Cognitive Science, 13(1):51–78, 1989.

[21] Michael C. Frank and Noah D. Goodman. Cognitive modeling using artificial intelligence. Annual Review of Psychology, 2025. doi:10.1146/annurev-psych-030625-040748

[22] Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. ImageBind: One embedding space to bind them all. arXiv, 2023. doi:10.48550/arxiv.2305.05665

[23] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021. URL https://arxiv.org/abs/2009.03300

[24] Emily A. Holmes and Andrew Mathews. Mental imagery in emotion and emotional disorders. Clinical Psychology Review, 30(3):349–362, 2010.

[25] Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2I-CompBench: A comprehensive benchmark for open-world compositional text-to-image generation. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 78723–78747. Curran Associates, Inc., 2023. URL http...

[26] Anna A. Ivanova, Aalok Sathe, Benjamin Lipkin, Unnathi Kumar, Setayesh Radkani, Thomas H. Clark, Carina Kauf, Jennifer Hu, R. T. Pramod, Gabriel Grand, Vivian Paulun, Maria Ryskina, Ekin Akyürek, Ethan Wilcox, Nafisa Rashid, Leshem Choshen, Roger Levy, Evelina Fedorenko, Joshua Tenenbaum, and Jacob Andreas. Elements of world knowledge (EWoK): A cognition-...

[27] Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kravec, et al. Language models (mostly) know what they know.

[28] Lachlan Kay, Rebecca Keogh, and Joel Pearson. Slower but more accurate mental rotation performance in aphantasia linked to differences in cognitive strategies. Consciousness and Cognition, 121:103694, 2024. doi:10.1016/j.concog.2024.103694

[29] Rebecca Keogh, Marcus Wicken, and Joel Pearson. Visual working memory in aphantasia: Retained accuracy and capacity with a different strategy. Cortex, 143:237–253, 2021. doi:10.1016/j.cortex.2021.07.012

[30] Mohammad Abdullah Matin Khan, M Saiful Bari, Xuan Long Do, Weishi Wang, Md Rizwan Parvez, and Shafiq Joty. xCodeEval: A large scale multilingual multitask benchmark for code understanding, generation, translation and retrieval, 2023. URL https://arxiv.org/abs/2303.03004

[31] Douwe Kiela and Léon Bottou. Learning image embeddings using convolutional neural networks for improved multi-modal semantics. In Alessandro Moschitti, Bo Pang, and Walter Daelemans (eds.), Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 36–45, Doha, Qatar, October 2014. Association for Computat...

[32] Donghyun Kim, Kuniaki Saito, Kate Saenko, Stan Sclaroff, and Bryan Plummer. MULE: Multimodal universal language embedding. Proceedings of the AAAI Conference on Artificial Intelligence, 34(07):11254–11261, 2020. doi:10.1609/aaai.v34i07.6785

[33] Najoung Kim and Tal Linzen. COGS: A compositional generalization challenge based on semantic interpretation. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 9087–9105, Online, November 2020. Association for Computational Linguistics. doi:10...

[34] Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328, 2018. doi:10.1162/tacl_a_00023

[35] Stephen M. Kosslyn. Image and Brain: The Resolution of the Imagery Debate. MIT Press, 1996.

[36] Stephen Michael Kosslyn. Scanning visual images: Some structural implications. Perception & Psychophysics, 14(1):90–94, 1973.

[37] Kristina Krasich, Kevin O'Neill, and Felipe De Brigard. Looking at mental images: Eye-tracking mental simulation during retrospective causal judgment. Cognitive Science, 48(3):e13426, 2024. doi:10.1111/cogs.13426

[38] A. J. Larner, A. P. Leff, and P. C. Nachev. Phantasia, aphantasia, and hyperphantasia: Empirical data and conceptual considerations. Neuroscience & Biobehavioral Reviews, 164:105819, 2024. doi:10.1016/j.neubiorev.2024.105819

[39] Alex Lawsen. Comment on the illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity, 2025. URL https://arxiv.org/abs/2506.09250

[40] Florent Lebon. Revisiting the mental imagery debate: New evidence from aphantasia and neuroimaging, Sep 2025. URL osf.io/preprints/psyarxiv/cfh85_v1

[41] Anna Leshinskaya, Taylor Webb, Ellie Pavlick, Jiahai Feng, Gustaw Opielka, Claire Stevenson, and Idan A. Blank. Cognitively inspired interpretability in large neural networks. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 47, 2025.

[42] Yuan Li, Yue Huang, Hongyi Wang, Xiangliang Zhang, James Zou, and Lichao Sun. Quantifying AI psychology: A psychometrics benchmark for large language models, 2024. URL https://arxiv.org/abs/2406.17675

[43] Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B. Tenenbaum. Compositional visual generation with composable diffusion models. In Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner (eds.), Computer Vision – ECCV 2022, pp. 423–439, Cham, 2022. Springer Nature Switzerland.

[44] Ryan Liu, Jiayi Geng, Addison J. Wu, Ilia Sucholutsky, Tania Lombrozo, and Thomas L. Griffiths. Mind your step (by step): Chain-of-thought can reduce performance on tasks where thinking makes humans worse, 2025. URL https://arxiv.org/abs/2410.21333

[45] Tania Lombrozo. Learning by thinking in natural and artificial minds. Trends in Cognitive Sciences, 28:1011–1022, 2024.

[46] Joel J. Lorenzatti. Aphantasia: a philosophical approach. Philosophical Psychology, 38(4):1476–1504, 2025. doi:10.1080/09515089.2023.2253854

[47] Andrey Malinin and Mark Gales. Uncertainty estimation in autoregressive structured prediction, 2021. URL https://arxiv.org/abs/2002.07650

[48] David F. Marks. Visual imagery differences in the recall of pictures. British Journal of Psychology, 64(1):17–24, 1973. doi:10.1111/j.2044-8295.1973.tb01322.x

[49] R. Thomas McCoy, Tal Linzen, Ewan Dunbar, and Paul Smolensky. RNNs implicitly implement tensor product representations, 2019. URL https://arxiv.org/abs/1812.08718

[50] Sam Whitman McGrath, Jacob Russin, Ellie Pavlick, and Roman Feiman. How can deep neural networks inform theory in psychological science? Current Directions in Psychological Science, 33(5):325–333, 2024. doi:10.1177/09637214241268098

  51. [51]

    Aphantasia as imagery blindsight

    Matthias Michel, Jorge Morales, Ned Block, and Hakwan Lau. Aphantasia as imagery blindsight. Trends in Cognitive Sciences, 29 0 (1): 0 8--9, 2025. doi:10.1016/j.tics.2024.11.002

  52. [53]

    Bence Nanay. Unconscious mental imagery. Philosophical Transactions of the Royal Society B: Biological Sciences, 376(1817): 20190689, 2021. doi:10.1098/rstb.2019.0689. URL https://royalsocietypublishing.org/doi/abs/10.1098/rstb.2019.0689

  53. [54]

    Thomas Naselaris, Cheryl A. Olman, Dustin E. Stansbury, Kamil Ugurbil, and Jack L. Gallant. A voxel-wise encoding model for early visual areas decodes mental images of remembered scenes. NeuroImage, 105: 215–228, 2015. ISSN 1053-8119. doi:10.1016/j.neuroimage.2014.10.018. URL https://www.sciencedirect.com/science/article/pii/S1053811914008428

  54. [55]

    Richard E. Nisbett and Timothy D. Wilson. Telling more than we can know: Verbal reports on mental processes. Psychological Review, 84(3): 231–259, 1977. doi:10.1037/0033-295X.84.3.231

  55. [56]

    Daniela J Palombo, Signy Sheldon, and Brian Levine. Individual differences in autobiographical memory. Trends in Cognitive Sciences, 22(7): 583–597, 2018

  56. [57]

    Roma Patel and Ellie Pavlick. Mapping language models to grounded conceptual spaces. In International Conference on Learning Representations, 2022. URL https://openreview.net/pdf?id=gJcEM8sxHK

  57. [58]

    Ellie Pavlick. Symbols and grounding in large language models. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 381(2251): 20220041, 2023. doi:10.1098/rsta.2022.0041. URL https://royalsocietypublishing.org/doi/abs/10.1098/rsta.2022.0041

  58. [59]

    Joel Pearson and Stephen M. Kosslyn. The heterogeneity of mental representation: Ending the imagery debate. Proceedings of the National Academy of Sciences, 112(33): 10089–10092, 2015. doi:10.1073/pnas.1504933112. URL https://www.pnas.org/doi/abs/10.1073/pnas.1504933112

  59. [60]

    Joel Pearson, Thomas Naselaris, Emily A Holmes, and Stephen M Kosslyn. Mental imagery: functional mechanisms and clinical applications. Trends in Cognitive Sciences, 19(10): 590–602, 2015

  60. [61]

    Ian B. Phillips. Aphantasia reimagined. Noûs, pp. 1–25, 2025. doi:10.1111/nous.12551

  61. [62]

    Steven T Piantadosi, Dyana CY Muller, Joshua S Rule, Karthikeya Kaushik, Mark Gorenstein, Elena R Leib, and Emily Sanford. Why concepts are (probably) vectors. Trends in Cognitive Sciences, 28(9): 844–856, 2024

  62. [63]

    Dillon Plunkett, Adam Morris, Keerthi Reddy, and Jorge Morales. Self-interpretability: LLMs can describe complex internal processes that drive their decisions, and improve with training, 2025. URL https://arxiv.org/abs/2505.17120

  63. [64]

    Zoë Pounder, Jane Jacob, Samuel Evans, Catherine Loveday, Alison F. Eardley, and Juha Silvanto. Only minimal differences between individuals with congenital aphantasia and those with typical imagery on neuropsychological tasks that involve imagery. Cortex, 148: 180–192, 2022. doi:10.1037/h0043158

  64. [65]

    Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models, 2023. URL https://arxiv.org/abs/2210.03350

  65. [66]

    Zenon W Pylyshyn. What the mind's eye tells the mind's brain: A critique of mental imagery. Psychological Bulletin, 80(1): 1, 1973

  66. [67]

    Zenon W. Pylyshyn. Mental imagery: In search of a theory. Behavioral and Brain Sciences, 25(2): 157–182, 2002. doi:10.1017/S0140525X02000043

  67. [68]

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. URL https://arxiv.org/abs/2103.00020

  68. [69]

    Santhosh Kumar Ramakrishnan, Erik Wijmans, Philipp Kraehenbuehl, and Vladlen Koltun. Does spatial cognition emerge in frontier models?, 2025. URL https://arxiv.org/abs/2410.06468

  69. [70]

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark, 2023. URL https://arxiv.org/abs/2311.12022

  70. [71]

    Matthew Renze and Erhan Guven. The effect of sampling temperature on problem solving in large language models. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 7346–7356, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi:10.18653/v1/20...

  71. [72]

    Oscar Sainz, Jon Ander Campos, Iker García-Ferrero, Julen Etxaniz, Oier Lopez de Lacalle, and Eneko Agirre. NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark, 2023. URL https://arxiv.org/abs/2310.18018

  72. [73]

    Roger N. Shepard and Jacqueline Metzler. Mental rotation of three-dimensional objects. Science, 171(3972): 701–703, 1971. doi:10.1126/science.171.3972.701. URL https://www.science.org/doi/abs/10.1126/science.171.3972.701

  73. [74]

    Richard Shiffrin and Melanie Mitchell. Probing the psychology of AI models. Proceedings of the National Academy of Sciences, 120(10): e2300963120, 2023. doi:10.1073/pnas.2300963120. URL https://www.pnas.org/doi/abs/10.1073/pnas.2300963120

  74. [75]

    Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar. The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity, 2025. URL https://arxiv.org/abs/2506.06941

  75. [76]

    Lu Teng. Conscious schematic imagery in aphantasia. Unpublished manuscript, 2025

  76. [77]

    Alan M Turing. Computing machinery and intelligence. Mind, 59(236), 1950. ISSN 0026-4423

  77. [78]

    Yiming Wang, Ziyang Zhang, Hanwei Chen, and Huayi Shen. Reasoning with large language models on graph tasks: The influence of temperature. In 2024 5th International Conference on Computer Engineering and Application (ICCEA), pp. 630–634, 2024a. doi:10.1109/ICCEA62105.2024.10603677

  78. [79]

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J...

  79. [80]

    Mark E. Wheeler, Steven E. Petersen, and Randy L. Buckner. Memory's echo: Vivid remembering reactivates sensory-specific cortex. Proceedings of the National Academy of Sciences, 97(20): 11125–11129, 2000. doi:10.1073/pnas.97.20.11125. URL https://www.pnas.org/doi/abs/10.1073/pnas.97.20.11125

  80. [81]

    David J. Wright, Matthew W. Scott, Sarah N. Kraeutner, Pamela Barhoun, Maurizio Bertollo, Mark J. Campbell, Baptiste M. Waltzing, Stephan F. Dahm, Maaike Esselaar, Cornelia Frank, Robert M. Hardwick, Ian Fuelscher, Ben Marshall, Nicola J. Hodges, Christian Hyde, and Paul S. Holmes. An international estimate of the prevalence of differing visual imagery ab...