arxiv: 2605.24585 · v1 · pith:IWUUTII7new · submitted 2026-05-23 · 💻 cs.CL · q-bio.NC

Word Class Representations Spontaneously Emerge from Successor Representations Trained on Natural Language

Pith reviewed 2026-06-30 13:43 UTC · model grok-4.3

classification 💻 cs.CL q-bio.NC
keywords successor representations · part-of-speech categories · emergent structure · predictive horizons · unsupervised clustering · natural language · reinforcement learning · word embeddings

The pith

Successor representations trained on word sequences spontaneously organize into separable part-of-speech clusters without linguistic labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper trains a residual network on WikiText-103 to learn successor representations that encode expected discounted future word distributions across multiple horizons. It demonstrates that the resulting vector space develops a geometric organization in which nouns, verbs, and adjectives form distinct clusters recoverable by unsupervised methods. This organization is strongest at short horizons and incorporates broader contextual information at longer horizons. The work indicates that syntactic categories can arise as a direct consequence of optimizing predictive sequence structure rather than through explicit supervision. A sympathetic reader would care because the result links a reinforcement-learning objective to the spontaneous emergence of linguistic structure from raw text statistics.

Core claim

Optimizing successor representations as probability distributions via KL divergence on raw word sequences from WikiText-103 produces embeddings whose geometry aligns with part-of-speech categories. Nouns, verbs, and adjectives become separable and recoverable through unsupervised clustering, with short predictive horizons yielding the strongest syntactic separation and longer horizons integrating additional semantic information. At finer scales, coherent lexical subclasses appear within the major categories.
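
As a rough illustration of the objective named here, the following Python sketch computes a KL divergence between a target future-word distribution and a softmax output. The direction KL(target || predicted), the three-word toy vocabulary, and the function names are assumptions made for illustration; the abstract does not specify them.

```python
import math

def softmax(logits):
    """Convert raw scores to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) in nats; entries with zero target mass contribute 0."""
    return sum(
        p * math.log(p / max(q, eps))
        for p, q in zip(target, predicted)
        if p > 0
    )

# Hypothetical 3-word vocabulary: target successor distribution vs. model output.
target = [0.7, 0.2, 0.1]
predicted = softmax([2.0, 0.5, 0.0])
loss = kl_divergence(target, predicted)
```

The loss is zero only when the predicted distribution matches the target, which is why it can serve as a training signal for distribution-valued outputs.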

What carries the argument

Successor representations that model the expected discounted distribution of future words, trained by matching predicted future occupancy to observed sequences across varying horizons.
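
One way to picture such a target is the sketch below, which builds an empirical discounted future-word distribution for each word type in a token list. The aggregation per word type, the weighting gamma**(k-1), the hard cutoff `horizon`, and the name `successor_targets` are all assumptions for illustration, not details taken from the paper.

```python
from collections import defaultdict

def successor_targets(tokens, gamma=0.5, horizon=2):
    """Empirical discounted future-word distribution for each word type.

    For every occurrence of a word at position t, the word at t+k receives
    weight gamma**(k-1) for k = 1..horizon. Weights are summed over all
    occurrences and normalised so each row is a probability distribution.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for t, word in enumerate(tokens):
        for k in range(1, horizon + 1):
            if t + k >= len(tokens):
                break  # ran off the end of the corpus
            counts[word][tokens[t + k]] += gamma ** (k - 1)
    targets = {}
    for word, row in counts.items():
        total = sum(row.values())
        targets[word] = {nxt: c / total for nxt, c in row.items()}
    return targets

# Toy corpus, purely illustrative.
tokens = "the cat sat on the mat".split()
sr = successor_targets(tokens, gamma=0.5, horizon=2)
```

In this toy corpus, `sr["the"]` places weight 0.4 on "cat", 0.2 on "sat", and 0.4 on "mat". Smaller values of `gamma` concentrate the target on the immediately following word.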

If this is right

  • Syntactic categories can arise from modeling long-range transition statistics without any explicit category supervision.
  • The balance between syntactic and semantic information in the learned space is controlled by the length of the predictive horizon.
  • Coherent subclasses within major word categories become visible when the representations are examined at higher resolution.
  • The successor-representation objective supplies a concrete mechanism that connects reinforcement learning to the discovery of linguistic word classes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same objective applied to other sequential domains such as music or biological sequences could produce analogous category-like structures.
  • Human syntactic acquisition might exploit successor-like computations to bootstrap word-class distinctions from predictive experience.
  • Testing the method on corpora from additional languages would indicate whether the observed organization depends on language-specific statistics.
  • Varying network architecture while keeping the successor objective fixed could isolate whether emergence requires the residual design used here.

Load-bearing premise

The vector geometry produced by successor-representation training on raw text aligns with human-defined part-of-speech categories closely enough for unsupervised clustering to recover those categories.
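
To make "unsupervised clustering" concrete, here is a minimal sketch of Lloyd's k-means in plain Python. It is a generic textbook version; the paper's actual clustering procedure, initialisation, and dimensionality are not stated in the abstract, and the six 2-D points stand in for hypothetical embeddings.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm on lists of floats; returns one cluster id per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Two well-separated toy groups standing in for hypothetical word embeddings.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels = kmeans(points, k=2)
```

Cluster ids are arbitrary, so any comparison against part-of-speech tags has to use a label-permutation-invariant score rather than raw label equality.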

What would settle it

If k-means clustering on the trained embeddings shows no greater correspondence to gold-standard part-of-speech tags than clustering on embeddings from a model trained on shuffled text, the claim that syntactic structure emerges from the successor objective would be falsified.
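
The control in this test can be sketched as follows: shuffle the corpus so that word frequencies are kept but sequential structure is destroyed, then compare clustering scores from the two conditions. The function names and the strict decision rule are hypothetical, and the score lists in the usage lines are placeholders, not results.

```python
import random
from collections import Counter

def shuffled_control(tokens, seed=0):
    """Return a copy of the corpus with token order randomised.

    Unigram frequencies are preserved exactly; word order is destroyed.
    """
    rng = random.Random(seed)
    control = list(tokens)
    rng.shuffle(control)
    return control

def emergence_supported(real_scores, control_scores):
    """Crude rule: every real-text score must beat every shuffled-text score."""
    return min(real_scores) > max(control_scores)

tokens = "the cat sat on the mat and the dog sat too".split()
control = shuffled_control(tokens, seed=1)
same_counts = Counter(control) == Counter(tokens)

# Placeholder scores (e.g. one agreement score per training run), not real data.
verdict = emergence_supported([0.41, 0.38, 0.44], [0.02, 0.01, 0.03])
```

A stricter analysis would replace the min/max rule with a statistical test over many runs; the rule above only illustrates the shape of the comparison.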

Original abstract

Language models are typically trained to predict the next token in a sequence. Here, we explore an alternative predictive principle from reinforcement learning: Successor Representations (SRs), which model the expected discounted distribution of future states rather than the immediate next state. We transfer this framework to natural language and train neural networks to predict future word distributions across multiple temporal horizons, thereby learning representations of long-range transition structure. We train a deep residual neural network on WikiText-103 (103 million tokens; 20,000-word vocabulary) and optimize successor representations as probability distributions using KL divergence. Without explicit linguistic supervision, structured language representations emerge spontaneously. After training, the learned space develops a clear geometric organization with respect to part-of-speech (POS) categories: nouns, verbs, and adjectives become separable and recoverable through unsupervised clustering. This organization depends systematically on predictive horizon, with short horizons producing the strongest syntactic structure and longer horizons increasingly integrating broader contextual and semantic information. At finer resolutions, additional interpretable lexical substructure emerges, revealing coherent subclasses within major word categories. These findings suggest that syntactic categories need not be explicitly encoded but may arise as a consequence of predictive sequence learning. To our knowledge, this work provides the first systematic application of successor representations to natural language and establishes a conceptual bridge between reinforcement learning, linguistics, and cognitive neuroscience.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that successor representations (SRs), which predict discounted future state distributions rather than next-token probabilities, when trained via KL divergence on raw word sequences from WikiText-103 using a deep residual network, spontaneously produce vector spaces whose geometry aligns with human part-of-speech categories. Nouns, verbs, and adjectives become separable and recoverable by unsupervised clustering, with short predictive horizons yielding strongest syntactic structure and longer horizons incorporating semantic information; finer-grained substructure within categories also emerges. This is presented as evidence that syntactic categories can arise as a byproduct of predictive sequence learning without explicit linguistic supervision.

Significance. If the empirical results hold under quantitative scrutiny, the work would provide the first systematic bridge between successor representations from reinforcement learning and the spontaneous emergence of linguistic structure, with implications for cognitive neuroscience models of language and for unsupervised representation learning. The dependence on horizon offers a tunable mechanism linking short-range syntax to longer-range semantics.

major comments (2)
  1. [Abstract] Abstract: the claim that POS categories 'become separable and recoverable through unsupervised clustering' is presented without any quantitative clustering metrics (e.g., adjusted Rand index, normalized mutual information, or silhouette scores), baseline comparisons, or error bars. This absence is load-bearing for the central emergence claim, as it leaves open whether the observed geometry exceeds statistical artifacts of the corpus or network.
  2. [Abstract] Abstract / training description: the systematic dependence on predictive horizon is asserted, yet no specific horizon values, discount-factor settings, ablation results across horizons, or sensitivity analysis are reported. Because horizons and the discount factor are the explicit free parameters controlling the SR objective, their omission prevents evaluation of whether the reported syntactic-to-semantic transition is robust or an artifact of particular choices.
minor comments (1)
  1. [Abstract] Abstract: the vocabulary size (20,000 words) and corpus (WikiText-103, 103 million tokens) are stated without indicating how the vocabulary was constructed or whether out-of-vocabulary handling affects the clustering results.
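
For reference, one of the metrics requested above, the adjusted Rand index, can be computed from two label lists as in this sketch. It follows the standard Hubert and Arabie formula; the function name and the toy labels are illustrative, and nothing here reflects the paper's own evaluation code.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two partitions of the same items."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to disagree on
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster on both sides
    return (sum_cells - expected) / (max_index - expected)

# Toy example: gold tags vs. cluster ids (cluster ids are arbitrary).
gold = ["N", "N", "V", "V"]
perfect = adjusted_rand_index(gold, [1, 1, 0, 0])
partial = adjusted_rand_index(gold, [0, 0, 1, 2])
```

The index is 1.0 for identical partitions regardless of how clusters are numbered and is close to 0 for random assignments, which is what makes it suitable for comparing against a shuffled-text baseline.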

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the need for greater quantitative precision in the abstract. We address each point below and will revise the abstract accordingly to strengthen the presentation of our results while preserving the manuscript's core claims.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that POS categories 'become separable and recoverable through unsupervised clustering' is presented without any quantitative clustering metrics (e.g., adjusted Rand index, normalized mutual information, or silhouette scores), baseline comparisons, or error bars. This absence is load-bearing for the central emergence claim, as it leaves open whether the observed geometry exceeds statistical artifacts of the corpus or network.

    Authors: We agree that the abstract would benefit from explicit quantitative support for the clustering claim. The full manuscript demonstrates separability via t-SNE visualizations and unsupervised clustering that recovers POS categories above chance levels, with comparisons to random and shuffled baselines. To address the concern directly, we will revise the abstract to reference these quantitative evaluations (including clustering metrics and error bars from multiple runs) reported in the results section. This makes the emergence claim more precise without altering the underlying findings.

    revision: yes

  2. Referee: [Abstract] Abstract / training description: the systematic dependence on predictive horizon is asserted, yet no specific horizon values, discount-factor settings, ablation results across horizons, or sensitivity analysis are reported. Because horizons and the discount factor are the explicit free parameters controlling the SR objective, their omission prevents evaluation of whether the reported syntactic-to-semantic transition is robust or an artifact of particular choices.

    Authors: We concur that the abstract should specify the horizon and discount parameters to allow evaluation of the reported transition. The manuscript presents results across a range of horizons in the main text and supplementary figures, documenting the shift from syntactic to semantic structure. We will update the abstract to include the specific horizon values (e.g., short vs. long horizons) and discount factor used, along with reference to the ablation analyses. This revision clarifies the dependence on these parameters.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper reports an empirical training experiment: a residual network is optimized on WikiText-103 to minimize KL divergence between predicted and observed successor distributions over multiple horizons. The reported outcome is that unsupervised clustering on the resulting embeddings recovers POS categories. This is a post-training observational result, not a closed-form derivation whose output is definitionally identical to its inputs. No equations reduce a claimed prediction to a fitted parameter by construction, no uniqueness theorem is imported via self-citation, and no ansatz is smuggled through prior work. The geometry-to-POS alignment is an external, falsifiable measurement against human-annotated tags and therefore lies outside any self-referential loop.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

Review is based solely on the abstract; the ledger therefore reflects only the high-level assumptions visible there. The central claim depends on the unstated details of network training, horizon selection, and clustering evaluation.

free parameters (2)
  • prediction horizons
    Multiple temporal horizons are used; their specific values and selection method are not stated in the abstract.
  • discount factor
    Successor representations typically require a discount parameter; its value is not mentioned.
axioms (1)
  • domain assumption: Optimizing a neural network with KL divergence on successor distributions produces representations whose geometry reflects syntactic categories.
    The training procedure assumes this optimization yields linguistically meaningful structure.
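
To show why the discount factor matters as a free parameter, the sketch below relates it to how far ahead the objective effectively looks. It assumes untruncated geometric weights proportional to gamma**(k-1) on the k-th following token, which is the standard successor-representation form but is not confirmed by the abstract; the function names and example values are illustrative.

```python
def effective_horizon(gamma):
    """Mean look-ahead distance in tokens under weights proportional to gamma**(k-1)."""
    return 1.0 / (1.0 - gamma)

def mass_within(gamma, k):
    """Fraction of the total discount weight that falls on the next k tokens."""
    return 1.0 - gamma ** k

# Illustrative values only: a small and a large discount factor.
short_mean = effective_horizon(0.2)   # about 1.25 tokens
long_mean = effective_horizon(0.9)    # about 10 tokens
short_next_word_share = mass_within(0.2, 1)  # about 0.8
```

Under this assumption a small discount factor puts most of the weight on the very next word, while a large one spreads it over many following words, which matches the qualitative short-horizon versus long-horizon contrast described in the abstract.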

pith-pipeline@v0.9.1-grok · 5778 in / 1348 out tokens · 55834 ms · 2026-06-30T13:43:18.773093+00:00 · methodology


Reference graph

Works this paper leans on

65 extracted references · 6 canonical work pages · 5 internal anchors

  1. [1]

    by richard’s sutton

    Barto, A.G.: Reinforcement learning: An introduction. by richard’s sutton. SIAM Rev6(2), 423 (2021) 32

  2. [2]

    Playing Atari with Deep Reinforcement Learning

    Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M.: Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013)

  3. [3]

    Neural computation5(4), 613–624 (1993)

    Dayan, P.: Improving generalization for temporal difference learning: The succes- sor representation. Neural computation5(4), 613–624 (1993)

  4. [4]

    Nature neuroscience20(11), 1643–1653 (2017)

    Stachenfeld, K.L., Botvinick, M.M., Gershman, S.J.: The hippocampus as a predictive map. Nature neuroscience20(11), 1643–1653 (2017)

  5. [5]

    Journal of Neuroscience38(33), 7193–7200 (2018)

    Gershman, S.J.: The successor representation: its computational logic and neural substrates. Journal of Neuroscience38(33), 7193–7200 (2018)

  6. [6]

    Advances in neural information processing systems30(2017)

    Barreto, A., Dabney, W., Munos, R., Hunt, J.J., Schaul, T., Hasselt, H.P., Silver, D.: Successor features for transfer in reinforcement learning. Advances in neural information processing systems30(2017)

  7. [7]

    PLoS computational biology13(9), 1005768 (2017)

    Russek, E.M., Momennejad, I., Botvinick, M.M., Gershman, S.J., Daw, N.D.: Predictive representations can link model-based reinforcement learning to model- free mechanisms. PLoS computational biology13(9), 1005768 (2017)

  8. [8]

    In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp

    He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recogni- tion. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)

  9. [9]

    In: European Conference on Computer Vision, pp

    He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: European Conference on Computer Vision, pp. 630–645 (2016). Springer

  10. [10]

    In: International Conference on Machine Learning, pp

    Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., Liu, T.: On layer normalization in the transformer architecture. In: International Conference on Machine Learning, pp. 10524–10533 (2020). PMLR

  11. [11]

    Gaussian Error Linear Units (GELUs)

    Hendrycks, D., Gimpel, K.: Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415 (2016)

  12. [12]

    nature518(7540), 529–533 (2015)

    Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G.,et al.: Human-level control through deep reinforcement learning. nature518(7540), 529–533 (2015)

  13. [13]

    Soft Actor-Critic Algorithms and Applications

    Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., et al.: Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905 (2018)

  14. [14]

    Pointer Sentinel Mixture Models

    Merity, S., Xiong, C., Bradbury, J., Socher, R.: Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843 (2016)

  15. [15]

    Honnibal, M., Montani, I., Van Landeghem, S., Boyd, A.: spacy: Industrial- strength natural language processing in python (2020) https://doi.org/10.5281/ 33 zenodo.1212303

  16. [16]

    Vision research46(19), 3177–3197 (2006)

    Petrov, A.A., Dosher, B.A., Lu, Z.-L.: Perceptual learning without feedback in non-stationary contexts: Data and model. Vision research46(19), 3177–3197 (2006)

  17. [17]

    The annals of mathematical statistics22(1), 79–86 (1951)

    Kullback, S., Leibler, R.A.: On information and sufficiency. The annals of mathematical statistics22(1), 79–86 (1951)

  18. [18]

    on lines and planes of closest fit to systems of points in space

    Pearson, K.: Liii. on lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin philosophical magazine and journal of science2(11), 559–572 (1901)

  19. [19]

    Journal of educational psychology24(6), 417 (1933)

    Hotelling, H.: Analysis of a complex of statistical variables into principal components. Journal of educational psychology24(6), 417 (1933)

  20. [20]

    Machine learning52(1), 91–118 (2003)

    Monti, S., Tamayo, P., Mesirov, J., Golub, T.: Consensus clustering: a resampling- based method for class discovery and visualization of gene expression microarray data. Machine learning52(1), 91–118 (2003)

  21. [21]

    Bishop, C.M., Nasrabadi, N.M.: Pattern Recognition and Machine Learning vol

  22. [22]

    Springer, ??? (2006)

  23. [23]

    Journal of machine learning research3(Dec), 583–617 (2002)

    Strehl, A., Ghosh, J.: Cluster ensembles—a knowledge reuse framework for com- bining multiple partitions. Journal of machine learning research3(Dec), 583–617 (2002)

  24. [24]

    Journal of classification2(1), 193– 218 (1985)

    Hubert, L., Arabie, P.: Comparing partitions. Journal of classification2(1), 193– 218 (1985)

  25. [25]

    UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

    McInnes, L., Healy, J., Melville, J.: Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)

  26. [26]

    OUP Oxford, ??? (2001)

    Croft, W.: Radical Construction Grammar: Syntactic Theory in Typological Perspective. OUP Oxford, ??? (2001)

  27. [27]

    Cambridge University Press, ??? (2010)

    Bybee, J.: Language, Usage and Cognition. Cambridge University Press, ??? (2010)

  28. [28]

    In: Annual Meeting of the Berkeley Linguistics Society, pp

    Hopper, P.: Emergent grammar. In: Annual Meeting of the Berkeley Linguistics Society, pp. 139–157 (1987)

  29. [29]

    University of Chicago Press, ??? (1995)

    Goldberg, A.E.: Constructions: A Construction Grammar Approach to Argument Structure. University of Chicago Press, ??? (1995)

  30. [30]

    OUP Oxford, ??? (2006) 34

    Goldberg, A.E.: Constructions at Work: The Nature of Generalization in Lan- guage. OUP Oxford, ??? (2006) 34

  31. [31]

    Oxford University Press, ??? (2013)

    Hoffmann, T., Trousdale, G.: The Oxford Handbook of Construction Grammar. Oxford University Press, ??? (2013)

  32. [32]

    Language variation and change14(3), 261–290 (2002)

    Bybee, J.: Word frequency and context of use in the lexical diffusion of phonet- ically conditioned sound change. Language variation and change14(3), 261–290 (2002)

  33. [33]

    Cambridge University Press, ??? (2019)

    Diessel, H.: The Grammar Network. Cambridge University Press, ??? (2019)

  34. [34]

    Cognitive linguistics2, 50–80 (2019)

    Diessel, H., Dabrowska, E., Divjak, D.: Usage-based construction grammar. Cognitive linguistics2, 50–80 (2019)

  35. [35]

    Princeton University Press, ??? (2019)

    Goldberg, A.E.: Explain Me This: Creativity, Competition, and the Partial Productivity of Constructions. Princeton University Press, ??? (2019)

  36. [36]

    Nature reviews neuroscience1(1), 41–50 (2000)

    Eichenbaum, H.: A cortical–hippocampal system for declarative memory. Nature reviews neuroscience1(1), 41–50 (2000)

  37. [37]

    Psychological review55(4), 189 (1948)

    Tolman, E.C.: Cognitive maps in rats and men. Psychological review55(4), 189 (1948)

  38. [38]

    elife12, 78904 (2023)

    Ekman, M., Kusch, S., Lange, F.P.: Successor-like representation guides the pre- diction of future events in human visual cortex and hippocampus. elife12, 78904 (2023)

  39. [39]

    elife6, 17086 (2017)

    Garvert, M.M., Dolan, R.J., Behrens, T.E.: A map of abstract relational knowledge in the human hippocampal–entorhinal cortex. elife6, 17086 (2017)

  40. [40]

    Journal of Neuroscience42(2), 299–312 (2022)

    Brunec, I.K., Momennejad, I.: Predictive representations in hippocampal and prefrontal hierarchies. Journal of Neuroscience42(2), 299–312 (2022)

  41. [41]

    Elife12, 80671 (2023)

    Bono, J., Zannone, S., Pedrosa, V., Clopath, C.: Learning predictive cognitive maps with spiking neurons during behavior and replays. Elife12, 80671 (2023)

  42. [42]

    PLoS computational biology21(11), 1013696 (2025)

    Kahn, A.E., Bassett, D.S., Daw, N.D.: Trial-by-trial learning of successor rep- resentations in human behavior. PLoS computational biology21(11), 1013696 (2025)

  43. [43]

    Tomasello, M.: Constructing a Language: A Usage-based Theory of Language Acquisition

  44. [44]

    Trends in cognitive sciences20(7), 512–534 (2016)

    Kumaran, D., Hassabis, D., McClelland, J.L.: What learning systems do intel- ligent agents need? complementary learning systems theory updated. Trends in cognitive sciences20(7), 512–534 (2016)

  45. [45]

    Journal of Artificial Intelligence Research 63, 743–788 (2018) 35

    Camacho-Collados, J., Pilehvar, M.T.: From word to sense embeddings: A survey on vector representations of meaning. Journal of Artificial Intelligence Research 63, 743–788 (2018) 35

  46. [46]

    In: Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pp

    Erk, K., Pad´ o, S.: A structured vector space model for word meaning in con- text. In: Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pp. 897–906 (2008)

  47. [47]

    distributional view on prepositional polysemy

    Fonteyn, L.: Varying abstractions: a conceptual vs. distributional view on prepositional polysemy. Glossa: a journal of general linguistics6(1) (2021)

  48. [48]

    Oxford University Press, ??? (2013)

    Wierzbicka, A.: Imprisoned in English: The Hazards of English as a Default Language. Oxford University Press, ??? (2013)

  49. [49]

    Language86(3), 663–687 (2010) https://doi.org/10.1353/lan.2010

    Haspelmath, M.: Comparative concepts and descriptive categories in crosslinguis- tic studies. Language86(3), 663–687 (2010) https://doi.org/10.1353/lan.2010. 0021

  50. [50]

    Behavioral and brain sciences31(5), 489–509 (2008)

    Christiansen, M.H., Chater, N.: Language as shaped by the brain. Behavioral and brain sciences31(5), 489–509 (2008)

  51. [51]

    Current opinion in neurobiology28, 108–114 (2014)

    Kirby, S., Griffiths, T., Smith, K.: Iterated learning and the evolution of language. Current opinion in neurobiology28, 108–114 (2014)

  52. [52]

    Frontiers in Psychology6, 1182 (2015)

    Christiansen, M.H., Chater, N.: The language faculty that wasn’t: a usage-based account of natural language recursion. Frontiers in Psychology6, 1182 (2015)

  53. [53]

    Friston, K.: The free-energy principle: a unified brain theory? Nature reviews neuroscience11(2), 127–138 (2010)

  54. [54]

    Behavioral and brain sciences36(3), 181–204 (2013)

    Clark, A.: Whatever next? predictive brains, situated agents, and the future of cognitive science. Behavioral and brain sciences36(3), 181–204 (2013)

  55. [55]

    Lashley, K.S.,et al.: The Problem of Serial Order in Behavior vol. 21. Bobbs- Merrill Oxford, ??? (1951)

  56. [56]

    In: Psychology of Learning and Motivation vol

    Garrett, M.F.: The analysis of sentence production. In: Psychology of Learning and Motivation vol. 9, pp. 133–177. Elsevier, ??? (1975)

  57. [57]

    Cognitive psychology 18(3), 355–387 (1986)

    Bock, J.K.: Syntactic persistence in language production. Cognitive psychology 18(3), 355–387 (1986)

  58. [58]

    Psychological review93(3), 283 (1986)

    Dell, G.S.: A spreading-activation theory of retrieval in sentence production. Psychological review93(3), 283 (1986)

  59. [59]

    Evidence, experiment and argument in linguistics and philosophy of language, 15–26 (2016)

    Sampson, G., Hinton, M.: Two ideas of creativity. Evidence, experiment and argument in linguistics and philosophy of language, 15–26 (2016)

  60. [60]

    Recherches en communication19, 57–86 (2003)

    Fauconnier, G., Turner, M.: Conceptual blending, form and meaning. Recherches en communication19, 57–86 (2003)

  61. [61]

    Zeitschrift f¨ ur Anglistik und Amerikanistik66(3), 309–328 (2018)

    Herbst, T.: Collo-creativity and blending: Recognizing creativity requires lexical 36 storage in constructional slots. Zeitschrift f¨ ur Anglistik und Amerikanistik66(3), 309–328 (2018)

  62. [62]

    Cognitive Semiotics13(1), 20202027 (2020)

    Uhrig, P.: Creative intentions—the fine line between ‘creative’and ‘wrong’. Cognitive Semiotics13(1), 20202027 (2020)

  63. [63]

    Zeitschrift f¨ ur Anglistik und Amerikanistik66(3), 295–308 (2018)

    Uhrig, P.: I don’t want to go all yoko ono on you. Zeitschrift f¨ ur Anglistik und Amerikanistik66(3), 295–308 (2018)

  64. [64]

    Journal of Foreign Languages and Cultures 8(1), 139–154 (2024)

    Hoffmann, T.: The 5c model of linguistic creativity: Construction grammar as a cognitive theory of verbal creativity. Journal of Foreign Languages and Cultures 8(1), 139–154 (2024)

  65. [65]

    Kriegeskorte, N., Mur, M., Bandettini, P.A.: Representational similarity analysis- connecting the branches of systems neuroscience. Frontiers in systems neuro- science2, 249 (2008) Appendix A Cluster Token Assignments:γ= 0.2, NVA,k= 3 Cluster 0: same, new, use, own, particular, other, s, known, separate, such, form, individual, level, associated, similar,...