arxiv: 2606.26454 · v1 · pith:GRISKCCInew · submitted 2026-06-24 · 💻 cs.AI

Data-driven Machine Learning Cannot Reach Symbolic-level Logical Reasoning -- The Limit of the Scaling Law

Pith reviewed 2026-06-26 01:12 UTC · model grok-4.3

classification 💻 cs.AI
keywords syllogistic reasoning · scaling laws · machine learning limits · symbolic reasoning · neural networks · logical reasoning · ChatGPT · data-driven AI

The pith

Supervised machine learning cannot attain the rigour of symbolic logical reasoning

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that data-driven machine learning systems are fundamentally limited in achieving symbolic-level logical reasoning for syllogisms. It identifies two barriers that, on its account, scaling cannot overcome: training data cannot distinguish all 24 valid syllogism types, and end-to-end training creates contradictory targets for pattern recognition versus logical reasoning. A reader would care because this suggests that adding data and compute hits a hard ceiling on rigorous logic tasks, whereas symbolic methods can reason without training data. Experiments with Euler Net and GPT models are offered as evidence that accuracy alone does not ensure logical rigor.

Core claim

Sphere neural networks achieve symbolic-level syllogistic reasoning without training data. However, supervised deep learning faces two methodological limitations that prevent it from reaching the same level: training data cannot distinguish all 24 types of valid syllogistic reasoning, and end-to-end mapping from premises to conclusion introduces contradictory training targets between neural components for pattern recognition and logical reasoning. Experimental results show that Euler Net cannot achieve rigorous syllogistic reasoning, and that the performance of recent ChatGPT models depends on the surface forms of statements, sometimes reaching 100% accuracy while providing incorrect explanations. The conclusion is that supervised machine learning systems will not attain the rigour of symbolic logical reasoning.

What carries the argument

The two methodological limitations: inability of training data to distinguish the 24 syllogism types and contradictory end-to-end training targets for pattern recognition and logical reasoning.
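
To make the figure of 24 concrete, here is a minimal sketch that enumerates all 256 syllogistic forms (4 figures times 4³ mood combinations) and checks each one by brute force over set-theoretic models. It is not the paper's code: the names `holds` and `valid_forms`, and the encoding of a model as the set of inhabited Venn regions, are assumptions made for illustration. Under the traditional assumption that every term is non-empty (existential import) the enumeration yields 24 valid forms; without it, 15 remain.

```python
from itertools import combinations, product

S, M, P = 0, 1, 2  # indices of the minor, middle, and major term

# A region of the three-circle Venn diagram: (in S, in M, in P).
REGIONS = list(product((False, True), repeat=3))

# A model is the set of regions that contain at least one individual.
MODELS = [frozenset(c) for k in range(len(REGIONS) + 1)
          for c in combinations(REGIONS, k)]

# (subject, predicate) of the major and minor premise for each figure.
FIGURES = {
    1: ((M, P), (S, M)),
    2: ((P, M), (S, M)),
    3: ((M, P), (M, S)),
    4: ((P, M), (M, S)),
}

def holds(mood, x, y, model):
    """Truth of one categorical statement about terms x and y in a model."""
    xs = [r for r in model if r[x]]
    if mood == "A":  # All x are y
        return all(r[y] for r in xs)
    if mood == "E":  # No x are y
        return not any(r[y] for r in xs)
    if mood == "I":  # Some x are y
        return any(r[y] for r in xs)
    if mood == "O":  # Some x are not y
        return any(not r[y] for r in xs)
    raise ValueError(mood)

def valid_forms(existential_import=True):
    """Return (moods, figure) for every form whose conclusion holds in all
    admissible models that satisfy both premises."""
    models = MODELS
    if existential_import:
        # Traditional reading: S, M, and P each have at least one member.
        models = [m for m in MODELS
                  if all(any(r[t] for r in m) for t in (S, M, P))]
    found = []
    for fig, (major, minor) in FIGURES.items():
        for m1, m2, c in product("AEIO", repeat=3):
            if all(holds(c, S, P, m) for m in models
                   if holds(m1, *major, m) and holds(m2, *minor, m)):
                found.append((m1 + m2 + c, fig))
    return found

if __name__ == "__main__":
    print(len(valid_forms(existential_import=True)))   # 24
    print(len(valid_forms(existential_import=False)))  # 15
```

A symbolic checker of this kind needs no training data, which is the contrast the paper draws with supervised learners.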

If this is right

  • Training data scaling cannot resolve the indistinguishability of syllogism types.
  • End-to-end neural architectures will always face conflicting objectives in logical tasks.
  • Models may achieve high accuracy without correct logical explanations.
  • Surface form of input affects reasoning performance in language models.
  • Symbolic approaches are required to reach rigorous logical reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This limitation may extend to other forms of logical reasoning beyond syllogisms.
  • Hybrid systems combining neural networks with symbolic logic could potentially avoid these barriers.
  • The finding challenges the idea that scaling alone will solve all AI reasoning problems.
  • Future benchmarks for logical reasoning should test for explanation correctness, not just accuracy.

Load-bearing premise

That the two methodological limitations cannot be overcome by any changes in data-driven architectures or training methods.

What would settle it

Demonstrating a data-driven model that correctly performs all 24 syllogism types with rigorous explanations independent of surface form would falsify the claim.
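
The falsification test above can be sketched as an evaluation loop that credits a syllogism only when the verdict is correct in every surface form. Everything in this sketch is hypothetical: `predict` stands in for a data-driven model, the four renderers only imitate the categories named in the abstract (words, double words, simple symbols, long random symbols), the vocabulary is invented, and the task is framed as a validity judgement. Checking explanations is left out because it would need a separate verifier.

```python
import random
import string

# Sentence templates for the four categorical moods.
TEMPLATES = {
    "A": "All {x} are {y}.",
    "E": "No {x} are {y}.",
    "I": "Some {x} are {y}.",
    "O": "Some {x} are not {y}.",
}

# (subject, predicate) indices into [S, M, P] for major and minor premise.
FIGURES = {
    1: ((1, 2), (0, 1)),
    2: ((2, 1), (0, 1)),
    3: ((1, 2), (1, 0)),
    4: ((2, 1), (1, 0)),
}

FORMS = ("words", "double_words", "simple_symbols", "long_random_symbols")

def make_vocab(form, rng):
    """Return names for the terms S, M, P in one (assumed) surface form."""
    if form == "words":
        return ["Greeks", "humans", "mortals"]
    if form == "double_words":
        return ["ancient Greeks", "living humans", "mortal beings"]
    if form == "simple_symbols":
        return ["X", "Y", "Z"]
    if form == "long_random_symbols":
        alphabet = string.ascii_uppercase + string.digits
        return ["".join(rng.choices(alphabet, k=16)) for _ in range(3)]
    raise ValueError(f"unknown surface form: {form}")

def render(moods, figure, names):
    """Render major premise, minor premise, and conclusion as text."""
    major, minor = FIGURES[figure]
    pairs = (major, minor, (0, 2))  # the conclusion is always S-P
    return " ".join(TEMPLATES[m].format(x=names[x], y=names[y])
                    for m, (x, y) in zip(moods, pairs))

def surface_invariant_accuracy(predict, items, forms=FORMS, seed=0):
    """Fraction of items judged correctly in every listed surface form."""
    rng = random.Random(seed)
    passed = 0
    for moods, figure, is_valid in items:
        passed += all(
            predict(render(moods, figure, make_vocab(form, rng))) == is_valid
            for form in forms
        )
    return passed / len(items)

def surface_biased(text):
    """Toy stand-in for a model: answers 'valid' only on a familiar word."""
    return "Greeks" in text

# (moods, figure, ground-truth validity); AAE-1 is an invalid form.
ITEMS = [("AAA", 1, True), ("EAE", 1, True), ("AAE", 1, False)]

if __name__ == "__main__":
    print(surface_invariant_accuracy(surface_biased, ITEMS, forms=("words",)))
    print(surface_invariant_accuracy(surface_biased, ITEMS))
```

The toy predictor scores two out of three on the word form alone and zero once all four forms must agree, which is the gap between accuracy and rigour that the claim turns on.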

Original abstract

Sphere neural networks have achieved symbolic level syllogistic reasoning without training data, raising the question of where the limit of the scaling law for logical reasoning lies, i.e., whether data-driven machine learning systems can achieve the same level by increasing training data and training time. We show two methodological limitations that prevent supervised deep learning from reaching the symbolic-level syllogistic reasoning: (1) training data can not distinguish all 24 types of valid syllogistic reasoning; (2) end-to-end mapping from premises to conclusion introduces contradictory training targets between neural components for pattern recognition and logical reasoning. Beside theoretical analysis, we experimentally illustrate that Euler Net cannot achieve rigorous syllogistic reasoning. We further challenge the most recent ChatGPTs (GPT-5-nano and GPT-5) to determine the satisfiability of syllogistic statements in four surface forms (patterns): words, double words, simple symbols, and long random symbols, showing that surface forms affect the reasoning performance and that ChatGPT GPT-5 may reach 100% accuracy but still provide incorrect explanations. As empirical training processes are stopped after achieving 100% accuracy, we conclude that supervised machine learning systems will not attain the rigour of symbolic logical reasoning.
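
The abstract frames the GPT experiments as deciding the satisfiability of syllogistic statements. As a rough illustration of what the symbolic ground truth for such a task can look like, the sketch below decides satisfiability by exhaustive model search and derives validity from it: a syllogism is valid exactly when its premises together with the contradictory of its conclusion are unsatisfiable. The function names, the tuple encoding `(mood, subject, predicate)`, and the `nonempty_terms` switch are assumptions for illustration, not the paper's protocol. The search is doubly exponential in the number of terms, so it is only practical for a handful of terms.

```python
from itertools import combinations, product

# Each mood's contradictory: exactly one of the pair is true in any model.
CONTRADICTORY = {"A": "O", "O": "A", "E": "I", "I": "E"}

def holds(mood, x, y, model):
    """Truth of one statement; x and y are term indices into a region tuple."""
    xs = [r for r in model if r[x]]
    if mood == "A":  # All x are y
        return all(r[y] for r in xs)
    if mood == "E":  # No x are y
        return not any(r[y] for r in xs)
    if mood == "I":  # Some x are y
        return any(r[y] for r in xs)
    if mood == "O":  # Some x are not y
        return any(not r[y] for r in xs)
    raise ValueError(mood)

def satisfiable(statements, nonempty_terms=True):
    """True if some model makes every (mood, subject, predicate) true."""
    statements = list(statements)
    terms = sorted({t for _, x, y in statements for t in (x, y)})
    index = {t: i for i, t in enumerate(terms)}
    # One Venn region per membership pattern over the terms.
    regions = list(product((False, True), repeat=len(terms)))
    for k in range(len(regions) + 1):
        for model in combinations(regions, k):
            if nonempty_terms and not all(
                    any(r[i] for r in model) for i in index.values()):
                continue  # some term would be empty in this model
            if all(holds(m, index[x], index[y], model)
                   for m, x, y in statements):
                return True
    return False

def is_valid(premises, conclusion, nonempty_terms=True):
    """Valid iff premises plus the contradicted conclusion cannot all hold."""
    mood, x, y = conclusion
    refutation = list(premises) + [(CONTRADICTORY[mood], x, y)]
    return not satisfiable(refutation, nonempty_terms)

if __name__ == "__main__":
    barbara = [("A", "M", "P"), ("A", "S", "M")]
    print(is_valid(barbara, ("A", "S", "P")))                        # True
    print(is_valid(barbara, ("I", "S", "P")))                        # True
    print(is_valid(barbara, ("I", "S", "P"), nonempty_terms=False))  # False
```

Because the term names are opaque keys here, the verdict cannot depend on whether they are words or random symbols.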

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims that supervised deep learning systems cannot reach symbolic-level syllogistic reasoning due to two methodological limitations: training data cannot distinguish all 24 valid syllogism types, and end-to-end premise-to-conclusion mappings create contradictory training targets between pattern recognition and logical components. This is supported by theoretical analysis, experiments showing that Euler Net fails to achieve rigorous syllogistic reasoning, and tests on GPT-5-nano and GPT-5 demonstrating that surface forms affect performance and that models may reach 100% accuracy while providing incorrect explanations. The conclusion is that supervised ML systems will not attain the rigor of symbolic logical reasoning.

Significance. If the result holds, it would challenge the scaling hypothesis for logical reasoning in data-driven ML by identifying fundamental barriers, in contrast to symbolic methods such as sphere neural networks. The paper supplies both theoretical arguments and empirical illustrations on specific models, though the generality of the claimed limits across architectures is central to its impact.

major comments (3)
  1. [Theoretical analysis] Theoretical analysis: the two limitations are presented as preventing supervised deep learning from reaching symbolic-level reasoning, but the sections do not derive the indistinguishability of the 24 syllogism types or the contradictory gradients as invariants that necessarily apply to every possible data-driven regime (e.g., models trained on explicit reasoning traces, multi-objective losses separating figure/mood classification from conclusion generation, or architectures with differentiable memory).
  2. [Abstract and conclusion] Abstract and conclusion: the strong claim that 'supervised machine learning systems will not attain the rigour of symbolic logical reasoning' rests on the assertion that the two limitations are fundamental barriers that no alternative data-driven architecture or training regime can circumvent, yet the provided analysis demonstrates the limitations only for standard supervised setups and Euler Net.
  3. [Experiments] Experiments section: the Euler Net results and GPT-5 surface-form tests illustrate failures under the tested conditions, but without full methods, error analysis, or controls for alternative regimes the support for the impossibility result remains limited, consistent with the unverifiable support noted for the strong negative conclusion.
minor comments (1)
  1. [Abstract] Abstract: 'Beside theoretical analysis' should read 'Besides theoretical analysis'.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and will revise the manuscript to better scope our claims to standard end-to-end supervised regimes while strengthening the experimental presentation.

Point-by-point responses
  1. Referee: [Theoretical analysis] Theoretical analysis: the two limitations are presented as preventing supervised deep learning from reaching symbolic-level reasoning, but the sections do not derive the indistinguishability of the 24 syllogism types or the contradictory gradients as invariants that necessarily apply to every possible data-driven regime (e.g., models trained on explicit reasoning traces, multi-objective losses separating figure/mood classification from conclusion generation, or architectures with differentiable memory).

    Authors: The theoretical analysis targets the standard supervised end-to-end paradigm on premise-conclusion pairs. In this setting the 24 valid types are indistinguishable because many share identical input-output surface mappings under typical syllogism corpora, and the single loss produces opposing gradients between pattern recognition and logical deduction. We do not derive these as invariants across all conceivable data-driven regimes; approaches with explicit traces, multi-objective losses, or differentiable memory could in principle avoid them. We will revise the relevant sections to state the scope explicitly and note that alternative structured regimes lie outside the present argument. revision: yes

  2. Referee: [Abstract and conclusion] Abstract and conclusion: the strong claim that 'supervised machine learning systems will not attain the rigour of symbolic logical reasoning' rests on the assertion that the two limitations are fundamental barriers that no alternative data-driven architecture or training regime can circumvent, yet the provided analysis demonstrates the limitations only for standard supervised setups and Euler Net.

    Authors: We agree the current wording is too broad. The analysis and experiments address standard end-to-end supervised learning and the Euler Net architecture. We will revise the abstract and conclusion to limit the claim to these standard supervised regimes and to remove the implication that no alternative data-driven approach could circumvent the identified limitations. revision: yes

  3. Referee: [Experiments] Experiments section: the Euler Net results and GPT-5 surface-form tests illustrate failures under the tested conditions, but without full methods, error analysis, or controls for alternative regimes the support for the impossibility result remains limited, consistent with the unverifiable support noted for the strong negative conclusion.

    Authors: We will expand the experiments section with complete methodological details, systematic error analysis, and additional controls. These additions will strengthen the empirical illustration for the regimes we actually tested. As noted in the responses above, we will also clarify that the results do not address alternative regimes. revision: yes

Circularity Check

0 steps flagged

Minor self-citation present but not load-bearing for central claim

full rationale

The paper's impossibility result is derived from explicit analysis of the 24 syllogism types' data properties and the conflicting gradients in end-to-end premise-to-conclusion mappings. These steps rely on the structure of the syllogistic domain and supervised training objectives rather than any definitional equivalence or fitted parameter. The opening reference to Sphere neural networks motivates the scaling-law question but is not invoked to establish the two limitations or the final conclusion about supervised systems. No equation or claim reduces to a self-citation chain, a renamed empirical pattern, or an ansatz smuggled from prior work by the same authors.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No free parameters, invented entities, or ad-hoc axioms are introduced in the abstract; the argument relies on domain assumptions about syllogistic logic.

axioms (1)
  • domain assumption: Syllogistic reasoning requires the ability to tell all 24 valid forms apart.
    This underpins the first methodological limitation claim.


