Data-driven Machine Learning Cannot Reach Symbolic-level Logical Reasoning -- The Limit of the Scaling Law
Pith reviewed 2026-06-26 01:12 UTC · model grok-4.3
The pith
Supervised machine learning cannot attain the rigour of symbolic logical reasoning
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Sphere neural networks achieve symbolic syllogistic reasoning without training data. However, supervised deep learning faces two methodological limitations that prevent it from reaching the same level: training data cannot distinguish all 24 types of valid syllogistic reasoning, and end-to-end mapping from premises to conclusion introduces contradictory training targets between neural components for pattern recognition and logical reasoning. Experimental results show that Euler Net cannot achieve rigorous syllogistic reasoning, and that the performance of recent ChatGPT models (GPT-5-nano and GPT-5) depends on the surface form of the statements; GPT-5 may reach 100% accuracy while still providing incorrect explanations. The conclusion is that supervised machine learning systems will not attain the rigour of symbolic logical reasoning.
What carries the argument
The two methodological limitations: inability of training data to distinguish the 24 syllogism types and contradictory end-to-end training targets for pattern recognition and logical reasoning.
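For readers who want to see where the number 24 comes from, the sketch below brute-forces all 256 syllogistic forms (three moods from A, E, I, O across four figures) and keeps those whose conclusion holds in every model of the premises. It is a minimal illustration under the classical assumption that all three terms are non-empty (existential import), not the paper's own procedure; the names `REGIONS`, `holds`, `all_models`, and `valid_forms` are hypothetical helpers introduced here.

```python
from itertools import combinations, product

TERMS = ("S", "M", "P")
# The 8 Venn regions: each region is the set of terms an element belongs to.
REGIONS = [frozenset(c) for r in range(4) for c in combinations(TERMS, r)]

# Term order of (major premise, minor premise) for the four figures.
FIGURES = {
    1: (("M", "P"), ("S", "M")),
    2: (("P", "M"), ("S", "M")),
    3: (("M", "P"), ("M", "S")),
    4: (("P", "M"), ("M", "S")),
}


def holds(mood, x, y, model):
    """Truth of a categorical statement; a model is the list of non-empty regions."""
    both = any(x in r and y in r for r in model)
    x_not_y = any(x in r and y not in r for r in model)
    return {"A": not x_not_y, "E": not both, "I": both, "O": x_not_y}[mood]


def all_models(existential_import):
    """Yield every choice of non-empty regions, optionally requiring non-empty terms."""
    for bits in product([False, True], repeat=len(REGIONS)):
        model = [r for r, b in zip(REGIONS, bits) if b]
        if existential_import and not all(any(t in r for r in model) for t in TERMS):
            continue
        yield model


def valid_forms(existential_import=True):
    """Return forms such as 'AAA-1' whose conclusion follows in every model."""
    models = list(all_models(existential_import))
    found = []
    for (m1, m2, m3), (fig, (p1, p2)) in product(
        product("AEIO", repeat=3), FIGURES.items()
    ):
        if all(
            holds(m3, "S", "P", w)
            for w in models
            if holds(m1, *p1, w) and holds(m2, *p2, w)
        ):
            found.append(f"{m1}{m2}{m3}-{fig}")
    return found
```

Calling `valid_forms(True)` should recover the textbook count of 24 valid forms, while `valid_forms(False)` drops the nine forms that depend on existential import. A symbolic checker of this kind decides each form by exhaustive model enumeration, which is the kind of rigour the paper argues a model trained on examples does not reach.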
If this is right
- Training data scaling cannot resolve the indistinguishability of syllogism types.
- End-to-end neural architectures will always face conflicting objectives in logical tasks.
- Models may achieve high accuracy without correct logical explanations.
- Surface form of input affects reasoning performance in language models.
- Symbolic approaches are required to reach rigorous logical reasoning.
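The surface-form point in the list above can be made concrete with a small generator that renders one and the same syllogism in four ways. This is only a sketch: the paper's exact prompts are not reproduced on this page, so the templates, the example terms, the reading of "double words" as two-word terms, and the 20-character length of the random symbols are all assumptions made here for illustration.

```python
import random
import string

# Hypothetical English templates for the four moods.
TEMPLATES = {
    "A": "all {x} are {y}",
    "E": "no {x} are {y}",
    "I": "some {x} are {y}",
    "O": "some {x} are not {y}",
}


def make_terms(form, rng):
    """Return names for the terms (S, M, P) in the requested surface form."""
    if form == "words":
        return ["Greeks", "humans", "mortals"]
    if form == "double_words":  # assumed reading: two-word terms
        return ["ancient Greeks", "living humans", "mortal beings"]
    if form == "simple_symbols":
        return ["S", "M", "P"]
    if form == "long_random_symbols":  # assumed length of 20 characters
        alphabet = string.ascii_uppercase + string.digits
        return ["".join(rng.choices(alphabet, k=20)) for _ in range(3)]
    raise ValueError(f"unknown surface form: {form}")


def render(moods, figure, form, seed=0):
    """Render [major premise, minor premise, conclusion] as strings."""
    rng = random.Random(seed)
    s, m, p = make_terms(form, rng)
    major, minor = {
        1: ((m, p), (s, m)),
        2: ((p, m), (s, m)),
        3: ((m, p), (m, s)),
        4: ((p, m), (m, s)),
    }[figure]
    pairs = [major, minor, (s, p)]
    return [TEMPLATES[q].format(x=x, y=y) for q, (x, y) in zip(moods, pairs)]
```

Because the logical form is fixed and only the term names change, any difference in a model's answers across `render("AAA", 1, form)` for the four values of `form` reflects sensitivity to surface form rather than to logic.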
Where Pith is reading between the lines
- This limitation may extend to other forms of logical reasoning beyond syllogisms.
- Hybrid systems combining neural networks with symbolic logic could potentially avoid these barriers.
- The finding challenges the idea that scaling alone will solve all AI reasoning problems.
- Future benchmarks for logical reasoning should test for explanation correctness, not just accuracy.
Load-bearing premise
The two methodological limitations cannot be overcome by any change to data-driven architectures or training methods.
What would settle it
Demonstrating a data-driven model that correctly performs all 24 syllogism types with rigorous explanations independent of surface form would falsify the claim.
Original abstract
Sphere neural networks have achieved symbolic level syllogistic reasoning without training data, raising the question of where the limit of the scaling law for logical reasoning lies, i.e., whether data-driven machine learning systems can achieve the same level by increasing training data and training time. We show two methodological limitations that prevent supervised deep learning from reaching the symbolic-level syllogistic reasoning: (1) training data can not distinguish all 24 types of valid syllogistic reasoning; (2) end-to-end mapping from premises to conclusion introduces contradictory training targets between neural components for pattern recognition and logical reasoning. Beside theoretical analysis, we experimentally illustrate that Euler Net cannot achieve rigorous syllogistic reasoning. We further challenge the most recent ChatGPTs (GPT-5-nano and GPT-5) to determine the satisfiability of syllogistic statements in four surface forms (patterns): words, double words, simple symbols, and long random symbols, showing that surface forms affect the reasoning performance and that ChatGPT GPT-5 may reach 100% accuracy but still provide incorrect explanations. As empirical training processes are stopped after achieving 100% accuracy, we conclude that supervised machine learning systems will not attain the rigour of symbolic logical reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that supervised deep learning systems cannot reach symbolic-level syllogistic reasoning due to two methodological limitations: training data cannot distinguish all 24 valid syllogism types, and end-to-end premise-to-conclusion mappings create contradictory training targets between pattern recognition and logical components. This is supported by theoretical analysis, experiments showing that Euler Net fails to achieve rigorous syllogistic reasoning, and tests on GPT-5-nano and GPT-5 demonstrating that surface forms affect performance and that models may reach 100% accuracy while providing incorrect explanations. The conclusion is that supervised ML systems will not attain the rigor of symbolic logical reasoning.
Significance. If the result holds, it would challenge the scaling hypothesis for logical reasoning in data-driven ML by identifying fundamental barriers, in contrast to symbolic methods such as sphere neural networks. The paper supplies both theoretical arguments and empirical illustrations on specific models, though the generality of the claimed limits across architectures is central to its impact.
major comments (3)
- [Theoretical analysis] Theoretical analysis: the two limitations are presented as preventing supervised deep learning from reaching symbolic-level reasoning, but the sections do not derive the indistinguishability of the 24 syllogism types or the contradictory gradients as invariants that necessarily apply to every possible data-driven regime (e.g., models trained on explicit reasoning traces, multi-objective losses separating figure/mood classification from conclusion generation, or architectures with differentiable memory).
- [Abstract and conclusion] Abstract and conclusion: the strong claim that 'supervised machine learning systems will not attain the rigour of symbolic logical reasoning' rests on the assertion that the two limitations are fundamental barriers that no alternative data-driven architecture or training regime can circumvent, yet the provided analysis demonstrates the limitations only for standard supervised setups and Euler Net.
- [Experiments] Experiments section: the Euler Net results and GPT-5 surface-form tests illustrate failures under the tested conditions, but without full methods, error analysis, or controls for alternative regimes the support for the impossibility result remains limited, consistent with the unverifiable support noted for the strong negative conclusion.
minor comments (1)
- [Abstract] Abstract: 'Beside theoretical analysis' should read 'Besides theoretical analysis'.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We address each major comment below and will revise the manuscript to better scope our claims to standard end-to-end supervised regimes while strengthening the experimental presentation.
- Referee: [Theoretical analysis] Theoretical analysis: the two limitations are presented as preventing supervised deep learning from reaching symbolic-level reasoning, but the sections do not derive the indistinguishability of the 24 syllogism types or the contradictory gradients as invariants that necessarily apply to every possible data-driven regime (e.g., models trained on explicit reasoning traces, multi-objective losses separating figure/mood classification from conclusion generation, or architectures with differentiable memory).
  Authors: The theoretical analysis targets the standard supervised end-to-end paradigm on premise-conclusion pairs. In this setting the 24 valid types are indistinguishable because many share identical input-output surface mappings under typical syllogism corpora, and the single loss produces opposing gradients between pattern recognition and logical deduction. We do not derive these as invariants across all conceivable data-driven regimes; approaches with explicit traces, multi-objective losses, or differentiable memory could in principle avoid them. We will revise the relevant sections to state the scope explicitly and note that alternative structured regimes lie outside the present argument.
  Revision: yes
- Referee: [Abstract and conclusion] Abstract and conclusion: the strong claim that 'supervised machine learning systems will not attain the rigour of symbolic logical reasoning' rests on the assertion that the two limitations are fundamental barriers that no alternative data-driven architecture or training regime can circumvent, yet the provided analysis demonstrates the limitations only for standard supervised setups and Euler Net.
  Authors: We agree the current wording is too broad. The analysis and experiments address standard end-to-end supervised learning and the Euler Net architecture. We will revise the abstract and conclusion to limit the claim to these standard supervised regimes and to remove the implication that no alternative data-driven approach could circumvent the identified limitations.
  Revision: yes
- Referee: [Experiments] Experiments section: the Euler Net results and GPT-5 surface-form tests illustrate failures under the tested conditions, but without full methods, error analysis, or controls for alternative regimes the support for the impossibility result remains limited, consistent with the unverifiable support noted for the strong negative conclusion.
  Authors: We will expand the experiments section with complete methodological details, systematic error analysis, and additional controls. These additions will strengthen the empirical illustration for the regimes we actually tested. As noted in the responses above, we will also clarify that the results do not address alternative regimes.
  Revision: yes
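The "opposing gradients" remark in the first response can be pictured with a deliberately tiny toy: one shared parameter pulled toward two different targets by a single summed loss. This is a generic illustration of gradient conflict, not the paper's analysis or its architecture; the two quadratic losses, their targets (+1 and -1), and the function names are assumptions chosen here for clarity.

```python
def grad_pattern(w, target=1.0):
    """Gradient of the hypothetical pattern-recognition loss (w - target)**2."""
    return 2.0 * (w - target)


def grad_logic(w, target=-1.0):
    """Gradient of the hypothetical logical-reasoning loss (w - target)**2."""
    return 2.0 * (w - target)


def train(w=0.8, lr=0.1, steps=200):
    """Plain gradient descent on the sum of the two losses."""
    for _ in range(steps):
        w -= lr * (grad_pattern(w) + grad_logic(w))
    return w


w_final = train()
# At convergence the summed gradient vanishes, yet each component gradient is
# non-zero and they point in opposite directions: neither objective is met.
conflict = grad_pattern(w_final) * grad_logic(w_final) < 0
```

In this toy, training settles at a compromise between the two targets. A multi-objective setup of the kind the referee mentions would give each objective its own parameters or loss head, which is why such regimes fall outside the argument as the authors now scope it.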
Circularity Check
Minor self-citation present but not load-bearing for central claim
The paper's impossibility result is derived from explicit analysis of the 24 syllogism types' data properties and the conflicting gradients in end-to-end premise-to-conclusion mappings. These steps rely on the structure of the syllogistic domain and supervised training objectives rather than any definitional equivalence or fitted parameter. The opening reference to Sphere neural networks motivates the scaling-law question but is not invoked to establish the two limitations or the final conclusion about supervised systems. No equation or claim reduces to a self-citation chain, a renamed empirical pattern, or an ansatz smuggled from prior work by the same authors.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Syllogistic reasoning requires the ability to distinguish all 24 valid forms as distinct.
Appendix A: The list of 24 valid types of syllogistic reasoning
Table 3: List of all 24 valid syllogisms, each having a name whose vowels indicate types of moods, e.g., the vowels in 'CELARENT' indicate universal negative (E), universal affirmative (A), and universal negative (E). (Table body truncated in the source.)