pith. machine review for the scientific record.

arxiv: 2602.13934 · v4 · submitted 2026-02-15 · 💻 cs.LG · cs.CL

Recognition: 2 Lean theorem links

Why Code, Why Now: An Information-Theoretic Perspective on the Limits of Machine Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 21:54 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords machine learning limits · information structure · code generation · reinforcement learning · learnability hierarchy · scaling laws · feedback density · expressibility

The pith

The ceiling on machine learning progress is set by the information structure of the task rather than model size or algorithms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that machine learning outcomes depend primarily on the information structure of the task itself. Code generation advances reliably because it supplies dense, local, and verifiable feedback at every token, while most reinforcement learning problems lack equivalent signals. The work introduces a five-level hierarchy of learnability derived from distinctions among expressibility, computability, and learnability to classify tasks and explain why scaling behaves differently across domains. This view challenges the assumption that larger models alone will overcome limits on tasks low in the hierarchy.

Core claim

The central claim is that limits on machine learning are determined by the information structure of the task. The paper distinguishes expressibility, computability, and learnability as separate properties of computational problems, establishes their relationships, and proposes a five-level hierarchy of learnability that ranks tasks by feedback quality. This hierarchy accounts for why supervised learning on code scales predictably while reinforcement learning generally does not.

What carries the argument

A five-level hierarchy of learnability that classifies tasks according to the density, locality, and verifiability of their feedback signals.
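One way to make that classification concrete is a small decision procedure over the three feedback attributes. This is a sketch under our own assumptions: the level boundaries, the function name, and the attribute flags below are illustrative, not definitions taken from the paper.

```python
def learnability_level(dense: bool, local: bool, verifiable: bool,
                       any_signal: bool = True) -> int:
    """Illustrative mapping from feedback attributes to a 1-5 learnability level.

    Level 1 is dense, local, verifiable feedback (code-like); Level 5 is no
    reliable signal at all. The boundaries are assumptions, not the paper's.
    """
    if not any_signal:
        return 5  # no reliable learning signal
    if verifiable and dense and local:
        return 1  # per-token verifiable feedback, code-like
    if verifiable:
        return 2  # verifiable but delayed or global
    if dense or local:
        return 3  # partial signal without verification
    return 4      # sparse, non-verifiable reward, typical RL

print(learnability_level(dense=True, local=True, verifiable=True))     # code generation: 1
print(learnability_level(dense=False, local=False, verifiable=False))  # typical RL: 4
```

The point of the sketch is only that the hierarchy is a function of the task's feedback, not of the model.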

If this is right

  • Supervised learning on code will continue to scale predictably with model size because of its position in the hierarchy.
  • Reinforcement learning tasks lacking dense local feedback will hit performance ceilings independent of model scale.
  • Diagnosing a task's place in the hierarchy will prove more predictive of scaling success than model architecture choices.
  • The assumption that scaling laws apply uniformly across machine learning requires revision for tasks with sparse feedback.
  • A unified template comparing computational problems makes explicit where implications among expressibility, computability, and learnability hold or fail.
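Taken together, the first two bullets amount to a level-dependent power law. A toy sketch of that claim (the exponent values, the level labels, and the functional form are all illustrative assumptions, not numbers from the paper):

```python
# Toy power law: loss ~ N^(-alpha), where the exponent alpha shrinks as
# feedback gets sparser. All exponent values are illustrative assumptions.
ALPHA_BY_LEVEL = {1: 0.30, 2: 0.20, 3: 0.10, 4: 0.03, 5: 0.0}

def toy_loss(n_params: float, level: int) -> float:
    """Loss of a model with n_params parameters on a task at the given level."""
    return n_params ** (-ALPHA_BY_LEVEL[level])

# A level-1 task (code-like feedback) keeps improving with scale;
# a level-4 task (sparse reward) barely moves; level 5 is flat.
for n in (1e9, 1e10, 1e11):
    print(f"N={n:.0e}  level1={toy_loss(n, 1):.4f}  level4={toy_loss(n, 4):.3f}")
```

Under this caricature, scaling up a level-5 task changes nothing, which is the performance-ceiling claim in its starkest form.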

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Tasks could be engineered to rise in the hierarchy by adding mechanisms for local verification.
  • The framework suggests prioritizing feedback-dense data collection over pure increases in model capacity.
  • Hybrid systems that embed code-like verification into reinforcement learning environments might improve scaling.
  • Quantifying information density per task could yield practical tests for predicting scaling exponents in advance.

Load-bearing premise

The proposed five-level hierarchy accurately predicts which tasks will exhibit reliable scaling behavior.

What would settle it

Demonstration of a task low in the hierarchy that nevertheless scales reliably with model size, or a high-hierarchy task that fails to scale despite dense feedback.

read the original abstract

This paper offers a new perspective on the limits of machine learning: the ceiling on progress is set not by model size or algorithm choice but by the information structure of the task itself. Code generation has progressed more reliably than reinforcement learning, largely because code provides dense, local, verifiable feedback at every token, whereas most reinforcement learning problems do not. This difference in feedback quality is not binary but graded. We propose a five-level hierarchy of learnability based on information structure and argue that diagnosing a task's position in this hierarchy is more predictive of scaling outcomes than any property of the model. The hierarchy rests on a formal distinction among three properties of computational problems (expressibility, computability, and learnability). We establish their pairwise relationships, including where implications hold and where they fail, and present a unified template that makes the structural differences explicit. The analysis suggests why supervised learning on code scales predictably while reinforcement learning does not, and why the common assumption that scaling alone will solve remaining ML challenges warrants scrutiny.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that limits on machine learning progress are set by the information structure of the task itself rather than model size or algorithm choice. It proposes a five-level hierarchy of learnability derived from formal distinctions among expressibility, computability, and learnability, along with a unified template for their relationships. This framework is used to explain why supervised learning on code exhibits reliable scaling while reinforcement learning does not, attributing the difference to the density and verifiability of feedback signals, and to argue that diagnosing a task's position in the hierarchy is more predictive of scaling outcomes than model properties.

Significance. If the hierarchy can be made precise with explicit level definitions, task mappings, and empirical links to scaling behavior, the work would offer a valuable conceptual tool for understanding why certain domains scale predictably while others do not. It could shift research focus toward characterizing task information structure and away from purely scaling-based assumptions. The absence of derivations, concrete examples, or validation in the current manuscript limits its immediate impact to a high-level perspective rather than a demonstrated predictor.

major comments (2)
  1. [Abstract and hierarchy proposal] The central claim that the five-level hierarchy predicts scaling reliability rests on distinctions among expressibility, computability, and learnability, but the manuscript provides no explicit definitions of the five levels, no assignment of concrete tasks (e.g., code generation vs. RL) to specific levels, and no derivation showing how level membership maps to observable quantities such as scaling exponents or feedback density. This renders the predictive assertion untested.
  2. [Formal distinctions and unified template] The abstract states that pairwise relationships among expressibility, computability, and learnability are established 'including where implications hold and where they fail,' yet no such relationships, proofs, or counterexamples are supplied in the text. Without these, the unified template cannot be evaluated for internal consistency or novelty relative to standard computability theory.
minor comments (1)
  1. [Abstract] The abstract refers to 'dense, local, verifiable feedback at every token' for code but does not quantify this property or contrast it formally with RL feedback structures; adding a brief operational definition would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the original manuscript presented the hierarchy and formal relationships at too high a level, limiting testability. The revised version adds explicit level definitions, task mappings, a derivation linking levels to scaling exponents via feedback density, and a section with proofs, implications, and counterexamples for the expressibility-computability-learnability relationships.

read point-by-point responses
  1. Referee: [Abstract and hierarchy proposal] The central claim that the five-level hierarchy predicts scaling reliability rests on distinctions among expressibility, computability, and learnability, but the manuscript provides no explicit definitions of the five levels, no assignment of concrete tasks (e.g., code generation vs. RL) to specific levels, and no derivation showing how level membership maps to observable quantities such as scaling exponents or feedback density. This renders the predictive assertion untested.

    Authors: We agree the original text was insufficiently precise. The revision adds Section 3 with explicit definitions: Level 1 requires dense, local, per-token verifiable feedback; Level 2 has verifiable but delayed feedback; Level 3 has partial verifiability; Level 4 has sparse non-verifiable rewards; Level 5 has no reliable signal. Code generation is assigned to Level 1 and typical RL to Level 4. We derive the scaling link by modeling feedback density as inversely proportional to gradient variance, yielding a closed-form relation where lower levels produce scaling exponents closer to the information-theoretic optimum. This makes the predictive claim directly testable against existing scaling curves. revision: yes

  2. Referee: [Formal distinctions and unified template] The abstract states that pairwise relationships among expressibility, computability, and learnability are established 'including where implications hold and where they fail,' yet no such relationships, proofs, or counterexamples are supplied in the text. Without these, the unified template cannot be evaluated for internal consistency or novelty relative to standard computability theory.

    Authors: The revision includes a new Section 4 that states and proves the relationships. Expressibility implies computability by definition, but neither implies learnability; counterexamples include the halting problem (expressible and computable yet unlearnable from finite samples) and certain sparse-reward MDPs (computable but not PAC-learnable). The unified template is given as an implication diagram plus a table of where each implication fails. Novelty lies in extending classical computability to the finite-data, noisy-feedback regime relevant to scaling laws. revision: yes
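The implication structure claimed in this response can be written down as a small table. The snippet encodes only what the simulated rebuttal states, counterexamples included; the dictionary layout is ours.

```python
# Pairwise implications among the three properties, as stated in the
# simulated rebuttal: expressibility implies computability, but neither
# property implies learnability.
implications = {
    ("expressibility", "computability"): (True, None),
    ("expressibility", "learnability"): (False, "halting problem"),
    ("computability", "learnability"): (False, "sparse-reward MDPs"),
}

for (p, q), (holds, counterexample) in implications.items():
    verdict = "implies" if holds else "does not imply"
    note = f" (counterexample: {counterexample})" if counterexample else ""
    print(f"{p} {verdict} {q}{note}")
```

This is exactly the "implication diagram plus failure table" shape the rebuttal describes for Section 4.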

Circularity Check

0 steps flagged

No circularity: hierarchy derived from independent computability distinctions

full rationale

The paper proposes a five-level hierarchy of learnability resting on formal distinctions among expressibility, computability, and learnability, then establishes their pairwise relationships and a unified template. No step reduces a prediction or central claim to a fitted parameter, self-definition, or self-citation chain; the distinctions are presented as external formal properties used to classify tasks, with no equations or definitions shown to be equivalent by construction to the scaling predictions they support. The derivation remains self-contained against external benchmarks of computability theory.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on a domain assumption that information structure is the primary determinant of learnability and scaling, plus the introduction of a new hierarchical classification without supporting empirical handles.

axioms (1)
  • domain assumption Expressibility, computability, and learnability are distinct properties of computational problems whose pairwise relationships can be formally established.
    This distinction underpins the five-level hierarchy and is used to explain differential scaling between code and RL tasks.
invented entities (1)
  • Five-level hierarchy of learnability (no independent evidence)
    purpose: To classify tasks by information structure and predict scaling reliability.
    Newly proposed classification system without independent falsifiable evidence outside the framework itself.

pith-pipeline@v0.9.0 · 5469 in / 1333 out tokens · 28633 ms · 2026-05-15T21:54:49.074493+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 7 internal anchors

  1. Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools. Pearson, 2nd edition, 2006.
  2. Dana Angluin. Inductive inference of formal languages from positive data. Information and Control, 45(2):117–135, 1980.
  3. Marcelo Arenas, Pablo Barceló, Luis Cofré, and Alexander Kozachinskiy. Language generation: Complexity barriers and implications for learning. arXiv preprint arXiv:2511.05759, 2025.
  4. Omar Besbes, Yonatan Gur, and Assaf Zeevi. Non-stationary stochastic optimization. Operations Research, 63(5):1227–1244, 2015.
  5. Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4):929–965, 1989.
  6. Moses Charikar and Chirag Pabbaraju. On the non-uniform generation of countable language collections. arXiv preprint arXiv:2411.15364, 2024.
  7. Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...
  8. Zhihan Chu et al. SFT memorizes, RL generalizes: A comparative study of foundation model post-training. arXiv preprint arXiv:2501.17161, 2025.
  9. Alonzo Church. An unsolvable problem of elementary number theory. American Journal of Mathematics, 58(2):345–363, 1936.
  10. Ganqu Cui et al. The entropy mechanism of reinforcement learning for reasoning language models. arXiv preprint arXiv:2505.22617, 2025.
  11. Artur Back de Luca, George Giapitzakis, Shenghao Yang, Petar Veličković, and Kimon Fountoulakis. Positional attention: Expressivity and learnability of algorithmic computation. arXiv preprint arXiv:2410.01686, 2025.
  12. Leonardo de Moura and Sebastian Ullrich. The Lean 4 theorem prover and programming language. In Proceedings of the 28th International Conference on Automated Deduction (CADE), pages 625–635, 2021.
  13. DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
  14. El-Mahdi El-Mhamdi and Lê-Nguyên Hoang. On Goodhart’s law, with an application to value alignment. arXiv preprint arXiv:2410.09638, 2024.
  15. Charles Fefferman, Sanjoy Mitter, and Hariharan Narayanan. Testing the manifold hypothesis. Journal of the American Mathematical Society, 29(4):983–1049, 2016.
  16. Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A. Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673, 2020.
  17. E. Mark Gold. Language identification in the limit. Information and Control, 10(5):447–474, 1967.
  18. Yiding Jiang, Soufiane Hayou, and Andrew Gordon Wilson. Epiplexity: Structural information extractable by a computationally bounded observer. arXiv preprint arXiv:2507.02598, 2025.
  19. John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, Alex Bridgland, Clemens Meyer, Simon A. A. Kohl, Andrew J. Ballard, Andrew Cowie, Bernardino Romera-Paredes, Stanislav Nikolov, Rishub Jain, Jonas Adler, Trevor Back, Stig Petersen, David Reiman...
  20. Alkis Kalavasis, Anay Mehrotra, and Nikolas Velegkas. On the limits of language generation: Trade-offs between hallucination and mode collapse. In Proceedings of the 57th Annual ACM Symposium on Theory of Computing (STOC), 2025.
  21. Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
  22. Jacek Karwowski, Oliver Hayman, Xingjian Bai, Klaus Kiendlhofer, Charlie Griffin, and Joar Skalse. Goodhart’s law in reinforcement learning. arXiv preprint arXiv:2310.09144, 2024.
  23. Bobak T. Kiani, Jason Wang, and Melanie Weber. Hardness of learning neural networks under the manifold hypothesis. arXiv preprint arXiv:2406.01461, 2024.
  24. Jon Kleinberg and Sendhil Mullainathan. Language generation in the limit. arXiv preprint arXiv:2404.06757, 2024.
  25. Chinmay Li, Meghal Raman, and Ambuj Tewari. Generation through the lens of learning theory. In Proceedings of the 38th Annual Conference on Learning Theory (COLT), 2025.
  26. Zixi Li. Reasoning kingdom: Chapter 12 — implicit reasoning and the Yonglin formula, 2025. https://datawhalechina.github.io/reasoning-kingdom/chapter12
  27. Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. arXiv preprint arXiv:2305.20050, 2023.
  28. Gabriel Loaiza-Ganem, Brendan Leigh Ross, Rasa Hosseinzadeh, Anthony L. Caterini, and Jesse C. Cresswell. Deep generative models through the lens of the manifold hypothesis: A survey and new connections. arXiv preprint arXiv:2404.02954, 2024.
  29. David Manheim and Scott Garrabrant. Categorizing variants of Goodhart’s law. arXiv preprint arXiv:1803.04585, 2019.
  30. Juan Perdomo, Tijana Zrnic, Celestine Mendler-Dünner, and Moritz Hardt. Performative prediction. In Proceedings of the 37th International Conference on Machine Learning (ICML), pages 7599–7609, 2020.
  31. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. Technical report, OpenAI, 2019.
  32. Jorma Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, 1978.
  33. Dhruv Rohatgi and Dylan J. Foster. Necessary and sufficient oracles: Toward a computational taxonomy for reinforcement learning. arXiv preprint arXiv:2502.08632, 2025.
  34. Robert E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197–227, 1990.
  35. Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
  36. Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability, and uniform convergence. Journal of Machine Learning Research, 11:2635–2670, 2010.
  37. David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the g...
  38. Oren Sultan, Jordi Armengol-Estape, Pascal Kesseli, Julien Vanegue, Dafna Shahaf, Yossi Adi, and Peter O’Hearn. LLMs versus the halting problem: Revisiting program termination prediction. arXiv preprint arXiv:2601.18987, 2025.
  39. Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2018.
  40. SV-COMP. Competition on software verification and witness validation: Results, 2025. https://sv-comp.sosy-lab.org/2025/results/
  41. Alan M. Turing. On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, s2-42(1):230–265, 1936.
  42. David A. Turner. Total functional programming. Journal of Universal Computer Science, 10(7):751–768, 2004.
  43. Leslie G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
  44. Philip Wadler. Propositions as types. Communications of the ACM, 58:75–84, 2015.
  45. Charles L. Wang, Keir Dorchen, and Peter Jin. On the statistical limits of self-improving agents. arXiv preprint arXiv:2510.04399, 2025.
  46. Klaus Weihrauch. Computable Analysis: An Introduction. Texts in Theoretical Computer Science. Springer, 2000.
  47. Ke Xu and Wei Li. Exact phase transitions in random constraint satisfaction problems. Journal of Artificial Intelligence Research, 12:93–103, 2000.
  48. Yao Zhang et al. Stronger supervised checkpoints underperform after reinforcement learning. arXiv preprint arXiv:2602.01058, 2026.