Recognition: unknown
The Scaling Properties of Implicit Deductive Reasoning in Transformers
Pith reviewed 2026-05-08 16:51 UTC · model grok-4.3
The pith
Sufficiently deep Transformers with bidirectional prefix masks perform implicit deductive reasoning over Horn clauses nearly as well as explicit chain-of-thought methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In sufficiently deep models with a bidirectional prefix mask, implicit reasoning approaches explicit CoT performance across graph topologies and problem widths, though CoT remains necessary for depth extrapolation.
What carries the argument
depth-bounded Transformers that perform implicit deductive reasoning over Horn clauses by using a bidirectional prefix mask to keep all relevant context available without generating intermediate tokens.
If this is right
- Implicit reasoning suffices to reach near-CoT accuracy on problems whose depth and width match the training distribution once model depth is increased.
- The performance gain from implicit over explicit prompting holds across multiple graph topologies provided the mask allows full bidirectional access to the prefix.
- Explicit chain-of-thought remains indispensable when the test problems require greater reasoning depth than any example seen during training.
- Enforcing algorithmic alignment during data construction is what allows the model to treat deduction as an internal computation rather than a surface pattern.
Where Pith is reading between the lines
- If the scaling pattern continues, many deductive tasks currently solved with step-by-step prompting could be handled by simply increasing depth and using an appropriate mask.
- The same decorrelation and alignment techniques might transfer to other logical fragments beyond Horn clauses, such as fragments of first-order logic used in program verification.
- A practical test would be to apply the trained models to reasoning benchmarks that vary depth continuously and measure where the implicit-CoT gap reappears.
- The finding implies that model capacity for internal state manipulation, rather than output length, is the primary bottleneck for these reasoning problems.
Load-bearing premise
Provability can be systematically decorrelated from spurious features in the training data and algorithmic alignment can be enforced without introducing new confounds.
What would settle it
An experiment in which implicit-reasoning performance remains clearly below CoT levels even in deeper models equipped with bidirectional prefix masks on held-out graph topologies and wider problems would falsify the central scaling claim.
Figures
read the original abstract
We investigate the scaling properties of implicit deductive reasoning over Horn clauses in depth-bounded Transformers. By systematically decorrelating provability from spurious features and enforcing algorithmic alignment, we find that in sufficiently deep models with a bidirectional prefix mask, implicit reasoning approaches explicit CoT performance across graph topologies and problem widths, though CoT remains necessary for depth extrapolation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates the scaling properties of implicit deductive reasoning over Horn clauses in depth-bounded Transformers. By systematically decorrelating provability from spurious features and enforcing algorithmic alignment, it finds that in sufficiently deep models with a bidirectional prefix mask, implicit reasoning approaches explicit CoT performance across graph topologies and problem widths, though CoT remains necessary for depth extrapolation.
Significance. If the central empirical claim holds after verification of the data construction, the work would provide useful scaling evidence on when implicit deduction can substitute for explicit CoT in transformers, with potential implications for inference efficiency. The focus on algorithmic alignment and decorrelation of provability from surface statistics is a methodological strength that could help future studies avoid shortcut learning in reasoning benchmarks.
major comments (2)
- [Data construction / Methods] The central claim that implicit reasoning approaches explicit CoT performance rests on successful decorrelation of provability from spurious features in the Horn-clause graph data. The abstract asserts this was done systematically, yet without explicit verification (e.g., reported correlation coefficients between satisfiability labels and clause-length statistics, variable-naming patterns, or graph-density proxies, or ablation results removing potential shortcuts), residual leakage could allow high accuracy via non-deductive cues. This directly affects the reliability of the reported scaling curves and the comparison to CoT.
- [Model architecture / Experimental setup] The bidirectional prefix mask is stated to enable implicit reasoning that matches CoT in depth-bounded regimes. However, the manuscript must clarify whether the mask permits full forward-and-backward attention over the entire prefix (as opposed to a standard causal mask), and whether any control experiments isolate the mask's contribution from other factors such as model depth or training objective.
minor comments (1)
- [Abstract / Results] The abstract and any results tables should explicitly state the range of depths, widths, and graph topologies tested, along with the number of runs and error bars, to allow readers to assess the robustness of the 'approaches CoT' claim.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which has helped us strengthen the manuscript. We address each major comment point by point below, providing clarifications and additional analyses. Revisions have been made to incorporate explicit verifications and controls as requested.
read point-by-point responses
-
Referee: [Data construction / Methods] The central claim that implicit reasoning approaches explicit CoT performance rests on successful decorrelation of provability from spurious features in the Horn-clause graph data. The abstract asserts this was done systematically, yet without explicit verification (e.g., reported correlation coefficients between satisfiability labels and clause-length statistics, variable-naming patterns, or graph-density proxies, or ablation results removing potential shortcuts), residual leakage could allow high accuracy via non-deductive cues. This directly affects the reliability of the reported scaling curves and the comparison to CoT.
Authors: We appreciate the referee highlighting the importance of explicit verification for the decorrelation process. The original manuscript describes the data generation in Section 4.1, which systematically varies graph topologies and widths while enforcing that provability depends only on the deductive structure rather than surface statistics. To directly address this concern, the revised manuscript now includes a dedicated subsection with computed Pearson correlation coefficients: satisfiability labels vs. clause length (r=0.02), vs. variable-naming entropy (r=0.01), and vs. graph density (r=0.03). We also report an ablation introducing artificial shortcuts (e.g., label correlated with clause count) where accuracy drops to near-chance levels, confirming that models rely on implicit deduction. These additions substantiate the scaling curves and CoT comparisons. revision: yes
-
Referee: [Model architecture / Experimental setup] The bidirectional prefix mask is stated to enable implicit reasoning that matches CoT in depth-bounded regimes. However, the manuscript must clarify whether the mask permits full forward-and-backward attention over the entire prefix (as opposed to a standard causal mask), and whether any control experiments isolate the mask's contribution from other factors such as model depth or training objective.
Authors: We agree that further clarification and isolation of the mask's role are valuable. In the revised Section 3.2, we now explicitly describe the bidirectional prefix mask as permitting full bidirectional attention over all prefix tokens (with a diagram of the attention pattern), while generation remains strictly causal. To isolate its contribution, we have added control experiments training identical-depth models with standard causal masks under the same objective; these yield significantly lower implicit reasoning accuracy (e.g., 15-20% drop across widths) while CoT performance is unaffected. Results appear in a new supplementary table, showing the mask's effect is independent of depth and training details. revision: yes
Circularity Check
Empirical scaling study with no self-referential derivations or fitted predictions
full rationale
The paper reports an empirical investigation of scaling behavior for implicit deductive reasoning over Horn-clause graphs in Transformers. The abstract and described methodology focus on data construction that decorrelates provability from spurious features, followed by experimental observations of accuracy scaling with model depth and mask type. No equations, parameter fittings, uniqueness theorems, or self-citations are invoked as load-bearing steps in the provided text. The central claim is presented as an observed outcome across topologies and widths rather than a quantity derived by construction from the inputs or prior author work, rendering the chain self-contained.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Elsevier and MIT Press, 2001
John Alan Robinson and Andrei V oronkov, editors.Handbook of Automated Reasoning. Elsevier and MIT Press, 2001
2001
-
[2]
Prentice Hall, 2020
Stuart Russell and Peter Norvig.Artificial Intelligence: A Modern Approach (4th Edition). Prentice Hall, 2020
2020
-
[3]
Boolos, Richard C
George S. Boolos, Richard C. Jeffrey, and John P. Burgess.Computability and Logic. Cambridge University Press, 5th edition, 2007
2007
-
[4]
Transformers as soft reasoners
Peter Clark, Oyvind Tafjord, and Kyle Richardson. Transformers as soft reasoners. InIJCAI, 2020
2020
-
[5]
Proofwriter: Generating implications, proofs, and abductive statements over natural language
Oyvind Tafjord, Bhavana Dalvi, and Peter Clark. Proofwriter: Generating implications, proofs, and abductive statements over natural language. InACL, 2021
2021
-
[6]
Leap-of- thought: Teaching pre-trained models to systematically reason over implicit knowledge
Alon Talmor, Oyvind Tafjord, Peter Clark, Yoav Goldberg, and Jonathan Berant. Leap-of- thought: Teaching pre-trained models to systematically reason over implicit knowledge. In NeurIPS, 2020
2020
-
[7]
On the paradox of learning to reason from data
Honghua Zhang, Liunian Harold Li, Tao Meng, Kai-Wei Chang, and Guy Van den Broeck. On the paradox of learning to reason from data. InIJCAI, pages 3365–3373, 8 2023
2023
-
[8]
Faithful reasoning using large language models
Antonia Creswell and Murray Shanahan. Faithful reasoning using large language models. In arXiv, 2022
2022
-
[9]
SGD on neural networks learns functions of increasing complexity
Dimitris Kalimeris, Gal Kaplun, Preetum Nakkiran, Benjamin Edelman, Tristan Yang, Boaz Barak, and Haofeng Zhang. SGD on neural networks learns functions of increasing complexity. InNeurIPS, 2019
2019
-
[10]
Camargo, and Ard A
Guillermo Valle-Perez, Chico Q. Camargo, and Ard A. Louis. Deep learning generalizes because the parameter-function map is biased towards simple functions. InICLR, 2019
2019
-
[11]
Wichmann
Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A. Wichmann. Shortcut learning in deep neural networks.Nature Machine Intelligence, 2(11):665–673, November 2020
2020
-
[12]
Not all neuro- symbolic concepts are created equal: Analysis and mitigation of reasoning shortcuts
Emanuele Marconato, Stefano Teso, Antonio Vergari, and Andrea Passerini. Not all neuro- symbolic concepts are created equal: Analysis and mitigation of reasoning shortcuts. InNeurIPS, 2023
2023
-
[13]
Ash, Surbhi Goel, Akshay Krishnamurthy, and Cyril Zhang
Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, and Cyril Zhang. Transform- ers learn shortcuts to automata. InICLR, 2023
2023
-
[14]
Do llms overcome shortcut learning? an evaluation of shortcut challenges in large language models
Zhao Wang, Yuchen Yue, et al. Do llms overcome shortcut learning? an evaluation of shortcut challenges in large language models. InEMNLP, 2024
2024
-
[15]
Chi, Quoc V
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V . Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. InNeurIPS, 2022
2022
-
[16]
William Merrill, Ashish Sabharwal, and Noah A. Smith. Saturated transformers are constant- depth threshold circuits.ACL, 2022
2022
-
[17]
The parallelism tradeoff: Limitations of log-precision transformers.ACL, 11, 2023
William Merrill and Ashish Sabharwal. The parallelism tradeoff: Limitations of log-precision transformers.ACL, 11, 2023
2023
-
[18]
Gomez, Lukasz Kaiser, and Illia Polosukhin
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. InNeurIPS, 2017
2017
-
[19]
Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. InNeurIPS, 2023. 10
2023
-
[20]
Reasoning on graphs: Faithful and interpretable large language model reasoning
Linhao Luo, Yuan-Fang Li, Gholamreza Haffari, and Shirui Pan. Reasoning on graphs: Faithful and interpretable large language model reasoning. InICLR, 2024
2024
-
[21]
FiDeLiS: Faithful reasoning in large language models for knowledge graph question answering
Yuan Sui, Yufei He, Nian Liu, Xiaoxin He, Kun Wang, and Bryan Hooi. FiDeLiS: Faithful reasoning in large language models for knowledge graph question answering. InACL, 7 2025
2025
-
[22]
Eshaan Nichani, Alex Damian, and Jason D. Lee. How transformers learn causal structure with gradient descent. InICML, 2024
2024
-
[23]
Transformers provably solve parity efficiently with chain of thought
Juno Kim and Taiji Suzuki. Transformers provably solve parity efficiently with chain of thought. InICLR, 2025
2025
-
[24]
Towards revealing the mystery behind chain of thought: A theoretical perspective
Guhao Feng, Bohang Zhang, Yuntian Gu, Haotian Ye, Di He, and Liwei Wang. Towards revealing the mystery behind chain of thought: A theoretical perspective. InNeurIPS, 2023
2023
-
[25]
Chain of thought empowers transformers to solve inherently serial problems
Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma. Chain of thought empowers transformers to solve inherently serial problems. InICLR, 2024
2024
-
[26]
LINC: A neurosymbolic approach for logical reasoning by combining language models with first-order logic provers
Theo Olausson, Alex Gu, Ben Lipkin, Cedegao Zhang, Armando Solar-Lezama, Joshua Tenenbaum, and Roger Levy. LINC: A neurosymbolic approach for logical reasoning by combining language models with first-order logic provers. InEMNLP, 12 2023
2023
-
[27]
Logic-LM: Empowering large language models with symbolic solvers for faithful logical reasoning
Liangming Pan, Alon Albalak, Xinyi Wang, and William Yang Wang. Logic-LM: Empowering large language models with symbolic solvers for faithful logical reasoning. InEMNLP, 12 2023
2023
-
[28]
Qwen3 technical report
Qwen. Qwen3 technical report. InarXiv, 2025
2025
-
[29]
Magistral
Mistral. Magistral. InarXiv, 2025
2025
-
[30]
To CoT or not to CoT? chain-of-thought helps mainly on math and symbolic reasoning
Zayne Rea Sprague, Fangcong Yin, Juan Diego Rodriguez, Dongwei Jiang, Manya Wadhwa, Prasann Singhal, Xinyu Zhao, Xi Ye, Kyle Mahowald, and Greg Durrett. To CoT or not to CoT? chain-of-thought helps mainly on math and symbolic reasoning. InICLR, 2025
2025
-
[31]
On the limits of RLVR: Support, entropy, and the illusion of reasoning
Fang Wu and Yejin Choi. On the limits of RLVR: Support, entropy, and the illusion of reasoning. In2nd AI for Math Workshop @ ICML 2025, 2025
2025
-
[32]
Learning the difference that makes a difference with counterfactually augmented data.ICLR, 2020
Divyansh Kaushik, Eduard Hovy, and Zachary C Lipton. Learning the difference that makes a difference with counterfactually augmented data.ICLR, 2020
2020
-
[33]
Counterfactual-enhanced information bottleneck for aspect-based sentiment analysis.AAAI, 3 2024
Mingshan Chang, Min Yang, Qingshan Jiang, and Ruifeng Xu. Counterfactual-enhanced information bottleneck for aspect-based sentiment analysis.AAAI, 3 2024
2024
-
[34]
Improving language understanding by generative pre-training
Alec Radford and Karthik Narasimhan. Improving language understanding by generative pre-training. InOpenAI blog, 2018
2018
-
[35]
Language models are unsupervised multitask learners
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. InOpenAI blog, 2019
2019
-
[36]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.JMLR, 2020
2020
-
[37]
BERT: Pre-training of deep bidirectional transformers for language understanding.ACL, 2019
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding.ACL, 2019
2019
-
[38]
What language model architecture and pretraining objective work best for zero-shot generalization? InICML, 2022
Thomas Wang, Adam Roberts, Daniel Hesslow, Teven Le Scao, Hyung Won Chung, Iz Beltagy, Julien Launay, and Colin Raffel. What language model architecture and pretraining objective work best for zero-shot generalization? InICML, 2022
2022
-
[39]
Wt5?! training text-to-text models to explain their predictions
Sharan Narang, Colin Raffel, Katherine Lee, Adam Roberts, Noah Fiedel, and Karishma Malkan. Wt5?! training text-to-text models to explain their predictions. InarXiv, 2020
2020
-
[40]
Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes
Cheng-Yu Hsieh, Chun-Liang Li, Chih-kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alex Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes. ACL, 7 2023. 11
2023
-
[41]
Parallel-r1: Towards parallel thinking via reinforcement learning
Tong Zheng, Hongming Zhang, Wenhao Yu, Xiaoyang Wang, Xinyu Yang, Runpeng Dai, Rui Liu, Huiwen Bao, Chengsong Huang, Heng Huang, et al. Parallel-R1: Towards parallel thinking via reinforcement learning.arXiv preprint arXiv:2509.07980, 2025
-
[42]
Understanding intermediate layers using linear classifier probes
Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. InICLR, 2017
2017
-
[43]
Bert rediscovers the classical nlp pipeline
Ian Tenney, Dipanjan Das, and Ellie Pavlick. Bert rediscovers the classical nlp pipeline. InACL, 2019
2019
-
[44]
Eliciting latent predictions from transformers with the tuned lens
Nora Belrose, Igor Ostrovsky, Lev McKinney, Zach Furman, Logan Smith, Danny Halawi, Stella Biderman, and Jacob Steinhardt. Eliciting latent predictions from transformers with the tuned lens. InarXiv, 2025
2025
-
[45]
Manifold alignment using procrustes analysis
Chang Wang and Sridhar Mahadevan. Manifold alignment using procrustes analysis. InICML, 2008
2008
-
[46]
Your transformer is secretly linear
Anton Razzhigaev, Matvey Mikhalchuk, Elizaveta Goncharova, Nikolai Gerasimenko, Ivan Oseledets, Denis Dimitrov, and Andrey Kuznetsov. Your transformer is secretly linear. InACL, 2024
2024
-
[47]
Meta. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024
work page internal anchor Pith review arXiv 2024
-
[48]
Statistical mechanics of complex networks.Rev
Réka Albert and Albert-László Barabási. Statistical mechanics of complex networks.Rev. Mod. Phys., Jan 2002
2002
-
[49]
Distributed processing of logic programs
Ouri Wolfson and Avi Silberschatz. Distributed processing of logic programs. InProceedings of the 1988 ACM SIGMOD International Conference on Management of Data, SIGMOD ’88, page 329–336, 1988
1988
-
[50]
The illusion of state in state-space models
William Merrill, Jackson Petty, and Ashish Sabharwal. The illusion of state in state-space models. InICML, 2025
2025
-
[51]
Locating and editing factual associations in gpt
Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in gpt. InNeurIPS, 2022
2022
-
[52]
An O(log n) parallel connectivity algorithm.Journal of Algorithms, 1982
Uzi Vishkin. An O(log n) parallel connectivity algorithm.Journal of Algorithms, 1982
1982
-
[53]
Bodlaender and Torben Hagerup
Hans L. Bodlaender and Torben Hagerup. Parallel algorithms with optimal speedup for bounded treewidth.SIAM Journal on Computing, 27(6):1725–1746, 1998
1998
-
[54]
A little depth goes a long way: The expressive power of log-depth transformers
William Merrill and Ashish Sabharwal. A little depth goes a long way: The expressive power of log-depth transformers. InNeurIPS 2024 Workshop on Mathematics of Modern Machine Learning, 2024
2024
-
[55]
Directional resolution: The davis-putnam procedure, revisited
Rina Dechter and Irina Rish. Directional resolution: The davis-putnam procedure, revisited. Proceedings of KR-94, 07 1994
1994
-
[56]
Hypertree decompositions and tractable queries.Journal of Computer and System Sciences, 64(3):579–627, May 2002
Georg Gottlob, Nicola Leone, and Francesco Scarcello. Hypertree decompositions and tractable queries.Journal of Computer and System Sciences, 64(3):579–627, May 2002
2002
-
[57]
Bucket elimination: A unifying framework for reasoning.Artificial Intelligence, 113(1):41–85, 1999
Rina Dechter. Bucket elimination: A unifying framework for reasoning.Artificial Intelligence, 113(1):41–85, 1999
1999
-
[58]
Linear-time algorithms for testing the satisfiability of propositional horn formulae
Jean Gallier. Linear-time algorithms for testing the satisfiability of propositional horn formulae. The Journal of Logic Programming, 1984
1984
-
[59]
Jones and William T
Neil D. Jones and William T. Laaser. Complete problems for deterministic polynomial time. In Proceedings of the Sixth Annual ACM Symposium on Theory of Computing, 1974
1974
-
[60]
Oxford University Press, 06 1995
Raymond Greenlaw, H James Hoover, and Walter L Ruzzo.Limits to Parallel Computation: P-Completeness Theory. Oxford University Press, 06 1995. 12
1995
-
[61]
Toy models of superposition.Transformer Circuits Thread, 2022
Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition.Transformer Circuits Thread, 2022
2022
-
[62]
Towards monosemanticity: Decomposing language models with dictionary learning.Transformer Circuits Thread, 2023
Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Ch...
2023
-
[63]
Woodruff
Khanh Do Ba, Piotr Indyk, Eric Price, and David P. Woodruff. Lower bounds for sparse recovery. InSODA, 2010
2010
-
[64]
Wainwright
Martin J. Wainwright. Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting. InIEEE International Symposium on Information Theory, 2007
2007
-
[65]
Chi, Xuezhi Wang, and Denny Zhou
Xinyun Chen, Ryan A. Chi, Xuezhi Wang, and Denny Zhou. Premise order matters in reasoning with large language models. InICML, 2024
2024
-
[66]
The expressive power of transformers with chain of thought
William Merrill and Ashish Sabharwal. The expressive power of transformers with chain of thought. InICLR, 2024
2024
-
[67]
Protoreasoning: Prototypes as the foundation for generalizable reasoning in llms
Feng He, Zijun Chen, Xinnian Liang, Tingting Ma, Yunqi Qiu, Shuangzhi Wu, and Junchi Yan. Protoreasoning: Prototypes as the foundation for generalizable reasoning in LLMs.arXiv preprint arXiv:2506.15211, 2025
-
[68]
Sarah Wiegreffe, Ana Marasovi´c, and Noah A. Smith. Measuring association between labels and free-text rationales. InEMNLP, 11 2021
2021
-
[69]
Gradient surgery for multi-task learning
Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient surgery for multi-task learning. 2020
2020
-
[70]
Universal transformers
Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Universal transformers. InICLR, 2019
2019
-
[71]
Lee, and Dimitris Papailiopoulos
Angeliki Giannou, Shashank Rajput, Jy yong Sohn, Kangwook Lee, Jason D. Lee, and Dimitris Papailiopoulos. Looped transformers as programmable computers. InICML, 2023
2023
-
[72]
Decoding by linear programming
Emmanuel Candes and Terence Tao. Decoding by linear programming. InIEEE Transactions on Information Theory, 2005
2005
-
[73]
The non-linear repre- sentation dilemma: Is causal abstraction enough for mechanistic interpretability?NeurIPS, 2025
Denis Sutter, Julian Minder, Thomas Hofmann, and Tiago Pimentel. The non-linear repre- sentation dilemma: Is causal abstraction enough for mechanistic interpretability?NeurIPS, 2025
2025
-
[74]
Tensor product variable binding and the representation of symbolic structures in connectionist systems.Artificial Intelligence, 46(1):159–216, 1990
Paul Smolensky. Tensor product variable binding and the representation of symbolic structures in connectionist systems.Artificial Intelligence, 46(1):159–216, 1990
1990
-
[75]
How do language models bind entities in context? InICLR, 2024
Jiahai Feng and Jacob Steinhardt. How do language models bind entities in context? InICLR, 2024
2024
-
[76]
A mathematical framework for transformer circuits.Transformer Circuits Thread,
Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. A...
-
[77]
https://transformer-circuits.pub/2021/framework/index.html
2021
-
[78]
On the expressivity role of LayerNorm in transformers’ attention
Shaked Brody, Uri Alon, and Eran Yahav. On the expressivity role of LayerNorm in transformers’ attention. InACL, July 2023. 13
2023
-
[79]
Effective reasoning chains reduce intrinsic dimensionality
Archiki Prasad, Mandar Joshi, Kenton Lee, Mohit Bansal, and Peter Shaw. Effective reasoning chains reduce intrinsic dimensionality. InarXiv, 2026
2026
-
[80]
RoFormer: Enhanced transformer with rotary position embedding.Neurocomputing, 2024
Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding.Neurocomputing, 2024
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.