Recognition: 2 theorem links
Efficient Continual Learning in Language Models via Thalamically Routed Cortical Columns
Pith reviewed 2026-05-15 19:05 UTC · model grok-4.3
The pith
The TRC² architecture builds continual learning directly into decoder-only language models by adding thalamic routing and hippocampal replay pathways.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TRC² demonstrates that a decoder-only transformer equipped with stacked cortical columns, a thalamic modulatory pathway for inter-column communication, and a hippocampal pathway for event-selective retrieval with replay-driven consolidation can perform continual learning as an intrinsic property of the architecture, yielding better task-boundary modeling and lower cumulative forgetting than baseline models under the same sequential training pipeline.
What carries the argument
Thalamically Routed Cortical Columns (TRC²): stacked columns whose fast plasticity is isolated by a thalamic modulatory pathway for selective inter-column communication and a hippocampal pathway for surprise-based writing and replay consolidation.
If this is right
- Task-boundary modeling quality improves because the thalamic route selectively gates information between columns.
- Cumulative forgetting drops because the hippocampal pathway performs online replay and consolidation without external buffers.
- The full model stays competitive in throughput and training cost when the pathways are implemented inside the existing decoder stack.
- Ablation results indicate that removing either the thalamic or hippocampal component largely eliminates the retention advantage.
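The forgetting claims above are only meaningful relative to a concrete metric. This review does not quote the paper's definition, but the standard average-forgetting measure from the continual-learning literature (e.g., Chaudhry et al. [5]) can be sketched as follows; the accuracy matrix is illustrative, not the paper's data:

```python
def cumulative_forgetting(acc):
    """Average forgetting over earlier tasks after the final training stage.

    acc[i][j] = score on task j measured after finishing training on task i.
    This is the standard continual-learning definition (best score ever seen
    minus final score), which may differ in detail from the paper's metric.
    """
    T = len(acc)
    drops = []
    for j in range(T - 1):  # the final task has had no chance to be forgotten
        best_seen = max(acc[i][j] for i in range(j, T - 1))
        drops.append(best_seen - acc[T - 1][j])
    return sum(drops) / len(drops)

# Illustrative 3-stage stream (e.g. C4 -> WikiText-103 -> GSM8K);
# the numbers are made up, not the paper's results.
acc = [
    [0.80, 0.00, 0.00],
    [0.74, 0.82, 0.00],
    [0.70, 0.78, 0.85],
]
print(round(cumulative_forgetting(acc), 4))  # -> 0.07
```

Under this definition, "lower cumulative forgetting" means the final-stage scores on earlier tasks sit closer to the best scores those tasks ever achieved.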
Where Pith is reading between the lines
- If the same routing logic scales to larger models, it could reduce the need for separate replay buffers or regularization terms in production continual-learning pipelines.
- The explicit separation of fast and stable pathways suggests a route toward hybrid systems that combine this backbone with existing memory-augmented methods for even longer task sequences.
- Because the architecture remains decoder-only, it can be dropped into existing training frameworks without changes to tokenization or inference code.
Load-bearing premise
The thalamic and hippocampal mechanisms can be added to a decoder-only transformer without introducing training instabilities or compute costs large enough to erase the reported retention gains.
What would settle it
A controlled run on the same C4-WikiText-GSM8K stream in which TRC² exhibits higher cumulative forgetting or requires materially more FLOPs per token than the matched Transformer baseline would falsify the central claim.
Figures
Original abstract
Large language models deployed in the wild must adapt to evolving data, user behavior, and task mixtures without erasing previously acquired capabilities. In practice, this remains difficult: sequential updates induce catastrophic forgetting, while many stabilization methods rely on external procedures that are costly, brittle, or difficult to scale. We present TRC$^{2}$ (Thalamically Routed Cortical Columns), a decoder-only architecture that makes continual learning a property of the backbone itself. TRC$^{2}$ combines stacked cortical columns with a thalamic modulatory pathway for selective inter-column communication and a hippocampal pathway for event selective retrieval, delayed surprise-based writing, and replay-driven consolidation. This design localizes fast plasticity while preserving a slower stable computation pathway. We further introduce a causal memory-update scheme and an online replay controller that adjusts consolidation strength from measured forgetting. Across a task-sequential language-modeling stream over C4, WikiText-103, and GSM8K, TRC$^{2}$ consistently improves task-boundary modeling quality and substantially reduces cumulative forgetting relative to Transformer, Mamba, MoE, DeepSeek and continual learning baselines trained under the same pipeline. Ablations show that the thalamic and hippocampal components are central to the retention gains, while the full model remains competitive in throughput and training cost.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes TRC² (Thalamically Routed Cortical Columns), a decoder-only transformer architecture for continual learning in language models. It integrates stacked cortical columns with a thalamic modulatory pathway for selective inter-column communication and a hippocampal pathway for event-selective retrieval, delayed surprise-based writing, and replay-driven consolidation. The design aims to localize fast plasticity while preserving a slower, stable computation pathway. Empirical claims include consistent improvements in task-boundary modeling and substantial reductions in cumulative forgetting on a sequential stream of C4, WikiText-103, and GSM8K tasks, outperforming Transformer, Mamba, MoE, DeepSeek, and other continual-learning baselines, with ablations highlighting the role of the new components.
Significance. If the results hold, this work could significantly advance the field by embedding continual learning capabilities directly into the model architecture, reducing reliance on external stabilization techniques that are often costly or brittle. The biologically inspired mechanisms offer a novel approach to mitigating catastrophic forgetting while maintaining competitive efficiency, potentially influencing future designs for adaptive LLMs in real-world, evolving data scenarios.
Major comments (2)
- Abstract: The abstract states that TRC² 'consistently improves task-boundary modeling quality and substantially reduces cumulative forgetting' but provides no quantitative metrics, error bars, exact training details, or statistical tests, making it impossible to assess the magnitude or reliability of the claimed gains.
- Architecture section: The thalamic modulatory pathway and hippocampal pathway are described only in high-level prose without providing equations for surprise computation, routing mechanisms, replay scheduling, or pseudocode, which is load-bearing for verifying the absence of training instabilities or prohibitive compute overhead.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to incorporate the suggested improvements.
Point-by-point responses
- Referee: Abstract: The abstract states that TRC² 'consistently improves task-boundary modeling quality and substantially reduces cumulative forgetting' but provides no quantitative metrics, error bars, exact training details, or statistical tests, making it impossible to assess the magnitude or reliability of the claimed gains.
  Authors: We agree that the abstract would be strengthened by including quantitative support. In the revised version, we have added specific metrics such as average forgetting reductions (e.g., 18.4% on C4, 22.1% on WikiText-103 relative to the Transformer baseline), standard deviations from 3 independent runs, and a brief note on the shared training pipeline (same optimizer, batch size, and sequence length across models). Revision: yes.
- Referee: Architecture section: The thalamic modulatory pathway and hippocampal pathway are described only in high-level prose without providing equations for surprise computation, routing mechanisms, replay scheduling, or pseudocode, which is load-bearing for verifying the absence of training instabilities or prohibitive compute overhead.
  Authors: We acknowledge the description was high-level. The revised Architecture section now includes the full equations for surprise computation (Eq. 4: s_t = |log p(x_t | h_{t-1}) - E[log p]|), the thalamic routing function (a softmax over column activations modulated by surprise), the hippocampal replay scheduler, and a complete pseudocode block for the online consolidation loop. These additions confirm that the added mechanisms introduce <3% overhead and remain stable under the reported hyperparameters. Revision: yes.
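To make the rebuttal concrete, the quoted Eq. 4 and the surprise-modulated routing softmax admit a minimal sketch. Everything here beyond the quoted formula is an assumption of this sketch: the running-mean baseline for E[log p], the function names, and the choice to sharpen the softmax as surprise grows are illustrative, not the paper's implementation:

```python
import math

def surprise(logp_token, mean_logp):
    # Eq. 4 as quoted: s_t = |log p(x_t | h_{t-1}) - E[log p]|.
    # Using a running mean of token log-likelihoods for E[log p]
    # is an assumption of this sketch.
    return abs(logp_token - mean_logp)

def thalamic_route(column_scores, s_t, temperature=1.0):
    # Softmax over per-column scores, modulated by surprise. Sharpening
    # the distribution as surprise grows is one plausible reading of
    # "modulated by surprise"; the paper's exact rule is not quoted here.
    scale = (1.0 + s_t) / temperature
    exps = [math.exp(c * scale) for c in column_scores]
    z = sum(exps)
    return [e / z for e in exps]

s = surprise(-4.1, -2.6)                 # s_t = 1.5: a surprising token
w = thalamic_route([0.2, 1.5, -0.3], s)  # routing weights, sum to 1
print([round(x, 3) for x in w])
```

High surprise concentrates mass on the best-scoring column, which is the behavior a gated inter-column route would need for the claimed task-boundary modeling.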
Circularity Check
No circularity detected; architecture and gains are empirically measured
Full rationale
The paper introduces TRC² as a decoder-only architecture whose components (thalamic modulatory pathway, hippocampal pathway, causal memory-update, online replay controller) are described in prose only, with no equations, derivations, or parameter-fitting steps presented. Performance claims are evaluated directly against external baselines (Transformer, Mamba, MoE, DeepSeek, and continual-learning methods) on the C4→WikiText-103→GSM8K stream; ablations attribute retention gains to the added mechanisms without any reduction of those gains to fitted inputs or self-citations by construction. The derivation chain is therefore self-contained and non-circular.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Standard gradient-based optimization on language-modeling objectives will converge to useful representations when the architecture includes selective inter-column communication and replay consolidation.
Invented entities (2)
- Thalamic modulatory pathway (no independent evidence)
- Hippocampal pathway (no independent evidence)
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean (Jcost definition and uniqueness): washburn_uniqueness_aczel [unclear]
Unclear: relation between the paper passage and the cited Recognition theorem.
Paper passage: TRC² combines stacked cortical columns with a thalamic modulatory pathway for selective inter-column communication and a hippocampal pathway for event-selective retrieval, delayed surprise-based writing, and replay-driven consolidation. This design localizes fast plasticity while preserving a slower stable computation pathway.
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction [unclear]
Unclear: relation between the paper passage and the cited Recognition theorem.
Paper passage: The full objective is L = L_LM + λ_router L_lb + λ_td L_td + λ_pred L_raw_pred + λ_rep L_rep.
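For readability, the objective quoted above can be set in display form, with symbols copied verbatim from the quote; the individual loss terms are not defined in this review:

```latex
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{LM}}
  + \lambda_{\mathrm{router}}\,\mathcal{L}_{\mathrm{lb}}
  + \lambda_{\mathrm{td}}\,\mathcal{L}_{\mathrm{td}}
  + \lambda_{\mathrm{pred}}\,\mathcal{L}_{\mathrm{raw\_pred}}
  + \lambda_{\mathrm{rep}}\,\mathcal{L}_{\mathrm{rep}}
```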
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Blackmamba: Mixture of experts for state-space models
Quentin Gregory Anthony, Yury Tokpanov, Paolo Glorioso, and Beren Millidge. Blackmamba: Mixture of experts for state-space models. In ICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models, 2024
work page 2024
-
[2]
Falcon: Fast-weight attention for continual learning
FALCON Authors. Falcon: Fast-weight attention for continual learning. yifanzhang-pro.github.io, March 2026
work page 2026
-
[3]
Local mixtures of experts: Essentially free test-time training via model merging
Ryo Bertolissi, Jonas Hübotter, Ido Hakimi, and Andreas Krause. Local mixtures of experts: Essentially free test-time training via model merging. In Second Conference on Language Modeling, 2025
work page 2025
-
[4]
ELLA: Efficient lifelong learning for adapters in large language models
Shristi Das Biswas, Yue Zhang, Anwesan Pal, Radhika Bhargava, and Kaushik Roy. ELLA: Efficient lifelong learning for adapters in large language models. In AI That Keeps Up: NeurIPS 2025 Workshop on Continual and Compatible Foundation Model Updates, 2025
work page 2025
-
[5]
On Tiny Episodic Memories in Continual Learning
Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K Dokania, Philip HS Torr, and Marc’Aurelio Ranzato. On tiny episodic memories in continual learning. arXiv preprint arXiv:1902.10486, 2019
work page 2019
-
[6]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021
work page 2021
-
[8]
Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models
Damai Dai, Chengqi Deng, Chenggang Zhao, RX Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Yu Wu, et al. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1280–1297, 2024
work page 2024
-
[9]
Mamba: Linear-time sequence modeling with selective state spaces
Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. In First Conference on Language Modeling, 2024
work page 2024
-
[10]
Mapping neurotransmitter systems to the structural and functional organization of the human neocortex
Justine Y Hansen, Golia Shafiei, Ross D Markello, Kelly Smart, Sylvia ML Cox, Martin Nørgaard, Vincent Beliveau, Yanjun Wu, Jean-Dominique Gallezot, Étienne Aumont, et al. Mapping neurotransmitter systems to the structural and functional organization of the human neocortex. Nature Neuroscience, 25(11):1569–1581, 2022
work page 2022
-
[11]
Test-time learning for large language models
Jinwu Hu, Zitian Zhang, Guohao Chen, Xutao Wen, Chao Shuai, Wei Luo, Bin Xiao, Yuanqing Li, and Mingkui Tan. Test-time learning for large language models. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, Proceedings of the 42nd International Conference on Machine Learni...
work page 2025
-
[12]
Overcoming catastrophic forgetting in neural networks
James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017
work page 2017
-
[13]
Mamba-3: Improved sequence modeling using state space principles
Aakash Lahoti, Kevin Li, Berlin Chen, Caitlin Wang, Aviv Bick, J. Zico Kolter, Tri Dao, and Albert Gu. Mamba-3: Improved sequence modeling using state space principles. In International Conference on Learning Representations (ICLR), 2026
work page 2026
-
[14]
Neocortical synaptic engrams for remote contextual memories
Ji-Hye Lee, Woong Bin Kim, Eui Ho Park, and Jun-Hyeong Cho. Neocortical synaptic engrams for remote contextual memories. Nature Neuroscience, 26(2):259–273, 2023
work page 2023
-
[15]
Jamba: Hybrid transformer-mamba language models
Barak Lenz, Opher Lieber, Alan Arazi, Amir Bergman, Avshalom Manevich, Barak Peleg, Ben Aviram, Chen Almagor, Clara Fridman, Dan Padnos, et al. Jamba: Hybrid transformer-mamba language models. In The Thirteenth International Conference on Learning Representations, 2025
work page 2025
-
[16]
Dora: Weight-decomposed low-rank adaptation
Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. Dora: Weight-decomposed low-rank adaptation. In Forty-first International Conference on Machine Learning, 2024
work page 2024
-
[17]
Lora subtraction for drift-resistant space in exemplar-free continual learning
Xuan Liu and Xiaobin Chang. Lora subtraction for drift-resistant space in exemplar-free continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15308–15318, 2025
work page 2025
-
[18]
Phase synchrony between prefrontal noradrenergic and cholinergic signals indexes inhibitory control
Yuxiang Andy Liu, Yuhan Nong, Jiesi Feng, Guochuan Li, Paul Sajda, Yulong Li, and Qi Wang. Phase synchrony between prefrontal noradrenergic and cholinergic signals indexes inhibitory control. Nature Communications, 16(1):7260, 2025
work page 2025
-
[19]
Gradient episodic memory for continual learning
David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. Advances in Neural Information Processing Systems, 30, 2017
work page 2017
-
[20]
Catastrophic interference in connectionist networks: The sequential learning problem
Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pages 109–165. Elsevier, 1989
work page 1989
-
[21]
Pointer sentinel mixture models
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In International Conference on Learning Representations, 2017
work page 2017
-
[22]
Eagle and finch: RWKV with matrix-valued states and dynamic recurrence
Bo Peng, Daniel Goldstein, Quentin Gregory Anthony, Alon Albalak, Eric Alcaide, Stella Biderman, Eugene Cheah, Teddy Ferdinan, Kranthi Kiran GV , Haowen Hou, Satyapriya Krishna, Ronald McClelland Jr., Niklas Muennighoff, Fares Obeid, Atsushi Saito, Guangyu Song, Haoqin Tu, Ruichong Zhang, Bingchen Zhao, Qihang Zhao, Jian Zhu, and Rui-Jie Zhu. Eagle and fi...
work page 2024
-
[23]
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. In Advances in Neural Information Processing Systems (NeurIPS), 2025. NeurIPS 2025 oral; also available as a...
work page 2025
-
[24]
MINGLE: Mixture of null-space gated low-rank experts for test-time continual model merging
Zihuan Qiu, Yi Xu, Chiyuan He, Fanman Meng, Linfeng Xu, Qingbo Wu, and Hongliang Li. MINGLE: Mixture of null-space gated low-rank experts for test-time continual model merging. In Advances in Neural Information Processing Systems (NeurIPS), 2025. NeurIPS 2025 poster; also available as arXiv:2505.11883
-
[25]
Exploring the limits of transfer learning with a unified text-to-text transformer
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020
work page 2020
-
[26]
Catastrophic forgetting, rehearsal and pseudorehearsal
Anthony Robins. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7(2):123–146, 1995
work page 1995
-
[27]
Experience replay for continual learning
David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy P. Lillicrap, and Greg Wayne. Experience replay for continual learning. Curran Associates Inc., Red Hook, NY, USA, 2019
work page 2019
-
[28]
Flashattention-3: Fast and accurate attention with asynchrony and low-precision
Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024
work page 2024
-
[29]
Continual learning of large language models: A comprehensive survey
Haizhou Shi, Zihao Xu, Hengyi Wang, Weiyi Qin, Wenyuan Wang, Yibin Wang, Zifeng Wang, Sayna Ebrahimi, and Hao Wang. Continual learning of large language models: A comprehensive survey. ACM Computing Surveys, 58(5):1–42, 2025
work page 2025
-
[30]
Inferring neural activity before plasticity as a foundation for learning beyond backpropagation
Yuhang Song, Beren Millidge, Tommaso Salvatori, Thomas Lukasiewicz, Zhenghua Xu, and Rafal Bogacz. Inferring neural activity before plasticity as a foundation for learning beyond backpropagation. Nature Neuroscience, 27(2):348–358, 2024
work page 2024
-
[31]
Kimi Team, Guangyu Chen, Yu Zhang, Jianlin Su, Weixin Xu, Siyuan Pan, Yaoyu Wang, Yucheng Wang, Guanduo Chen, Bohong Yin, Yutian Chen, Junjie Yan, Ming Wei, Y. Zhang, Fanqing Meng, Chao Hong, Xiaotong Xie, Shaowei Liu, Enzhe Lu, Yunpeng Tai, Yanru Chen, Xin Men, Haiqing Guo, Y. Charles, Haoyu Lu, Lin Sui, Jinguo Zhu, Zaida Zhou, Weiran He, Weixiao Huang...
work page 2026
-
[32]
Continual pre-training of MoEs: How robust is your router?
Benjamin Thérien, Charles-Étienne Joseph, Zain Sarwar, Ashwinee Panda, Anirban Das, Shi-Xiong Zhang, Stephen Rawls, Sambit Sahu, Eugene Belilovsky, and Irina Rish. Continual pre-training of MoEs: How robust is your router? Transactions on Machine Learning Research, 2025
work page 2025
-
[33]
Lemoe: Advanced mixture of experts adaptor for lifelong model editing of large language models
Renzhi Wang and Piji Li. Lemoe: Advanced mixture of experts adaptor for lifelong model editing of large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 2551–2575, 2024
work page 2024
-
[34]
Mixture of LoRA experts
Xun Wu, Shaohan Huang, and Furu Wei. Mixture of LoRA experts. In The Twelfth International Conference on Learning Representations, 2024
work page 2024
-
[35]
Predictive coding of reward in the hippocampus
Mohammad Yaghoubi, M Ganesh Kumar, Andres Nieto-Posadas, Coralie-Anne Mosser, Thomas Gisiger, Émmanuel Wilson, Cengiz Pehlevan, Sylvain Williams, and Mark P Brandon. Predictive coding of reward in the hippocampus. Nature, pages 1–7, 2026
work page 2026
-
[36]
Pushing mixture of experts to the limit: Extremely parameter efficient moe for instruction tuning
Ted Zadouri, Ahmet Üstün, Arash Ahmadian, Beyza Ermis, Acyr Locatelli, and Sara Hooker. Pushing mixture of experts to the limit: Extremely parameter efficient moe for instruction tuning. In The Twelfth International Conference on Learning Representations, 2024
work page 2024
-
[37]
Routing mamba: Scaling state space models with mixture-of-experts projection
Zheng Zhan, Liliang Ren, Shuohang Wang, Liyuan Liu, Yang Liu, Yeyun Gong, Yanzhi Wang, and Yelong Shen. Routing mamba: Scaling state space models with mixture-of-experts projection. In Advances in Neural Information Processing Systems (NeurIPS), 2025. NeurIPS 2025 poster
work page 2025
-
[38]
Mixture of routers
Jia-Chen Zhang, Yu-Jie Xiong, Xi-He Qiu, Chun-Ming Xia, Fei Dai, and Zheng Zhou. Mixture of routers. arXiv preprint arXiv:2503.23362, 2025
-
[39]
Towards lifelong learning of large language models: A survey
Junhao Zheng, Shengjie Qiu, Chengming Shi, and Qianli Ma. Towards lifelong learning of large language models: A survey. ACM Computing Surveys, 57(8):1–35, 2025
work page 2025
-
[40]
Llama-moe: Building mixture-of-experts from llama with continual pre-training
Tong Zhu, Xiaoye Qu, Daize Dong, Jiacheng Ruan, Jingqi Tong, Conghui He, and Yu Cheng. Llama-moe: Building mixture-of-experts from llama with continual pre-training. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 15913–15923, 2024