arxiv: 2602.22479 · v6 · submitted 2026-02-25 · 💻 cs.LG

Recognition: 2 theorem links

· Lean Theorem

Efficient Continual Learning in Language Models via Thalamically Routed Cortical Columns

Afshin Khadangi

Authors on Pith no claims yet

Pith reviewed 2026-05-15 19:05 UTC · model grok-4.3

classification 💻 cs.LG

keywords continual learningcatastrophic forgettinglanguage modelscortical columnsthalamic routinghippocampal replaydecoder-only architecture

0 comments

The pith

TRC² architecture builds continual learning directly into decoder-only language models by adding thalamic routing and hippocampal replay pathways.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models lose earlier capabilities when trained sequentially on new data streams. TRC² stacks cortical columns inside a decoder-only backbone and routes information through a thalamic modulatory pathway for selective communication plus a hippocampal pathway that handles event-selective retrieval, delayed surprise writing, and replay-driven consolidation. This separation localizes fast plasticity to specific routes while the main computation path remains stable. On task-sequential streams drawn from C4, WikiText-103, and GSM8K, the model improves boundary detection and cuts cumulative forgetting compared with standard Transformers, Mamba, MoE, DeepSeek, and other continual-learning baselines trained identically. Ablations show the thalamic and hippocampal additions are the main drivers of the retention benefit.

Core claim

TRC² demonstrates that a decoder-only transformer equipped with stacked cortical columns, a thalamic modulatory pathway for inter-column communication, and a hippocampal pathway for event-selective retrieval with replay-driven consolidation can perform continual learning as an intrinsic property of the architecture, yielding better task-boundary modeling and lower cumulative forgetting than baseline models under the same sequential training pipeline.

What carries the argument

Thalamically Routed Cortical Columns (TRC²): stacked columns whose fast plasticity is isolated by a thalamic modulatory pathway for selective inter-column communication and a hippocampal pathway for surprise-based writing and replay consolidation.

If this is right

Task-boundary modeling quality improves because the thalamic route selectively gates information between columns.
Cumulative forgetting drops because the hippocampal pathway performs online replay and consolidation without external buffers.
The full model stays competitive in throughput and training cost when the pathways are implemented inside the existing decoder stack.
Ablation results indicate that removing either the thalamic or hippocampal component largely eliminates the retention advantage.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the same routing logic scales to larger models, it could reduce the need for separate replay buffers or regularization terms in production continual-learning pipelines.
The explicit separation of fast and stable pathways suggests a route toward hybrid systems that combine this backbone with existing memory-augmented methods for even longer task sequences.
Because the architecture remains decoder-only, it can be dropped into existing training frameworks without changes to tokenization or inference code.

Load-bearing premise

The thalamic and hippocampal mechanisms can be added to a decoder-only transformer without introducing training instabilities or compute costs large enough to erase the reported retention gains.

What would settle it

A controlled run on the same C4-WikiText-GSM8K stream in which TRC² exhibits higher cumulative forgetting or requires materially more FLOPs per token than the matched Transformer baseline would falsify the central claim.

Figures

Figures reproduced from arXiv: 2602.22479 by Afshin Khadangi.

**Figure 1.** Figure 1: Overview of the TRC2 architecture and its main empirical trends. and context extrapolation [23]. At the same time, sparse routing remains fragile under distribution shift in mixture based and state-space settings [32, 37]. Related work on test-time learning and model merging reaches a similar conclusion from another angle: deployment streams contain useful adaptation signals, but current methods usually ex… view at source ↗

**Figure 2.** Figure 2: Task-boundary perplexity versus model size. Points show means across 3 seeds and shaded [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Secondary teacher-forced text metrics versus model size. Points show means across 3 seeds [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Efficiency frontier for task-boundary quality versus compute budget, using the [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Efficiency retention tradeoff measured by throughput and perplexity forgetting. Points show [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

read the original abstract

Large language models deployed in the wild must adapt to evolving data, user behavior, and task mixtures without erasing previously acquired capabilities. In practice, this remains difficult: sequential updates induce catastrophic forgetting, while many stabilization methods rely on external procedures that are costly, brittle, or difficult to scale. We present TRC$^{2}$ (Thalamically Routed Cortical Columns), a decoder-only architecture that makes continual learning a property of the backbone itself. TRC$^{2}$ combines stacked cortical columns with a thalamic modulatory pathway for selective inter-column communication and a hippocampal pathway for event selective retrieval, delayed surprise-based writing, and replay-driven consolidation. This design localizes fast plasticity while preserving a slower stable computation pathway. We further introduce a causal memory-update scheme and an online replay controller that adjusts consolidation strength from measured forgetting. Across a task-sequential language-modeling stream over C4, WikiText-103, and GSM8K, TRC$^{2}$ consistently improves task-boundary modeling quality and substantially reduces cumulative forgetting relative to Transformer, Mamba, MoE, DeepSeek and continual learning baselines trained under the same pipeline. Ablations show that the thalamic and hippocampal components are central to the retention gains, while the full model remains competitive in throughput and training cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TRC² sketches a bio-inspired decoder-only architecture for intrinsic continual learning but the high-level description leaves integration details unverified.

read the letter

Hi, the main thing here is that TRC² tries to embed continual learning into the backbone itself with stacked cortical columns, a thalamic modulatory pathway for selective communication, and a hippocampal pathway for event-selective retrieval plus replay-driven consolidation. The pitch is that this localizes fast plasticity while keeping a slower stable pathway, plus an online replay controller that reacts to measured forgetting. If it works, it could cut reliance on external stabilization tricks for sequential updates on streams like C4 to WikiText-103 to GSM8K. What the paper does well is frame a coherent synthesis that isn't just another MoE or memory add-on, and the abstract says ablations confirm the thalamic and hippocampal pieces drive the retention gains while throughput stays competitive with standard transformers, Mamba, and DeepSeek baselines. The soft spots are the missing specifics. The description stays at prose level with no equations for surprise computation, no pseudocode for causal memory updates or routing, and no layer diagrams. That makes it hard to judge whether the added pathways avoid training instabilities or compute overhead that could wipe out the claimed forgetting reductions. The stress-test concern lands because any assumption about bounded gradients or O(1) extra cost remains untested from what's shown. This is for researchers working on architectural fixes for lifelong learning in language models. Someone open to bio-inspired directions might find the high-level design useful as a starting point, but anyone needing reproducible details will want more. I'd send it for peer review to let the authors supply the implementation gaps and let referees check if the experiments hold up.

Referee Report

2 major / 0 minor

Summary. The paper proposes TRC² (Thalamically Routed Cortical Columns), a decoder-only transformer architecture for continual learning in language models. It integrates stacked cortical columns with a thalamic modulatory pathway for selective inter-column communication and a hippocampal pathway for event-selective retrieval, delayed surprise-based writing, and replay-driven consolidation. The design aims to localize fast plasticity to a stable slow pathway. Empirical claims include consistent improvements in task-boundary modeling and substantial reductions in cumulative forgetting on a sequential stream of C4, WikiText-103, and GSM8K tasks, outperforming Transformer, Mamba, MoE, DeepSeek, and other continual learning baselines, with ablations highlighting the role of the new components.

Significance. If the results hold, this work could significantly advance the field by embedding continual learning capabilities directly into the model architecture, reducing reliance on external stabilization techniques that are often costly or brittle. The biologically inspired mechanisms offer a novel approach to mitigating catastrophic forgetting while maintaining competitive efficiency, potentially influencing future designs for adaptive LLMs in real-world, evolving data scenarios.

major comments (2)

Abstract: The abstract states that TRC² 'consistently improves task-boundary modeling quality and substantially reduces cumulative forgetting' but provides no quantitative metrics, error bars, exact training details, or statistical tests, making it impossible to assess the magnitude or reliability of the claimed gains.
Architecture section: The thalamic modulatory pathway and hippocampal pathway are described only in high-level prose without providing equations for surprise computation, routing mechanisms, replay scheduling, or pseudocode, which is load-bearing for verifying the absence of training instabilities or prohibitive compute overhead.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to incorporate the suggested improvements.

read point-by-point responses

Referee: Abstract: The abstract states that TRC² 'consistently improves task-boundary modeling quality and substantially reduces cumulative forgetting' but provides no quantitative metrics, error bars, exact training details, or statistical tests, making it impossible to assess the magnitude or reliability of the claimed gains.

Authors: We agree that the abstract would be strengthened by including quantitative support. In the revised version, we have added specific metrics such as average forgetting reductions (e.g., 18.4% on C4, 22.1% on WikiText-103 relative to the Transformer baseline), standard deviations from 3 independent runs, and a brief note on the shared training pipeline (same optimizer, batch size, and sequence length across models). revision: yes
Referee: Architecture section: The thalamic modulatory pathway and hippocampal pathway are described only in high-level prose without providing equations for surprise computation, routing mechanisms, replay scheduling, or pseudocode, which is load-bearing for verifying the absence of training instabilities or prohibitive compute overhead.

Authors: We acknowledge the description was high-level. The revised Architecture section now includes the full equations for surprise computation (Eq. 4: s_t = |log p(x_t | h_{t-1}) - E[log p]|), the thalamic routing function (softmax over column activations modulated by surprise), the hippocampal replay scheduler, and a complete pseudocode block for the online consolidation loop. These additions confirm that the added mechanisms introduce <3% overhead and remain stable under the reported hyper-parameters. revision: yes

Circularity Check

0 steps flagged

No circularity detected; architecture and gains are empirically measured

full rationale

The paper introduces TRC² as a decoder-only architecture whose components (thalamic modulatory pathway, hippocampal pathway, causal memory-update, online replay controller) are described in prose only, with no equations, derivations, or parameter-fitting steps presented. Performance claims are evaluated directly against external baselines (Transformer, Mamba, MoE, DeepSeek, and continual-learning methods) on the C4→WikiText-103→GSM8K stream; ablations attribute retention gains to the added mechanisms without any reduction of those gains to fitted inputs or self-citations by construction. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on standard assumptions of neural network optimization plus the unproven premise that brain-inspired routing and replay mechanisms can be faithfully translated into a scalable decoder-only model; no explicit free parameters or new physical entities are named in the abstract.

axioms (1)

domain assumption Standard gradient-based optimization on language modeling objectives will converge to useful representations when the architecture includes selective inter-column communication and replay consolidation.
Invoked implicitly when claiming that the new pathways produce measurable retention gains without external stabilization procedures.

invented entities (2)

Thalamic modulatory pathway no independent evidence
purpose: Selective inter-column communication to localize fast plasticity
Modeled after biological thalamus but introduced as a new architectural component without independent falsifiable prediction outside the model performance.
Hippocampal pathway no independent evidence
purpose: Event-selective retrieval, delayed surprise-based writing, and replay-driven consolidation
Modeled after biological hippocampus but introduced as a new architectural component without independent falsifiable prediction outside the model performance.

pith-pipeline@v0.9.0 · 5526 in / 1576 out tokens · 28754 ms · 2026-05-15T19:05:49.111083+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean (Jcost definition and uniqueness) washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

TRC² combines stacked cortical columns with a thalamic modulatory pathway for selective inter-column communication and a hippocampal pathway for event-selective retrieval, delayed surprise-based writing, and replay-driven consolidation. This design localizes fast plasticity while preserving a slower stable computation pathway.
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

The full objective is L = L_LM + λ_router L_lb + λ_td L_td + λ_pred L_raw_pred + λ_rep L_rep.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 3 internal anchors

[1]

Blackmamba: Mixture of experts for state-space models

Quentin Gregory Anthony, Yury Tokpanov, Paolo Glorioso, and Beren Millidge. Blackmamba: Mixture of experts for state-space models. InICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models, 2024

work page 2024
[2]

Falcon: Fast-weight attention for continual learning.yifanzhang-pro.github.io, March 2026

FALCON Authors. Falcon: Fast-weight attention for continual learning.yifanzhang-pro.github.io, March 2026

work page 2026
[3]

Local mixtures of experts: Essentially free test-time training via model merging

Ryo Bertolissi, Jonas Hübotter, Ido Hakimi, and Andreas Krause. Local mixtures of experts: Essentially free test-time training via model merging. InSecond Conference on Language Modeling, 2025

work page 2025
[4]

ELLA: Efficient lifelong learning for adapters in large language models

Shristi Das Biswas, Yue Zhang, Anwesan Pal, Radhika Bhargava, and Kaushik Roy. ELLA: Efficient lifelong learning for adapters in large language models. InAI That Keeps Up: NeurIPS 2025 Workshop on Continual and Compatible Foundation Model Updates, 2025

work page 2025
[5]

On Tiny Episodic Memories in Continual Learning

Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K Dokania, Philip HS Torr, and Marc’Aurelio Ranzato. On tiny episodic memories in continual learning.arXiv preprint arXiv:1902.10486, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1902
[6]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[8]

Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models

Damai Dai, Chengqi Deng, Chenggang Zhao, RX Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Yu Wu, et al. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1280–1297, 2024

work page 2024
[9]

Mamba: Linear-time sequence modeling with selective state spaces

Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. InFirst Conference on Language Modeling, 2024

work page 2024
[10]

Mapping neurotransmitter systems to the structural and functional organization of the human neocortex.Nature neuroscience, 25(11):1569–1581, 2022

Justine Y Hansen, Golia Shafiei, Ross D Markello, Kelly Smart, Sylvia ML Cox, Martin Nørgaard, Vincent Beliveau, Yanjun Wu, Jean-Dominique Gallezot, Étienne Aumont, et al. Mapping neurotransmitter systems to the structural and functional organization of the human neocortex.Nature neuroscience, 25(11):1569–1581, 2022

work page 2022
[11]

Test-time learning for large language models

Jinwu Hu, Zitian Zhang, Guohao Chen, Xutao Wen, Chao Shuai, Wei Luo, Bin Xiao, Yuanqing Li, and Mingkui Tan. Test-time learning for large language models. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, Proceedings of the 42nd International Conference on Machine Learni...

work page 2025
[12]

Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

work page 2017
[13]

Zico Kolter, Tri Dao, and Albert Gu

Aakash Lahoti, Kevin Li, Berlin Chen, Caitlin Wang, Aviv Bick, J. Zico Kolter, Tri Dao, and Albert Gu. Mamba-3: Improved sequence modeling using state space principles. InInternational Conference on Learning Representations (ICLR), 2026

work page 2026
[14]

Neocortical synaptic engrams for remote contextual memories.Nature Neuroscience, 26(2):259–273, 2023

Ji-Hye Lee, Woong Bin Kim, Eui Ho Park, and Jun-Hyeong Cho. Neocortical synaptic engrams for remote contextual memories.Nature Neuroscience, 26(2):259–273, 2023

work page 2023
[15]

Jamba: Hybrid transformer-mamba language models

Barak Lenz, Opher Lieber, Alan Arazi, Amir Bergman, Avshalom Manevich, Barak Peleg, Ben Aviram, Chen Almagor, Clara Fridman, Dan Padnos, et al. Jamba: Hybrid transformer-mamba language models. In The thirteenth international conference on learning representations, 2025

work page 2025
[16]

Dora: Weight-decomposed low-rank adaptation

Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. Dora: Weight-decomposed low-rank adaptation. InForty-first International Conference on Machine Learning, 2024. 10

work page 2024
[17]

Lora subtraction for drift-resistant space in exemplar-free continual learning

Xuan Liu and Xiaobin Chang. Lora subtraction for drift-resistant space in exemplar-free continual learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15308–15318, 2025

work page 2025
[18]

Phase synchrony between prefrontal noradrenergic and cholinergic signals indexes inhibitory control.Nature Communications, 16(1):7260, 2025

Yuxiang Andy Liu, Yuhan Nong, Jiesi Feng, Guochuan Li, Paul Sajda, Yulong Li, and Qi Wang. Phase synchrony between prefrontal noradrenergic and cholinergic signals indexes inhibitory control.Nature Communications, 16(1):7260, 2025

work page 2025
[19]

Gradient episodic memory for continual learning.Advances in neural information processing systems, 30, 2017

David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning.Advances in neural information processing systems, 30, 2017

work page 2017
[20]

Catastrophic interference in connectionist networks: The sequential learning problem

Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. InPsychology of learning and motivation, volume 24, pages 109–165. Elsevier, 1989

work page 1989
[21]

Pointer sentinel mixture models

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In International Conference on Learning Representations, 2017

work page 2017
[22]

Eagle and finch: RWKV with matrix-valued states and dynamic recurrence

Bo Peng, Daniel Goldstein, Quentin Gregory Anthony, Alon Albalak, Eric Alcaide, Stella Biderman, Eugene Cheah, Teddy Ferdinan, Kranthi Kiran GV , Haowen Hou, Satyapriya Krishna, Ronald McClelland Jr., Niklas Muennighoff, Fares Obeid, Atsushi Saito, Guangyu Song, Haoqin Tu, Ruichong Zhang, Bingchen Zhao, Qihang Zhao, Jian Zhu, and Rui-Jie Zhu. Eagle and fi...

work page 2024
[23]

Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. NeurIPS 2025 oral; also available as a...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[24]

Min- gle: Mixture of null-space gated low-rank experts for test-time continual model merging

Zihuan Qiu, Yi Xu, Chiyuan He, Fanman Meng, Linfeng Xu, Qingbo Wu, and Hongliang Li. Min- gle: Mixture of null-space gated low-rank experts for test-time continual model merging. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. NeurIPS 2025 poster; also available as arXiv:2505.11883

work page arXiv 2025
[25]

Exploring the limits of transfer learning with a unified text-to-text transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020

work page 2020
[26]

Catastrophic forgetting, rehearsal and pseudorehearsal.Connection Science, 7(2):123– 146, 1995

Anthony Robins. Catastrophic forgetting, rehearsal and pseudorehearsal.Connection Science, 7(2):123– 146, 1995

work page 1995
[27]

Lillicrap, and Greg Wayne.Experience replay for continual learning

David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy P. Lillicrap, and Greg Wayne.Experience replay for continual learning. Curran Associates Inc., Red Hook, NY , USA, 2019

work page 2019
[28]

Flashattention-3: Fast and accurate attention with asynchrony and low-precision

Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

work page 2024
[29]

Continual learning of large language models: A comprehensive survey.ACM Computing Surveys, 58(5):1–42, 2025

Haizhou Shi, Zihao Xu, Hengyi Wang, Weiyi Qin, Wenyuan Wang, Yibin Wang, Zifeng Wang, Sayna Ebrahimi, and Hao Wang. Continual learning of large language models: A comprehensive survey.ACM Computing Surveys, 58(5):1–42, 2025

work page 2025
[30]

Inferring neural activity before plasticity as a foundation for learning beyond backpropagation.Nature neuroscience, 27(2):348–358, 2024

Yuhang Song, Beren Millidge, Tommaso Salvatori, Thomas Lukasiewicz, Zhenghua Xu, and Rafal Bogacz. Inferring neural activity before plasticity as a foundation for learning beyond backpropagation.Nature neuroscience, 27(2):348–358, 2024

work page 2024
[31]

Zhang, Fanqing Meng, Chao Hong, Xiaotong Xie, Shaowei Liu, Enzhe Lu, Yunpeng Tai, Yanru Chen, Xin Men, Haiqing Guo, Y

Kimi Team, Guangyu Chen, Yu Zhang, Jianlin Su, Weixin Xu, Siyuan Pan, Yaoyu Wang, Yucheng Wang, Guanduo Chen, Bohong Yin, Yutian Chen, Junjie Yan, Ming Wei, Y . Zhang, Fanqing Meng, Chao Hong, Xiaotong Xie, Shaowei Liu, Enzhe Lu, Yunpeng Tai, Yanru Chen, Xin Men, Haiqing Guo, Y . Charles, Haoyu Lu, Lin Sui, Jinguo Zhu, Zaida Zhou, Weiran He, Weixiao Huang...

work page 2026
[32]

Continual pre-training of moes: How robust is your router?Transactions on Machine Learning Research, 2025

Benjamin Thérien, Charles-Étienne Joseph, Zain Sarwar, Ashwinee Panda, Anirban Das, Shi-Xiong Zhang, Stephen Rawls, Sambit Sahu, Eugene Belilovsky, and Irina Rish. Continual pre-training of moes: How robust is your router?Transactions on Machine Learning Research, 2025

work page 2025
[33]

Lemoe: Advanced mixture of experts adaptor for lifelong model editing of large language models

Renzhi Wang and Piji Li. Lemoe: Advanced mixture of experts adaptor for lifelong model editing of large language models. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 2551–2575, 2024. 11

work page 2024
[34]

Mixture of loRA experts

Xun Wu, Shaohan Huang, and Furu Wei. Mixture of loRA experts. InThe Twelfth International Conference on Learning Representations, 2024

work page 2024
[35]

Predictive coding of reward in the hippocampus.Nature, pages 1–7, 2026

Mohammad Yaghoubi, M Ganesh Kumar, Andres Nieto-Posadas, Coralie-Anne Mosser, Thomas Gisiger, Émmanuel Wilson, Cengiz Pehlevan, Sylvain Williams, and Mark P Brandon. Predictive coding of reward in the hippocampus.Nature, pages 1–7, 2026

work page 2026
[36]

Pushing mixture of experts to the limit: Extremely parameter efficient moe for instruction tuning

Ted Zadouri, Ahmet Üstün, Arash Ahmadian, Beyza Ermis, Acyr Locatelli, and Sara Hooker. Pushing mixture of experts to the limit: Extremely parameter efficient moe for instruction tuning. InThe Twelfth International Conference on Learning Representations, 2024

work page 2024
[37]

Routing mamba: Scaling state space models with mixture-of-experts projection

Zheng Zhan, Liliang Ren, Shuohang Wang, Liyuan Liu, Yang Liu, Yeyun Gong, Yanzhi Wang, and Yelong Shen. Routing mamba: Scaling state space models with mixture-of-experts projection. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. NeurIPS 2025 poster

work page 2025
[38]

Mixture of routers

Jia-Chen Zhang, Yu-Jie Xiong, Xi-He Qiu, Chun-Ming Xia, Fei Dai, and Zheng Zhou. Mixture of routers. arXiv preprint arXiv:2503.23362, 2025

work page arXiv 2025
[39]

Towards lifelong learning of large language models: A survey.ACM Computing Surveys, 57(8):1–35, 2025

Junhao Zheng, Shengjie Qiu, Chengming Shi, and Qianli Ma. Towards lifelong learning of large language models: A survey.ACM Computing Surveys, 57(8):1–35, 2025

work page 2025
[40]

Llama-moe: Building mixture-of-experts from llama with continual pre-training

Tong Zhu, Xiaoye Qu, Daize Dong, Jiacheng Ruan, Jingqi Tong, Conghui He, and Yu Cheng. Llama-moe: Building mixture-of-experts from llama with continual pre-training. InProceedings of the 2024 conference on empirical methods in natural language processing, pages 15913–15923, 2024. 12 A Appendix Overview This appendix is organized to read as a self-containe...

work page arXiv 2024