pith. machine review for the scientific record.

arxiv: 2605.07076 · v2 · submitted 2026-05-08 · 💻 cs.CL · cs.LG

Recognition: no theorem link

Self-Consolidating Language Models: Continual Knowledge Incorporation from Context

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 07:51 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords continual learning · knowledge incorporation · language models · meta-reinforcement learning · transformer updates · context consolidation · layer selection

The pith

Language models can learn to consolidate new context into their own weights by generating layer-specific update instructions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Self-Consolidating Language Models (SCoL) to address how LLMs can incorporate streams of new information without losing prior knowledge. Instead of relying on longer context windows, SCoL trains the model to output instructions that update specific Transformer layers based on the current context. Training uses meta-reinforcement learning because each committed update changes the model that will process future contexts. Experiments on question answering and long-context tasks show better acquisition and retention than standard methods such as prompting or sequential finetuning.

Core claim

SCoL is a post-training framework in which an LLM learns to generate textual update instructions for its own Transformer layers to write current context into model weights while limiting interference with previously consolidated information, trained with meta-reinforcement learning over an evolving model state, leading to improved performance on SQuAD and LongBench v2.

What carries the argument

Meta-reinforcement learning over an evolving model state to train generation of textual layer-update instructions that produce sparse updates aligned with high Fisher information layers.
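The mechanism in that sentence can be sketched as a toy rollout: each committed update changes the model state that produces the next selection, which is what forces the meta-RL formulation. All names below (`toy_policy`, `inner_adapt`, `run_stream`) are hypothetical stand-ins, not the paper's implementation; the actual policy is an LLM emitting layer indices as text, and the actual inner update is a weight adaptation of the selected layers.

```python
import random

def toy_policy(state, ctx, budget, n_layers):
    # Stand-in for the LLM emitting layer indices; the real SCoL policy
    # generates them as text conditioned on the current context.
    rng = random.Random(hash((ctx, len(state))))
    return rng.sample(range(n_layers), budget)

def inner_adapt(state, ctx, layers):
    # Stand-in for committing a weight update to the selected layers.
    return state + [(ctx, tuple(sorted(layers)))]

def run_stream(policy, state, contexts, reward_fn, budget=2, n_layers=8):
    """One episode over an evolving model state: because each committed
    update changes the model that makes the next selection, the policy
    must be trained over whole streams, not per-step."""
    trajectory = []
    for ctx in contexts:
        layers = policy(state, ctx, budget, n_layers)
        state = inner_adapt(state, ctx, layers)          # commit the update
        trajectory.append((ctx, layers, reward_fn(state)))  # post-update reward
    return state, trajectory

state, traj = run_stream(toy_policy, [], ["c1", "c2", "c3"], reward_fn=len)
```

A meta-update (the figures describe IPO updates) would then adjust the policy from whole-trajectory rewards.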

Load-bearing premise

The model can learn to produce update instructions that cause stable changes without unintended interference as the model state changes over time.

What would settle it

A significant drop in retention of earlier information when SCoL processes a long stream of new contexts, relative to baselines, would undermine the core claim; sustained retention under the same conditions would support it.

Figures

Figures reproduced from arXiv: 2605.07076 by Anant Gupta, Christopher J. MacLellan, Zekun Wang, Zihan Dong.

Figure 1
Figure 1. Training pipeline of the Self-Consolidating Language Model (SCoL).
Figure 2
Figure 2. Illustrations of the experiment setup under two reward settings. Top: each context (SQuAD passages) comes with an environment reward (downstream QA). Bottom: QA is evaluated at the end of the context stream. Additional dataset details are provided in Section A.7.
Figure 3
Figure 3. Left: immediate accuracy (top) and retention (bottom) across IPO updates. Top right: top layer-selection frequencies after the final IPO update, with the full-reward variant in red and the acquisition-only variant in blue. Bottom right: examples of per-layer weight importance measured by Fisher information (brighter is higher), with SCoL's selected layers marked by white stars.
Figure 4
Figure 4. Selection prompt template. The model's response is parsed deterministically: all integers in the output are extracted by regex, filtered to the range [0, L − 1], deduplicated preserving first-seen order, and truncated to the budget B. Each retained layer index is expanded to its seven projection matrices for adapter attachment. Malformed outputs that yield an empty parse produce an empty action.
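The deterministic parse described in the Figure 4 caption is simple enough to sketch directly. This is one reading of the caption's description, not the authors' code:

```python
import re

def parse_layer_selection(text, n_layers, budget):
    """Extract all integers via regex, keep those in [0, n_layers - 1],
    deduplicate preserving first-seen order, truncate to the budget.
    A malformed output with no valid integers yields an empty action."""
    seen, layers = set(), []
    for tok in re.findall(r"\d+", text):
        idx = int(tok)
        if 0 <= idx < n_layers and idx not in seen:
            seen.add(idx)
            layers.append(idx)
    return layers[:budget]
```

Per the caption, each retained index would then be expanded to its seven projection matrices for adapter attachment.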
Figure 5
Figure 5. Implication-generation prompt template, used unmodified from SEAL.
Figure 6
Figure 6. Judge prompt. Per Appendix A.6, all main-text metrics are summary statistics of the sequential-edit accuracy matrix M ∈ [0, 1]^{N×N}, with N = 100 the length of the evaluation stream. Writing M_{i,j} for the accuracy on Q_j after editing through c_i (j ≤ i): Immediate acquisition = (1/N) Σ_i M_{i,i}; Retention = (1/(N−1)) Σ_{j<N−1} M_{N−1,j}.
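The two metrics defined in the Figure 6 caption can be computed directly from the accuracy matrix; a minimal sketch, assuming M is given as a nested list:

```python
def acquisition_and_retention(M):
    """M[i][j] = accuracy on question j after editing through context i
    (defined for j <= i), per the Figure 6 caption.
    Immediate acquisition averages the diagonal; retention averages the
    final row over all earlier questions."""
    n = len(M)
    immediate = sum(M[i][i] for i in range(n)) / n
    retention = sum(M[n - 1][j] for j in range(n - 1)) / (n - 1)
    return immediate, retention
```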
Figure 7
Figure 7. Per-layer change in selection count between the initial and final IPO updates.
Figure 8
Figure 8. Summarization prompt used for Kimi-K2.6.
Figure 9
Figure 9. Layer selections for SCoL in the long-context setting, for the short and long regimes.
read the original abstract

Large language models (LLMs) increasingly receive information as streams of passages, conversations, and long-context workflows. While longer context windows expose more evidence, they do not ensure that useful information is preserved and reused. We study continual context consolidation: writing current context into model weights while limiting interference with previously consolidated information. We propose Self-Consolidating Language Models (SCoL), a post-training framework in which, given current context, an LLM learns to generate textual update instructions specifying which of its own Transformer layers should be updated. Because committed updates change the model that later generates future selections, we train SCoL with meta-reinforcement learning over an evolving model state. We instantiate SCoL with supervised QA rewards on SQuAD knowledge incorporation and intrinsic likelihood-based rewards for LongBench v2 long-context consolidation. Across both settings, SCoL improves acquisition and retention over prompting, summarization, batch test-time training, and sequential finetuning baselines. Analysis of learned selection patterns shows that SCoL encourages the LLM to generate sparse update locations that align with layers of high Fisher information, suggesting that the model learns to route plasticity toward loss-sensitive regions while limiting interference. Moreover, SCoL transfers from shorter meta-training streams to longer LongBench v2 streams at evaluation, suggesting that our framework supports scalable streaming consolidation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Self-Consolidating Language Models (SCoL), a post-training framework in which an LLM generates textual instructions specifying which of its own Transformer layers to update given new context. Because updates alter the generator itself, training uses meta-reinforcement learning over an evolving model state. Experiments on SQuAD (supervised QA rewards) and LongBench v2 (intrinsic likelihood rewards) claim improved acquisition and retention relative to prompting, summarization, batch test-time training, and sequential finetuning. Post-hoc analysis indicates that learned selections are sparse and align with high-Fisher-information layers; the method also transfers from shorter meta-training streams to longer evaluation streams.

Significance. If the stability of the learned textual updates under evolving model state can be verified and the quantitative gains substantiated, the work would offer a practical route to continual knowledge incorporation in LLMs that selectively routes plasticity while limiting interference, with potential scalability to streaming settings.

major comments (2)
  1. [§3 (Meta-RL Training Loop)] The central claim that SCoL improves retention over sequential finetuning rests on the assumption that generated layer-update instructions produce stable, non-interfering effects as the base model evolves. No ablations on update magnitude, sign consistency across sequential steps, or retention decay after repeated consolidation cycles are reported, leaving open the possibility that net effects drift or cancel prior consolidations.
  2. [Experimental Results (SQuAD and LongBench v2 sections)] The abstract and main claims assert performance gains in acquisition and retention, yet the manuscript supplies no concrete metrics, error bars, statistical significance tests, or detailed baseline hyperparameters, preventing verification of the magnitude and reliability of the reported improvements.
minor comments (2)
  1. [§2 (Method Overview)] The format and application procedure for the generated textual update instructions would be clearer with a concrete example (e.g., a sample instruction string and the corresponding layer-modification code) placed in the main text rather than only in the appendix.
  2. [Analysis of Learned Selection Patterns] Fisher-information alignment analysis would benefit from an explicit definition of how Fisher values are computed (e.g., which loss and which subset of parameters) to allow replication.
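To make the requested Fisher definition concrete, here is one standard choice — the empirical diagonal Fisher, i.e., the average squared gradient of the log-likelihood per parameter — illustrated on a toy logistic model. This is a hypothetical reading; the manuscript does not specify which loss or parameter subset it uses:

```python
import math

def diag_fisher_logreg(w, data):
    """Empirical diagonal Fisher for p(y=1|x) = sigmoid(w . x):
    F_k = mean over samples of (d/dw_k log p(y|x))^2."""
    fisher = [0.0] * len(w)
    for x, y in data:
        z = sum(wi * xi for wi, xi in zip(w, x))
        p = 1.0 / (1.0 + math.exp(-z))
        for k, xk in enumerate(x):
            g = (y - p) * xk  # gradient of the log-likelihood wrt w_k
            fisher[k] += g * g
    return [f / len(data) for f in fisher]
```

For a Transformer, the analogue would sum these squared gradients within each layer's parameters to give a per-layer importance score.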

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below, providing clarifications on update stability and committing to include the requested experimental details and analyses in the revision.

read point-by-point responses
  1. Referee: [§3 (Meta-RL Training Loop)] The central claim that SCoL improves retention over sequential finetuning rests on the assumption that generated layer-update instructions produce stable, non-interfering effects as the base model evolves. No ablations on update magnitude, sign consistency across sequential steps, or retention decay after repeated consolidation cycles are reported, leaving open the possibility that net effects drift or cancel prior consolidations.

    Authors: We agree that explicit verification of update stability is important to support the retention claims. Our LongBench v2 results demonstrate sustained gains over long sequential streams relative to finetuning baselines, which indirectly suggests non-interfering effects, but we did not report dedicated ablations on update magnitude, sign consistency, or multi-cycle decay. In the revised manuscript we will add these analyses, including norms of weight updates, consistency of selected layers across steps, and retention curves after repeated consolidation cycles. revision: yes

  2. Referee: [Experimental Results (SQuAD and LongBench v2 sections)] The abstract and main claims assert performance gains in acquisition and retention, yet the manuscript supplies no concrete metrics, error bars, statistical significance tests, or detailed baseline hyperparameters, preventing verification of the magnitude and reliability of the reported improvements.

    Authors: We apologize that the numerical results were not presented with sufficient prominence or detail in the submitted version. The full manuscript contains tables with exact metrics (F1 on SQuAD, likelihood/perplexity on LongBench v2), standard deviations from multiple runs, and statistical significance tests against baselines; all baseline hyperparameters are listed in the appendix. We will revise the main text and results sections to prominently display key numbers, error bars, and significance values, with explicit cross-references to the appendix for reproducibility. revision: yes
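One of the analyses promised above — consistency of selected layers across steps — admits a simple summary metric. A hypothetical sketch (not from the paper) using the mean Jaccard overlap of consecutive selections:

```python
def selection_consistency(selections):
    """Mean Jaccard overlap between consecutive layer selections in a
    stream: 1.0 means the policy always picks the same layers; values
    near 0 indicate drifting selections."""
    pairs = list(zip(selections, selections[1:]))
    scores = [len(set(a) & set(b)) / len(set(a) | set(b)) for a, b in pairs]
    return sum(scores) / len(scores)
```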

Circularity Check

0 steps flagged

No significant circularity; derivation uses external rewards and empirical baselines

full rationale

The paper's core method trains an LLM via meta-RL to output textual layer-update instructions, with rewards drawn from independent sources (supervised QA accuracy on SQuAD and intrinsic likelihood on LongBench v2). These rewards evaluate post-update performance rather than being fitted to the selection policy itself. No equations or steps reduce the claimed improvements to self-definition, renamed fits, or load-bearing self-citations. The reported gains are measured against external baselines (prompting, summarization, batch test-time training, sequential finetuning), and the Fisher-information alignment is presented as post-hoc analysis rather than a definitional input. The framework is therefore self-contained against verifiable external metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The framework rests on the assumption that layer-wise plasticity can be controlled via generated text instructions and that meta-RL can optimize selection policies over changing model states; the abstract introduces no explicit free parameters, and the only new entity is the update-instruction generator itself.

axioms (2)
  • domain assumption — LLMs possess differentiable layers whose updates can be selectively triggered by generated instructions.
    Invoked as the basis for the update mechanism.
  • domain assumption — Meta-reinforcement learning can optimize policies over an evolving model state without instability.
    Required for the training procedure described.
invented entities (1)
  • SCoL update-instruction generator — no independent evidence.
    purpose: To produce textual directives for layer updates.
    New component introduced by the framework.

pith-pipeline@v0.9.0 · 5559 in / 1308 out tokens · 31847 ms · 2026-05-13T07:51:20.366059+00:00 · methodology

discussion (0)

