pith. machine review for the scientific record.

arxiv: 2605.15156 · v1 · submitted 2026-05-14 · 💻 cs.CL · cs.AI · cs.LG

Recognition: no theorem link

MeMo: Memory as a Model

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 03:13 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords memory model · knowledge incorporation · LLM updating · retrieval augmentation · modular architecture · catastrophic forgetting · plug-and-play · cross-document reasoning

The pith

A dedicated memory model encodes new knowledge so LLMs can use it without changing parameters or retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models stay frozen after pretraining, so adding timely domain-specific information normally requires costly updates that risk overwriting prior skills. MeMo solves this by storing the new information inside a separate memory model that operates alongside the unchanged LLM. The memory model captures relationships that span multiple documents, stays effective when retrieval returns imperfect results, and keeps the LLM from forgetting earlier knowledge. Because the approach needs no access to the LLM weights or output scores, it works as a plug-in for both open models and closed commercial ones. On three question-answering benchmarks the method matches or exceeds prior techniques while keeping retrieval cost constant even as the knowledge collection grows large.
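
To make the plug-and-play claim concrete, here is a minimal sketch of how such a pairing could look from the outside, assuming only that both models expose an ordinary text-generation call; the prompts, method names, and two-step flow are illustrative assumptions, not MeMo's actual interface.

    # Minimal sketch (assumed interface, not MeMo's code): the memory model and the
    # executive LLM interact only through plain text-generation calls, so neither
    # the weights nor the logits of the frozen LLM are ever needed.
    def answer(query: str, memory_model, executive_llm) -> str:
        # The trained memory model recalls corpus knowledge relevant to the query,
        # including links that span multiple documents.
        recalled = memory_model.generate(f"Recall knowledge relevant to: {query}")
        # The frozen executive LLM (open or closed-source) answers from that recall;
        # this call does not grow with the size of the underlying corpus.
        return executive_llm.generate(
            f"Context:\n{recalled}\n\nQuestion: {query}\nAnswer:"
        )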

Core claim

MeMo is a modular framework that encodes new knowledge into a dedicated memory model while keeping the LLM parameters unchanged, thereby capturing complex cross-document relationships, remaining robust to retrieval noise, avoiding catastrophic forgetting, and maintaining retrieval costs independent of corpus size at inference time.

What carries the argument

The dedicated memory model that stores encoded knowledge separately from the frozen LLM and supplies it at inference without internal access to the language model.

If this is right

  • LLM parameters remain fixed, so prior knowledge is not overwritten during updates.
  • Retrieval cost stays constant regardless of how large the external knowledge collection becomes.
  • The framework integrates directly with closed-source LLMs that expose no weights or logits.
  • Relationships spanning multiple documents are handled inside the memory model rather than at the LLM level.
  • Performance remains competitive on narrative, multi-hop, and browsing-style question answering.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The fixed-cost retrieval property could support continuously growing knowledge bases in production systems where full retraining is impossible.
  • Real-time domains such as news or research archives might adopt this pattern to refresh information without service interruption.
  • The same separation could be combined with existing retrieval pipelines to handle extremely large or rapidly changing corpora.

Load-bearing premise

A separate memory model can accurately encode and retrieve complex cross-document knowledge without any access to the LLM's weights or output logits.

What would settle it

A controlled test on a benchmark rich in cross-document links and high retrieval noise where MeMo performance drops below methods that fine-tune the LLM or read its internal states.

Figures

Figures reproduced from arXiv: 2605.15156 by Alfred Wei Lun Leong, Alok Prakash, Armando Solar-Lezama, Arun Verma, Bryan Kian Hsiang Low, Daniela Rus, Nancy F. Chen, Ryan Wei Heng Quek, Sanghyuk Lee.

Figure 1
Figure 1. Overview of the training and inference pipeline of MeMo. During memory model training (left), a frozen generator model transforms a target corpus into a reflection QA dataset via fact extraction, consolidation, verification, entity surfacing, and cross-document synthesis, which is then used to train a dedicated memory model. During inference (right), the frozen executive model answers complex user queries … view at source ↗
Figure 2
Figure 2. Cost–accuracy trade-off on NarrativeQA when a second corpus arrives (K=2, memory model = Qwen2.5-14B-Instruct, 8×H100). Cumulative training cost is shown on the x-axis (one Qwen-14B SFT run takes ≈ 24 GPU-hours on a 640k-QA-pair corpus). Merging trains the memory model only on the new corpus, costing X+Y ≈ 48 GPU-hours, while full retraining re-runs on the union, costing X+(X+Y) ≈ 72 GPU-hours — a 33% savin… view at source ↗
Figure 3
Figure 3. BrowseComp-Plus accuracy (%) vs. training epoch (Full SFT) for each … view at source ↗
Figure 4
Figure 4. NarrativeQA accuracy (%) vs. training epoch (Full SFT) for each … view at source ↗
Figure 5
Figure 5. MuSiQue accuracy (%) vs. training epoch (Full SFT) for each … view at source ↗
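
Figure 1 above names the stages that turn a target corpus into the reflection QA data used to train the memory model. The sketch below is a hypothetical rendering of those stage names as prompted calls to a frozen generator model; every prompt, the helper structure, and the tab-separated output format are assumptions, not the paper's implementation.

    # Hypothetical sketch of the Figure 1 training-data pipeline. Stage names come
    # from the caption; prompts and the QA output format are assumed for illustration.
    def build_reflection_qa(corpus: list[str], generator) -> list[dict]:
        facts = [generator.generate(f"Extract atomic facts:\n{doc}") for doc in corpus]
        consolidated = generator.generate("Consolidate overlapping facts:\n" + "\n".join(facts))
        verified = generator.generate(f"Verify and keep only well-supported facts:\n{consolidated}")
        entities = generator.generate(f"Surface the key entities and their aliases:\n{verified}")
        qa_text = generator.generate(
            "Write question-answer pairs, one per line as 'question<TAB>answer', "
            f"including questions that link facts across documents:\n{verified}\n{entities}"
        )
        # The resulting QA pairs are then used to fine-tune the dedicated memory model.
        return [{"question": q, "answer": a}
                for q, _, a in (line.partition("\t") for line in qa_text.splitlines())
                if a]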
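
The GPU-hour comparison in the Figure 2 caption follows directly from the per-run cost it states; a quick check of that arithmetic, using only the caption's own numbers:

    # Check of the Figure 2 cost arithmetic (numbers are the caption's, nothing new).
    X = 24                     # GPU-hours already spent training on the first corpus
    Y = 24                     # GPU-hours for one SFT run on the new corpus
    merging = X + Y            # cumulative cost when only the new corpus is trained on
    retraining = X + (X + Y)   # cumulative cost when SFT is re-run on the union
    print(merging, retraining, round(1 - merging / retraining, 2))  # 48 72 0.33, i.e. ~33% saving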
read the original abstract

Large language models (LLMs) achieve strong performance across a wide range of tasks, but remain frozen after pretraining until subsequent updates. Many real-world applications require timely, domain-specific information, motivating the need for efficient mechanisms to incorporate new knowledge. In this paper, we introduce MeMo (Memory as a Model), a modular framework that encodes new knowledge into a dedicated memory model while keeping the LLM parameters unchanged. Compared to existing methods, MeMo offers several advantages: (a) it captures complex cross-document relationships, (b) it is robust to retrieval noise, (c) it avoids catastrophic forgetting in the LLM, (d) it does not require access to the LLM's weights or output logits, enabling plug-and-play integration with both open and proprietary closed-source LLMs, and (e) its retrieval cost is independent of corpus size at inference time. Our experimental results on three benchmarks, BrowseComp-Plus, NarrativeQA, and MuSiQue, show that MeMo achieves strong performance compared to existing methods across diverse settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces MeMo, a modular framework that trains a dedicated memory model to encode and retrieve new knowledge for LLMs while keeping the base LLM parameters frozen. It claims five concrete advantages over prior methods—capturing complex cross-document relationships, robustness to retrieval noise, avoidance of catastrophic forgetting, plug-and-play compatibility with both open and closed-source LLMs without requiring weight or logit access, and inference-time retrieval cost independent of corpus size—and reports experimental results on BrowseComp-Plus, NarrativeQA, and MuSiQue showing competitive or superior performance.

Significance. If the empirical results and ablations hold under scrutiny, the separation of a trainable memory model from the frozen LLM offers a practical route to timely knowledge updates that works for proprietary models and avoids forgetting; this could influence future work on modular retrieval-augmented systems.

major comments (1)
  1. [§4.3] Table 3: The reported gains on MuSiQue (e.g., +4.8 F1 over the strongest baseline) are presented without standard deviations across runs or statistical significance tests; given the small absolute margins typical on this benchmark, it is unclear whether the advantage is robust or could be explained by hyperparameter tuning differences.
minor comments (2)
  1. [Abstract] The abstract asserts 'strong performance' and lists five advantages but supplies no quantitative metrics or baseline names; including one or two key numbers would make the claim immediately verifiable.
  2. [§3.2] The training objective for the memory model is described at a high level; an explicit loss equation (e.g., contrastive or reconstruction term) would clarify how cross-document relationships are explicitly encouraged.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and positive overall assessment of MeMo. We address the single major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [§4.3] Table 3: The reported gains on MuSiQue (e.g., +4.8 F1 over the strongest baseline) are presented without standard deviations across runs or statistical significance tests; given the small absolute margins typical on this benchmark, it is unclear whether the advantage is robust or could be explained by hyperparameter tuning differences.

    Authors: We acknowledge the referee's valid concern regarding the presentation of MuSiQue results in Table 3. The original experiments reported single-run performance due to computational resource limits. In the revised manuscript we will rerun the MuSiQue experiments across multiple random seeds, report mean and standard deviation, and include statistical significance tests (paired t-tests against the strongest baseline) to establish that the +4.8 F1 gain is robust and not explained by hyperparameter differences. revision: yes
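
As a pointer to what the promised significance test could look like in practice, here is a minimal sketch of a paired t-test over matched F1 scores using scipy; the arrays hold placeholder values for illustration only, not results from the paper.

    # Minimal sketch of the paired t-test the rebuttal proposes (placeholder scores,
    # not data from the paper): pair MeMo and the strongest baseline on the same
    # MuSiQue questions (or the same seeds) and test whether the mean gain is real.
    import numpy as np
    from scipy import stats

    memo_f1     = np.array([0.62, 0.55, 0.71, 0.48, 0.66])   # placeholder per-question F1
    baseline_f1 = np.array([0.58, 0.51, 0.69, 0.44, 0.60])   # placeholder per-question F1

    t_stat, p_value = stats.ttest_rel(memo_f1, baseline_f1)  # paired t-test
    print(f"mean gain = {(memo_f1 - baseline_f1).mean():.3f} F1, t = {t_stat:.2f}, p = {p_value:.4f}")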

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces MeMo as a modular architectural framework that encodes new knowledge into a separate memory model while freezing the LLM parameters. All listed advantages (cross-document capture, robustness to noise, no forgetting, plug-and-play compatibility, constant retrieval cost) are presented as direct consequences of this separation rather than quantities derived from fitted outputs or self-referential definitions. Validation rests on empirical results from three external benchmarks (BrowseComp-Plus, NarrativeQA, MuSiQue) with comparisons to existing methods; no equations, training objectives, or self-citations reduce the central claims to inputs by construction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the effectiveness of an independently trained memory model whose internal design and training procedure are not detailed in the abstract; several domain assumptions about knowledge encoding and retrieval are required.

free parameters (1)
  • memory model architecture and training hyperparameters
    Size, layers, and optimization settings of the dedicated memory model are necessarily chosen or tuned but are not reported in the abstract.
axioms (1)
  • domain assumption: A dedicated memory model can capture complex cross-document relationships without access to LLM weights or logits
    This premise underpins the plug-and-play and robustness claims.
invented entities (1)
  • MeMo memory model · no independent evidence
    purpose: To store and retrieve new knowledge separately from the frozen LLM
    A new modular component introduced by the framework.

pith-pipeline@v0.9.0 · 5511 in / 1327 out tokens · 59534 ms · 2026-05-15T03:13:50.182875+00:00 · methodology

discussion (0)

