arxiv: 2606.09659 · v1 · pith:ZSPQKUWInew · submitted 2026-06-08 · 💻 cs.CL · cs.AI · cs.LG

End-to-End Context Compression at Scale

Pith reviewed 2026-06-27 16:34 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords context compression · encoder-decoder · long-context language models · KV cache · latent embeddings · Pareto frontier · agent backbones · continual pre-training

The pith

Encoder-decoder models called LCLMs compress long contexts at ratios of 4x to 16x while improving the trade-off among task performance, speed, and memory over KV cache methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper first runs an architecture search, pre-training many encoder-decoder variants from scratch, and then continually pre-trains a family of 0.6B-encoder, 4B-decoder models on over 350B tokens each to map long token sequences into much shorter sequences of latent embeddings. The decoder consumes these embeddings directly for downstream tasks, so the full original KV cache never has to be stored or processed. The resulting LCLMs are reported to improve on prior compression techniques across the combined frontier of task accuracy, compression speed, and peak memory at the three target ratios (1:4, 1:8, and 1:16). The same models also serve as backbones for agents that skim the compressed representation and expand only the segments they need. If the approach holds, long-context inference becomes feasible in production engines that previously could not accommodate either the memory cost or the extra compute of existing compressors.

Core claim

By architecture search followed by large-scale continual pre-training, the authors produce Latent Context Language Models that map an input sequence to a shorter latent sequence at fixed compression ratios and allow a decoder to perform general tasks from those latents alone, outperforming previous KV-cache and encoder-decoder compressors on the joint metrics of task accuracy, compression throughput, and memory footprint.
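The "joint metrics" framing in the core claim can be made concrete with a small Pareto-frontier computation. The following is a minimal sketch, not the paper's evaluation code: the `Point` type, the method names, and every number in the example are placeholders invented for illustration, and the only assumption is that higher accuracy, higher compression throughput, and lower peak memory are each better.

```python
from typing import NamedTuple


class Point(NamedTuple):
    name: str
    accuracy: float     # higher is better
    throughput: float   # higher is better
    peak_memory: float  # lower is better


def dominates(a: Point, b: Point) -> bool:
    """True if `a` is no worse than `b` on every metric and strictly better on one."""
    no_worse = (
        a.accuracy >= b.accuracy
        and a.throughput >= b.throughput
        and a.peak_memory <= b.peak_memory
    )
    strictly_better = (
        a.accuracy > b.accuracy
        or a.throughput > b.throughput
        or a.peak_memory < b.peak_memory
    )
    return no_worse and strictly_better


def pareto_frontier(points: list[Point]) -> list[Point]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Placeholder values only; these are NOT results from the paper.
example = [
    Point("method_a", accuracy=0.70, throughput=10.0, peak_memory=20.0),
    Point("method_b", accuracy=0.65, throughput=8.0, peak_memory=25.0),
    Point("method_c", accuracy=0.60, throughput=30.0, peak_memory=10.0),
]
frontier_names = [p.name for p in pareto_frontier(example)]
```

In this toy example `method_b` drops out because `method_a` beats it on all three axes, while `method_c` stays on the frontier despite lower accuracy because it is faster and lighter. A claim of "improving the frontier" means the new method is not dominated and pushes at least one existing point off.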

What carries the argument

The encoder-decoder compressor that converts a long token sequence into a shorter sequence of latent embeddings which the decoder consumes in place of the original tokens or KV cache.
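To give a feel for what a fixed compression ratio means for sequence length, here is a minimal sketch. It assumes a 1:r compressor emits one latent per run of r tokens, with a shorter final run still producing one latent; the page does not say how the paper handles remainders. The `mean_pool_compress` function is a toy stand-in chosen for simplicity, not the trained encoder described in the paper.

```python
import math


def latent_length(num_tokens: int, ratio: int) -> int:
    """Latent count for a 1:ratio compressor (assumes a partial last chunk is kept)."""
    if num_tokens < 0 or ratio < 1:
        raise ValueError("num_tokens must be >= 0 and ratio must be >= 1")
    return math.ceil(num_tokens / ratio)


def mean_pool_compress(embeddings: list[list[float]], ratio: int) -> list[list[float]]:
    """Toy stand-in for the encoder: average each run of `ratio` token vectors."""
    latents = []
    for start in range(0, len(embeddings), ratio):
        chunk = embeddings[start:start + ratio]
        dim = len(chunk[0])
        latents.append([sum(vec[d] for vec in chunk) / len(chunk) for d in range(dim)])
    return latents


# Sequence lengths the decoder would see for a hypothetical 128,000-token prompt.
lengths = {ratio: latent_length(128_000, ratio) for ratio in (4, 8, 16)}
```

For the hypothetical 128,000-token prompt this gives 32,000, 16,000, and 8,000 latents at 1:4, 1:8, and 1:16 respectively, which is the sequence the decoder attends over instead of the original tokens.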

If this is right

  • Long-context inference becomes practical in engines that cannot host the full KV cache or cannot afford the runtime cost of prior compressors.
  • Agents can use the compressed representation as a default view and request expansion of only the relevant segments on demand.
  • The same architecture works at 4x, 8x, and 16x compression without requiring the original prompt to fit inside the decoder's context window.
  • General-task performance remains competitive with uncompressed models while memory and speed improve.
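The skim-then-expand behaviour in the second bullet can be sketched as a small control loop. Everything here is hypothetical: `Segment`, `CompressedContext`, `skim_and_expand`, and the relevance callback are illustrative names, and short strings stand in for the latent embeddings a real system would hold. The page does not describe the paper's actual agent interface.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Segment:
    text: str          # original content, kept in a cheap external store
    summary: str       # stand-in for the compressed latent view of this segment
    expanded: bool = False


@dataclass
class CompressedContext:
    segments: list[Segment]

    def view(self) -> list[str]:
        """What the agent currently sees: compressed by default, full text if expanded."""
        return [s.text if s.expanded else s.summary for s in self.segments]

    def expand(self, index: int) -> None:
        """Swap one segment from its compressed view to its original content."""
        self.segments[index].expanded = True


def skim_and_expand(
    ctx: CompressedContext,
    is_relevant: Callable[[str], bool],
    budget: int,
) -> list[int]:
    """Scan compressed views in order and expand at most `budget` relevant segments."""
    expanded: list[int] = []
    for i, seg in enumerate(ctx.segments):
        if len(expanded) >= budget:
            break
        if is_relevant(seg.summary):
            ctx.expand(i)
            expanded.append(i)
    return expanded
```

The design choice worth noting is the explicit `budget`: expansion gives back the memory that compression saved, so an agent of this kind needs some cap on how many segments it restores at once.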

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the latent space proves stable across domains, the same compressor could be reused as a fixed front-end for many different decoder models rather than retraining per decoder.
  • Production systems might shift from ever-larger native context windows toward on-demand expansion from a compressed store.
  • The training recipe of architecture search plus continual pre-training on hundreds of billions of tokens could be applied to other compression ratios or to multimodal inputs.

Load-bearing premise

The encoder can be trained so that its latent embeddings contain enough recoverable information for the decoder to solve downstream tasks without ever seeing the uncompressed tokens, and this property continues to hold on data outside the pre-training distribution.

What would settle it

A controlled test in which LCLM accuracy on a held-out long-context task falls below both an uncompressed baseline and the best prior KV-cache compressor by more than a few percentage points at any of the three ratios.
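The settling condition above has a specific logical shape: LCLM must trail both baselines, by more than some margin, at any one ratio. A minimal sketch of that check follows. The function name, the dictionary layout, and the default `margin_pts=3.0` are assumptions, since the text only says "a few percentage points".

```python
def claim_is_refuted(
    lclm_acc_by_ratio: dict[int, float],
    uncompressed_acc: float,
    best_kv_acc_by_ratio: dict[int, float],
    margin_pts: float = 3.0,  # hypothetical reading of "a few percentage points"
) -> bool:
    """True if, at any ratio, LCLM trails BOTH baselines by more than the margin.

    Accuracies are in percentage points on the same held-out task.
    """
    for ratio, lclm_acc in lclm_acc_by_ratio.items():
        below_uncompressed = uncompressed_acc - lclm_acc > margin_pts
        below_best_kv = best_kv_acc_by_ratio[ratio] - lclm_acc > margin_pts
        if below_uncompressed and below_best_kv:
            return True
    return False
```

Trailing only the uncompressed model is not enough to trigger the condition, because some loss relative to the uncompressed model is the expected price of compression; the test bites only when a prior KV-cache compressor also does clearly better.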

Original abstract

Long-context language model inference is bottlenecked by memory, as the KV cache grows with context length. Recent techniques to compress the KV cache fall short: they either degrade model quality substantially or require considerable time and compute to compress a single long prompt. Furthermore, many methods require the input to fit within the target model's context window, and are generally incompatible with modern production inference engines. Encoder-decoder compressors, which map a long token sequence to a shorter sequence of latent embeddings consumed by a decoder, are an appealing alternative in principle. However, existing approaches are not competitive with KV cache compression on the accuracy-efficiency frontier. In this work, we revisit encoder-decoder compression and close this gap. We first perform an architecture search, pre-training many variants from scratch to determine how best to design and train encoder-decoder compressors. Guided by our findings, we continually pre-train a family of 0.6B-encoder, 4B-decoder models on over 350B tokens each, at compression ratios of 1:4, 1:8, and 1:16. We introduce Latent Context Language Models (LCLMs), a family of compressors that improve the Pareto frontier across general-task performance, compression speed, and peak memory usage. We demonstrate that LCLMs serve as efficient backbones for long-horizon agents, letting the agent skim through a compressed long context and adaptively expand relevant segments on demand.
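The abstract's opening point, that the KV cache grows with context length, follows from the standard size formula for a decoder-only transformer: two tensors (keys and values) per layer, each holding one vector per KV head per position. The sketch below applies that formula. The model dimensions in `HYPOTHETICAL_DECODER` are made up for illustration and are not the paper's 4B decoder configuration, and the second calculation assumes the decoder keeps one KV entry per latent position.

```python
def kv_cache_bytes(
    seq_len: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # 2 bytes per value for fp16/bf16
) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # The leading 2 accounts for storing both keys and values.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical dimensions, NOT taken from the paper.
HYPOTHETICAL_DECODER = dict(num_layers=36, num_kv_heads=8, head_dim=128)

prompt_tokens = 128_000
full_cache = kv_cache_bytes(prompt_tokens, **HYPOTHETICAL_DECODER)

# Assumption: after 1:16 compression the decoder caches one entry per latent.
compressed_cache = kv_cache_bytes(prompt_tokens // 16, **HYPOTHETICAL_DECODER)

print(f"full:       {full_cache / 2**30:.2f} GiB")
print(f"compressed: {compressed_cache / 2**30:.2f} GiB")
```

Because the formula is linear in `seq_len`, a 1:16 reduction in positions gives a 1:16 reduction in decoder cache under these assumptions. It does not account for the encoder's own memory during compression, which is part of what the paper's peak-memory metric is meant to capture.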

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that encoder-decoder context compressors, after architecture search and continual pre-training of 0.6B-encoder / 4B-decoder pairs on over 350B tokens each at 1:4, 1:8, and 1:16 ratios, yield Latent Context Language Models (LCLMs) that improve the Pareto frontier versus KV-cache baselines on general-task performance, compression speed, and peak memory; the models are further shown to support long-horizon agents that skim compressed contexts and expand segments on demand.

Significance. If the empirical gains hold with rigorous controls, the result would be significant: it supplies a scalable, production-compatible end-to-end alternative to KV-cache compression that does not require the full prompt to fit in the decoder window and demonstrates downstream utility for agentic workflows.

major comments (2)
  1. [Experimental results / Evaluation protocols] The central Pareto-frontier claim rests on the assertion that latents from the 0.6B encoder allow the 4B decoder to recover sufficient task-relevant information without the original tokens; the manuscript must supply quantitative evidence (accuracy deltas, baselines, error bars, and held-out task protocols) that this recovery generalizes beyond the 350B-token pre-training distribution, because this concern directly tests the load-bearing premise behind the reported gains.
  2. [Continual pre-training and agent experiments] The architecture-search and continual-pre-training sections need to clarify whether the reported speed/memory advantages are measured under identical inference-engine constraints and whether any degradation on long-horizon agent tasks is statistically significant relative to uncompressed or KV-cache baselines.
minor comments (2)
  1. [Introduction / Methods] Notation for compression ratios (1:4 etc.) and the precise definition of 'latent embeddings' consumed by the decoder should be stated explicitly in the methods to avoid ambiguity.
  2. [Figures and tables] Figure captions and tables reporting Pareto curves should include the exact number of runs, random seeds, and confidence intervals.
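The request for run counts, seeds, and confidence intervals amounts to a small amount of bookkeeping per reported number. The following is a minimal sketch of one way to do it, assuming one score per random seed and a normal-approximation interval; the function name and the choice of `z=1.96` are illustrative rather than anything the paper specifies.

```python
import math
import statistics


def mean_with_ci(scores: list[float], z: float = 1.96) -> tuple[float, float, float]:
    """Return (mean, lower, upper) for scores collected over independent seeds.

    Uses mean +/- z * standard error with the sample standard deviation.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.fmean(scores)
    sem = statistics.stdev(scores) / math.sqrt(n)
    return mean, mean - z * sem, mean + z * sem
```

With only a handful of seeds, the normal approximation understates uncertainty; a Student-t critical value for n - 1 degrees of freedom would give a wider and more honest interval, which is one reason the exact number of runs belongs in the caption.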

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment below with clarifications drawn from the manuscript and commit to revisions that strengthen the presentation of our results without altering the core claims.

Point-by-point responses
  1. Referee: [Experimental results / Evaluation protocols] The central Pareto-frontier claim rests on the assertion that latents from the 0.6B encoder allow the 4B decoder to recover sufficient task-relevant information without the original tokens; the manuscript must supply quantitative evidence (accuracy deltas, baselines, error bars, and held-out task protocols) that this recovery generalizes beyond the 350B-token pre-training distribution, because this concern directly tests the load-bearing premise behind the reported gains.

    Authors: The manuscript already reports accuracy on a suite of general tasks (including those disjoint from the pre-training corpus) with explicit deltas versus KV-cache and other compression baselines at matched compression ratios. To directly address the request for rigor, we will expand the evaluation section to include error bars over multiple random seeds, a table of held-out task protocols with explicit train/test splits, and confirmation that no task overlap exists with the 350B-token pre-training data. These additions will be incorporated in the revised manuscript. revision: yes

  2. Referee: [Continual pre-training and agent experiments] The architecture-search and continual-pre-training sections need to clarify whether the reported speed/memory advantages are measured under identical inference-engine constraints and whether any degradation on long-horizon agent tasks is statistically significant relative to uncompressed or KV-cache baselines.

    Authors: The speed and memory measurements were obtained under identical inference-engine settings (same batching, same hardware, and same engine configuration) for LCLM and KV-cache baselines; we will add an explicit paragraph in the experimental setup subsection to document this. For the agent experiments, the manuscript reports mean performance across long-horizon tasks but does not include formal statistical tests. We will add confidence intervals and note whether observed differences reach significance; if the existing run data permit, we will include these in a revised table. Where data are insufficient for new tests, we will state the limitation clearly. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical architecture search and pre-training

full rationale

The paper contains no equations, derivations, or mathematical claims. All results follow from training 0.6B-encoder/4B-decoder models on 350B tokens after architecture search and evaluating them on downstream tasks. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear. The central claims rest on external benchmarks and held-out evaluation rather than any reduction to the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The work relies on standard transformer training assumptions and empirical scaling rather than new theoretical primitives; the main added elements are the trained model weights themselves.

axioms (1)
  • domain assumption: Transformer encoder and decoder blocks can be trained end-to-end to map long sequences to shorter latent sequences while preserving task-relevant information.
    Invoked throughout the architecture search and pre-training description.
invented entities (1)
  • Latent Context Language Models (LCLMs) · no independent evidence
    purpose: Name for the proposed family of encoder-decoder compressors.
    New label for the trained models; no independent evidence beyond the training runs described.

Reference graph

Works this paper leans on

98 extracted references · 2 canonical work pages

  1. [1]

    Longhealth: A question answering benchmark with long clinical documents

    Lisa Adams, Felix Busch, Tianyu Han, Jean-Baptiste Excoffier, Matthieu Ortala, Alexander L \"o ser, Hugo JWL Aerts, Jakob Nikolas Kather, Daniel Truhn, and Keno Bressem. Longhealth: A question answering benchmark with long clinical documents. Journal of Healthcare Informatics Research, 9 0 (3): 0 280--296, 2025

  2. [2]

    gpt-oss-120b & gpt-oss-20b model card

    Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925, 2025

  3. [3]

    Nextcoder: Robust adaptation of code LM s to diverse code edits

    Tushar Aggarwal, Swayam Singh, Abhijeet Awasthi, Aditya Kanade, and Nagarajan Natarajan. Nextcoder: Robust adaptation of code LM s to diverse code edits. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=3B6fF1PxYD

  4. [4]

    Why does the effective context length of LLM s fall short? In The Thirteenth International Conference on Learning Representations, 2025

    Chenxin An, Jun Zhang, Ming Zhong, Lei Li, Shansan Gong, Yao Luo, Jingjing Xu, and Lingpeng Kong. Why does the effective context length of LLM s fall short? In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=eoln5WgrPx

  5. [5]

    Claude Code , 2025

    Anthropic . Claude Code , 2025. URL https://github.com/anthropics/claude-code

  6. [6]

    LongBench:

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. L ong B ench: A bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pa...

  7. [7]

    Nemotron 3 nano: Open, efficient mixture-of-experts hybrid mamba-transformer model for agentic reasoning

    Aaron Blakeman, Aaron Grattafiori, Aarti Basant, Abhibha Gupta, Abhinav Khattar, Adi Renduchintala, Aditya Vavre, Akanksha Shukla, Akhiad Bercovich, Aleksander Ficek, et al. Nemotron 3 nano: Open, efficient mixture-of-experts hybrid mamba-transformer model for agentic reasoning. arXiv preprint arXiv:2512.20848, 2025

  8. [8]

    Comodo: Cross-modal video-to-imu distillation for efficient egocentric human activity recognition

    Baiyu Chen, Wilson Wongso, Zechen Li, Yonchanok Khaokaew, Hao Xue, and Flora Salim. Comodo: Cross-modal video-to-imu distillation for efficient egocentric human activity recognition. arXiv preprint arXiv:2503.07259, 2025

  9. [9]

    Awesome-kv-cache-compression

    Longze Chen. Awesome-kv-cache-compression. GitHub repository, 2023. URL https://github.com/October2001/Awesome-KV-Cache-Compression

  10. [10]

    xrag: Extreme context compression for retrieval-augmented generation with one token

    Xin Cheng, Xun Wang, Xingxing Zhang, Tao Ge, Si-Qing Chen, Furu Wei, Huishuai Zhang, and Dongyan Zhao. xrag: Extreme context compression for retrieval-augmented generation with one token. Advances in Neural Information Processing Systems, 37: 0 109487--109516, 2024

  11. [11]

    Adapting language models to compress contexts

    Alexis Chevalier, Alexander Wettig, Anirudh Ajith, and Danqi Chen. Adapting language models to compress contexts. arXiv preprint arXiv:2305.14788, 2023

  12. [12]

    Conditional positional encodings for vision transformers

    Xiangxiang Chu, Zhi Tian, Bo Zhang, Xinlong Wang, and Chunhua Shen. Conditional positional encodings for vision transformers. arXiv preprint arXiv:2102.10882, 2021

  13. [13]

    Learning to compress prompt in natural language formats

    Yu-Neng Chuang, Tianwei Xing, Chia-Yuan Chang, Zirui Liu, Xun Chen, and Xia Hu. Learning to compress prompt in natural language formats. arXiv preprint arXiv:2402.18700, 2024

  14. [14]

    Training verifiers to solve math word problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  15. [15]

    Bruni, F

    Domenico Cotroneo, Giuseppe De Rosa, and Pietro Liguori. Pyresbugs: A dataset of residual python bugs for natural language-driven fault injection. In 2025 IEEE/ACM Second International Conference on AI Foundation Models and Software Engineering (Forge), pages 146--150, 2025. doi:10.1109/Forge66646.2025.00024

  16. [16]

    Pretraining context compressor for large language models with embedding-based memory

    Yuhong Dai, Jianxun Lian, Yitian Huang, Wei Zhang, Mingyang Zhou, Mingqi Wu, Xing Xie, and Hao Liao. Pretraining context compressor for large language models with embedding-based memory. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 28715--28732, 2025

  17. [17]

    Transformer-xl: Attentive language models beyond a fixed-length context

    Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. In Proceedings of the 57th annual meeting of the association for computational linguistics, pages 2978--2988, 2019

  18. [18]

    Flashattention-2: Faster attention with better parallelism and work partitioning

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023

  19. [19]

    Gemma 4: Open lightweight language models

    Google DeepMind. Gemma 4: Open lightweight language models. 2026. URL https://ai.google.dev/gemma

  20. [20]

    Deepseek-v4: Towards highly efficient million-token context intelligence

    DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence. 2026

  21. [21]

    Expected attention: Kv cache compression by estimating attention from future queries distribution

    Alessio Devoto, Maximilian Jeblick, and Simon J \'e gou. Expected attention: Kv cache compression by estimating attention from future queries distribution. arXiv preprint arXiv:2510.00636, 2025

  22. [22]

    Flex attention: A programming model for generating optimized attention kernels

    Juechu Dong, Boyuan Feng, Driss Guessous, Yanbo Liang, and Horace He. Flex attention: A programming model for generating optimized attention kernels. arXiv preprint arXiv:2412.05496, 2 0 (3): 0 4, 2024

  23. [23]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

  24. [24]

    Cartridges: Lightweight and general-purpose long context representations via self-study

    Sabri Eyuboglu, Ryan Ehrlich, Simran Arora, Neel Guha, Dylan Zinsley, Emily Liu, Will Tennien, Atri Rudra, James Zou, Azalia Mirhoseini, et al. Cartridges: Lightweight and general-purpose long context representations via self-study. arXiv preprint arXiv:2506.06266, 2025

  25. [25]

    Simple context compression: Mean-pooling and multi-ratio training

    Yair Feldman and Yoav Artzi. Simple context compression: Mean-pooling and multi-ratio training. arXiv preprint arXiv:2510.20797, 2025

  26. [26]

    How to train long-context language models (effectively)

    Tianyu Gao, Alexander Wettig, Howard Yen, and Danqi Chen. How to train long-context language models (effectively). In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7376--7399, 2025

  27. [27]

    In-context autoencoder for context compression in a large language model

    Tao Ge, Jing Hu, Lei Wang, Xun Wang, Si-Qing Chen, and Furu Wei. In-context autoencoder for context compression in a large language model. arXiv preprint arXiv:2307.06945, 2023

  28. [28]

    Mamba: Linear-time sequence modeling with selective state spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. In First conference on language modeling, 2024

  29. [29]

    Efficiently modeling long sequences with structured state spaces

    Albert Gu, Karan Goel, and Christopher R \'e . Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021

  30. [30]

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  31. [31]

    Why mean pooling works: Quantifying second-order collapse in text embeddings

    Tomomasa Hara, Hiroto Kurita, Masaaki Imaizumi, Kentaro Inui, and Sho Yokoi. Why mean pooling works: Quantifying second-order collapse in text embeddings. arXiv preprint arXiv:2604.27398, 2026

  32. [32]

    Scaling instruction-tuned llms to million-token contexts via hierarchical synthetic data generation

    Linda He, Jue Wang, Maurice Weber, Shang Zhu, Ben Athiwaratkun, and Ce Zhang. Scaling instruction-tuned llms to million-token contexts via hierarchical synthetic data generation. arXiv preprint arXiv:2504.12637, 2025

  33. [33]

    Gaussian error linear units (gelus)

    Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016

  34. [34]

    Kvquant: Towards 10 million context length llm inference with kv cache quantization

    Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun S Shao, Kurt Keutzer, and Amir Gholami. Kvquant: Towards 10 million context length llm inference with kv cache quantization. Advances in Neural Information Processing Systems, 37: 0 1270--1303, 2024

  35. [35]

    Ruler: What's the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024

    Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. Ruler: What's the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024

  36. [36]

    Llmlingua: Compressing prompts for accelerated inference of large language models

    Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Llmlingua: Compressing prompts for accelerated inference of large language models. arXiv preprint arXiv:2310.05736, 2023

  37. [37]

    Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression

    Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1658--1677, 2024

  38. [38]

    Pubmedqa: A dataset for biomedical research question answering

    Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. Pubmedqa: A dataset for biomedical research question answering. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 2567--2577, 2019

  39. [39]

    Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension

    Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017

  40. [40]

    Transformers are rnns: Fast autoregressive transformers with linear attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and Fran c ois Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pages 5156--5165. PMLR, 2020

  41. [41]

    Kvzip: Query-agnostic kv cache compression with context reconstruction

    Jang-Hyun Kim, Jinuk Kim, Sangwoo Kwon, Jae W Lee, Sangdoo Yun, and Hyun Oh Song. Kvzip: Query-agnostic kv cache compression with context reconstruction. arXiv preprint arXiv:2505.23416, 2025

  42. [42]

    Fast kvzip: Efficient and accurate llm inference with gated kv eviction

    Jang-Hyun Kim, Dongyoon Han, and Sangdoo Yun. Fast kvzip: Efficient and accurate llm inference with gated kv eviction. arXiv preprint arXiv:2601.17668, 2026

  43. [43]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles, pages 611--626, 2023

  44. [44]

    Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D

    Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and Hannaneh Hajishirzi...

  45. [45]

    Revisiting catastrophic forgetting in large language model tuning

    Hongyu Li, Liang Ding, Meng Fang, and Dacheng Tao. Revisiting catastrophic forgetting in large language model tuning. In Findings of the association for computational linguistics: EMNLP 2024, pages 4297--4308, 2024 a

  46. [46]

    Prefix-tuning: Optimizing continuous prompts for generation

    Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021

  47. [47]

    Compressing context to enhance inference efficiency of large language models

    Yucheng Li, Bo Dong, Frank Guerin, and Chenghua Lin. Compressing context to enhance inference efficiency of large language models. In Proceedings of the 2023 conference on empirical methods in natural language processing, pages 6342--6353, 2023

  48. [48]

    Scbench: A kv cache-centric analysis of long-context methods

    Yucheng Li, Huiqiang Jiang, Qianhui Wu, Xufang Luo, Surin Ahn, Chengruidong Zhang, Amir H Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, et al. Scbench: A kv cache-centric analysis of long-context methods. arXiv preprint arXiv:2412.10319, 2024 b

  49. [49]

    Snapkv: Llm knows what you are looking for before generation

    Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation. Advances in Neural Information Processing Systems, 37: 0 22947--22970, 2024 c

  50. [50]

    500xcompressor: Generalized prompt compression for large language models

    Zongqian Li, Yixuan Su, and Nigel Collier. 500xcompressor: Generalized prompt compression for large language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 25081--25091, 2025

  51. [51]

    E2llm: Encoder elongated large language models for long-context understanding and reasoning

    Zihan Liao, Jun Wang, Hang Yu, Lingxiao Wei, Jianguo Li, and Wei Zhang. E2llm: Encoder elongated large language models for long-context understanding and reasoning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 19212--19241, 2025

  52. [52]

    Refrag: Rethinking rag based decoding

    Xiaoqiang Lin, Aritra Ghosh, Bryan Kian Hsiang Low, Anshumali Shrivastava, and Vijai Mohan. Refrag: Rethinking rag based decoding. arXiv preprint arXiv:2509.01092, 2025

  53. [53]

    Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model

    Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024 a

  54. [54]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36: 0 34892--34916, 2023 a

  55. [55]

    Lost in the middle: How language models use long contexts

    Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the association for computational linguistics, 12: 0 157--173, 2024 b

  56. [56]

    Repobench: Benchmarking repository-level code auto-completion systems

    Tianyang Liu, Canwen Xu, and Julian McAuley. Repobench: Benchmarking repository-level code auto-completion systems. arXiv preprint arXiv:2306.03091, 2023 b

  57. [57]

    Rag-instruct: Boosting llms with diverse retrieval-augmented instructions, 2024 c

    Wanlong Liu, Junying Chen, Ke Ji, Li Zhou, Wenyu Chen, and Benyou Wang. Rag-instruct: Boosting llms with diverse retrieval-augmented instructions, 2024 c . URL https://arxiv.org/abs/2501.00353

  58. [58]

    Chatqa: Building gpt-4 level conversational qa models

    Zihan Liu, Wei Ping, Rajarshi Roy, Peng Xu, Chankyu Lee, Mohammad Shoeybi, and Bryan Catanzaro. Chatqa: Building gpt-4 level conversational qa models. CoRR, 2024 d

  59. [59]

    Starcoder 2 and the stack v2: The next generation, 2024

    Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Z...

  60. [60]

    An empirical study of catastrophic forgetting in large language models during continual fine-tuning

    Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning. IEEE Transactions on Audio, Speech and Language Processing, 2025

  61. [61]

    Learning to compress prompts with gist tokens

    Jesse Mu, Xiang Li, and Noah Goodman. Learning to compress prompts with gist tokens. Advances in Neural Information Processing Systems, 36: 0 19327--19352, 2023

  62. [62]

    Octopack: Instruction tuning code large language models

    Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro von Werra, and Shayne Longpre. Octopack: Instruction tuning code large language models. arXiv preprint arXiv:2308.07124, 2023

  63. [63]

    Nemotron-Post-Training-Dataset-v2 , aug 2025 a

    Dhruv Nathawani, Shuoyang Ding, Vitaly Lavrukhin, Igor Gitman, Somshubra Majumdar, Evelina Bakhturina, Boris Ginsburg, and Jane Polak Scowcroft. Nemotron-Post-Training-Dataset-v2 , aug 2025 a . URL https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2

  64. [64]

    Nemotron-Post-Training-Dataset-v1 , July 2025 b

    Dhruv Nathawani, Igor Gitman, Somshubra Majumdar, Evelina Bakhturina, Ameya Sunil Mahabaleshwarkar, , Jian Zhang, and Jane Polak Scowcroft. Nemotron-Post-Training-Dataset-v1 , July 2025 b . URL https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v1

  65. [65]

    NVIDIA: Aarti Basant, Abhijit Khairnar, Abhijit Paithankar, Abhinav Khattar, Adithya Renduchintala, Aditya Malte, Akhiad Bercovich, Akshay Hazare, Alejandra Rico, Aleksander Ficek, Alex Kondratenko, Alex Shaposhnikov, Alexander Bukharin, Ali Taghibakhshi, Amelia Barton, Ameya Sunil Mahabaleshwarkar, Amy Shen, Andrew Tao, Ann Guan, Anna Shors, Anubhav M...

  66. [66]

    Team Olmo, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, Jacob Morrison, Jake Poznanski, Kyle Lo, Luca Soldaini, Matt Jordan, Mayee Chen, Michael Noukhovitch, Nathan Lambert, Pete Walsh, Pradeep Dasigi, Robert Berry, Saumya Malik, Saurabh Shah, Scott Geng, Shan...

  67. [67]

    Introducing Codex, May 2025

    OpenAI. Introducing Codex, May 2025. URL https://openai.com/index/introducing-codex/. Accessed: 2026-01-09

  68. [68]

    Memgpt: Towards LLMs as operating systems

    Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. Memgpt: Towards LLMs as operating systems, 2023. URL https://arxiv.org/abs/2310.08560

  69. [69]

    Finewiki, 2025

    Guilherme Penedo. Finewiki, 2025. URL https://huggingface.co/datasets/HuggingFaceFW/finewiki. Source: Wikimedia Enterprise Snapshot API (https://api.enterprise.wikimedia.com/v2/snapshots). Text licensed under CC BY-SA 4.0 with attribution to Wikipedia contributors

  70. [70]

    Yarn: Efficient context window extension of large language models

    Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071, 2023

  71. [71]

    Arc-encoder: learning compressed text representations for large language models

    Hippolyte Pilchen, Edouard Grave, and Patrick Pérez. Arc-encoder: learning compressed text representations for large language models. arXiv preprint arXiv:2510.20535, 2025

  72. [72]

    Qwen3-VL

    Qwen Team. Qwen3-VL. Technical report, 2025. URL https://qwen.ai/blog?id=99f0335c4ad9ff6153e517418d48535ab6d8afef&from=research.latest-advancements-list

  73. [73]

    Compressive transformers for long-range sequence modelling

    Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, and Timothy P Lillicrap. Compressive transformers for long-range sequence modelling. arXiv preprint arXiv:1911.05507, 2019

  74. [74]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1--67, 2020

  75. [75]

    Roformer: Enhanced transformer with rotary position embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

  76. [76]

    Exploiting sparsity for long context inference: Million token contexts on commodity gpus

    Ryan Synk, Monte Hoover, John Kirchenbauer, Neel Jain, Alex Stein, Manli Shu, Josue Melendez Sanchez, Ramani Duraiswami, and Tom Goldstein. Exploiting sparsity for long context inference: Million token contexts on commodity gpus. arXiv preprint arXiv:2502.06766, 2025

  77. [77]

    Lloco: Learning long contexts offline

    Sijun Tan, Xiuyu Li, Shishir G Patil, Ziyang Wu, Tianjun Zhang, Kurt Keutzer, Joseph E Gonzalez, and Raluca Ada Popa. Lloco: Learning long contexts offline. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 17605--17621, 2024

  78. [78]

    Gmsa: Enhancing context compression via group merging and layer semantic alignment

    Jiwei Tang, Zhicheng Zhang, Shunlong Wu, Jingheng Ye, Lichen Bai, Zitai Wang, Tingwei Lu, Jiaqi Chen, Lin Hai, Hai-Tao Zheng, et al. Gmsa: Enhancing context compression via group merging and layer semantic alignment. arXiv preprint arXiv:2505.12215, 2025

  79. [79]

    Kimi linear: An expressive, efficient attention architecture

    Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, et al. Kimi linear: An expressive, efficient attention architecture. arXiv preprint arXiv:2510.26692, 2025

  80. [80]

    Cambrian-1: A fully open, vision-centric exploration of multimodal llms

    Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai C Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. Advances in Neural Information Processing Systems, 37:87310--87356, 2024
