End-to-End Context Compression at Scale
Pith reviewed 2026-06-27 16:34 UTC · model grok-4.3
The pith
Encoder-decoder models called LCLMs compress long contexts at ratios of 4x to 16x while improving the trade-off among task performance, speed, and memory over KV cache methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Through architecture search followed by large-scale continual pre-training, the authors produce Latent Context Language Models (LCLMs) that map an input sequence to a shorter latent sequence at a fixed compression ratio. A decoder then performs general tasks from those latents alone. The paper reports that LCLMs improve on previous KV-cache and encoder-decoder compressors across the joint metrics of task accuracy, compression speed, and peak memory.
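The "joint metrics" claim is a Pareto-dominance claim, and a minimal sketch of that check may help. It assumes each system is summarized by three numbers (accuracy, compression throughput, peak memory). The system names and values are hypothetical placeholders, not results from the paper.

```python
from typing import NamedTuple

class Point(NamedTuple):
    name: str
    accuracy: float      # higher is better
    throughput: float    # higher is better (e.g. tokens compressed per second)
    peak_memory: float   # lower is better (e.g. GB)

def dominates(a: Point, b: Point) -> bool:
    """True if a is at least as good as b on every axis and strictly better on one."""
    at_least = (a.accuracy >= b.accuracy and a.throughput >= b.throughput
                and a.peak_memory <= b.peak_memory)
    strictly = (a.accuracy > b.accuracy or a.throughput > b.throughput
                or a.peak_memory < b.peak_memory)
    return at_least and strictly

def pareto_front(points: list[Point]) -> list[Point]:
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical placeholder values, for illustration only.
systems = [
    Point("compressor_a", 0.70, 100.0, 10.0),
    Point("compressor_b", 0.65, 80.0, 12.0),   # worse than compressor_a on all axes
    Point("compressor_c", 0.75, 40.0, 20.0),   # more accurate, but slower and heavier
]
front_names = [p.name for p in pareto_front(systems)]
print(front_names)
```

A method "improves the frontier" in this sense when adding it to the set removes at least one previously undominated point or adds a new undominated one.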
What carries the argument
The argument rests on the encoder-decoder compressor, which converts a long token sequence into a shorter sequence of latent embeddings that the decoder consumes in place of the original tokens or their KV cache.
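To make the length contract concrete: at ratio r, N input positions become roughly N / r latent vectors. The sketch below uses mean pooling over fixed windows as a stand-in for the learned encoder. It is not the paper's architecture, and `compress_mean_pool` is a hypothetical name.

```python
import math

Vector = list[float]

def compress_mean_pool(embeddings: list[Vector], ratio: int) -> list[Vector]:
    """Average each window of `ratio` consecutive vectors into one latent.

    Stand-in for a learned encoder: only the length contract
    (N inputs -> ceil(N / ratio) latents) mirrors the description above.
    """
    if ratio < 1:
        raise ValueError("ratio must be >= 1")
    latents = []
    for start in range(0, len(embeddings), ratio):
        window = embeddings[start:start + ratio]
        dim = len(window[0])
        latents.append([sum(v[d] for v in window) / len(window) for d in range(dim)])
    return latents

# Ten toy 2-dimensional "token embeddings".
tokens = [[float(i), float(2 * i)] for i in range(10)]
latents = compress_mean_pool(tokens, ratio=4)
assert len(latents) == math.ceil(len(tokens) / 4)  # last window is partial
```

The decoder would then attend over `latents` instead of `tokens`; how the real encoder forms each latent is a learned function rather than an average.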
If this is right
- Long-context inference becomes practical in engines that cannot host the full KV cache or cannot afford the runtime cost of prior compressors.
- Agents can use the compressed representation as a default view and request expansion of only the relevant segments on demand.
- The same architecture works at 4x, 8x, and 16x compression without requiring the original prompt to fit inside the decoder's context window.
- General-task performance remains competitive with uncompressed models while memory and speed improve.
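The skim-and-expand pattern in the second bullet can be sketched as a small data structure that holds one compressed view per segment and swaps in the raw text only for segments flagged as relevant. Everything here is hypothetical: `Segment`, `build_view`, `expand`, and the string placeholders standing in for latent embeddings. The paper's actual agent interface is not described in this summary.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    raw_text: str      # original text, kept in a store outside the decoder context
    compressed: str    # placeholder for the segment's latent embeddings
    expanded: bool = False

def build_view(segments: list[Segment]) -> list[str]:
    """Return what the decoder would see: latents by default, raw text if expanded."""
    return [s.raw_text if s.expanded else s.compressed for s in segments]

def expand(segments: list[Segment], indices: list[int]) -> None:
    """Mark the chosen segments so the next view shows them uncompressed."""
    for i in indices:
        segments[i].expanded = True

segments = [
    Segment("full text of section 1", "<latents:1>"),
    Segment("full text of section 2", "<latents:2>"),
    Segment("full text of section 3", "<latents:3>"),
]
skim_view = build_view(segments)   # everything compressed
expand(segments, [1])              # the agent decides section 2 is relevant
focused_view = build_view(segments)
```

The design choice this illustrates is that the uncompressed text never needs to fit in the decoder window all at once; only the expanded segments do.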
Where Pith is reading between the lines
- If the latent space proves stable across domains, the same compressor could be reused as a fixed front-end for many different decoder models rather than retraining per decoder.
- Production systems might shift from ever-larger native context windows toward on-demand expansion from a compressed store.
- The training recipe of architecture search plus continual pre-training on hundreds of billions of tokens could be applied to other compression ratios or to multimodal inputs.
Load-bearing premise
The encoder can be trained so that its latent embeddings contain enough recoverable information for the decoder to solve downstream tasks without ever seeing the uncompressed tokens, and this property continues to hold on data outside the pre-training distribution.
What would settle it
A controlled test in which LCLM accuracy on a held-out long-context task falls below both an uncompressed baseline and the best prior KV-cache compressor by more than a few percentage points at any of the three ratios.
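Stated as a predicate, the test reads roughly as follows. The 3-point default margin is an assumption standing in for "a few percentage points", since no number is fixed above. Accuracies are in percent, and `claim_is_undermined` is a hypothetical name.

```python
def claim_is_undermined(
    lclm_acc: dict[int, float],
    uncompressed_acc: float,
    best_kv_acc: dict[int, float],
    margin_pts: float = 3.0,  # assumed reading of "a few percentage points"
) -> bool:
    """True if, at any ratio, LCLM trails BOTH baselines by more than the margin."""
    for ratio, acc in lclm_acc.items():
        below_uncompressed = uncompressed_acc - acc > margin_pts
        below_kv = best_kv_acc[ratio] - acc > margin_pts
        if below_uncompressed and below_kv:
            return True
    return False

# Hypothetical numbers, keyed by compression ratio (4, 8, 16).
kv = {4: 69.0, 8: 67.0, 16: 65.0}
fails_at_16 = claim_is_undermined({4: 70.0, 8: 68.0, 16: 60.0}, 72.0, kv)
holds = claim_is_undermined({4: 70.0, 8: 68.0, 16: 64.0}, 72.0, kv)
```

Note that trailing only the uncompressed baseline does not trigger the condition; the test requires falling behind the best prior KV-cache compressor as well.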
Original abstract
Long-context language model inference is bottlenecked by memory, as the KV cache grows with context length. Recent techniques to compress the KV cache fall short: they either degrade model quality substantially or require considerable time and compute to compress a single long prompt. Furthermore, many methods require the input to fit within the target model's context window, and are generally incompatible with modern production inference engines. Encoder-decoder compressors, which map a long token sequence to a shorter sequence of latent embeddings consumed by a decoder, are an appealing alternative in principle. However, existing approaches are not competitive with KV cache compression on the accuracy-efficiency frontier. In this work, we revisit encoder-decoder compression and close this gap. We first perform an architecture search, pre-training many variants from scratch to determine how best to design and train encoder-decoder compressors. Guided by our findings, we continually pre-train a family of 0.6B-encoder, 4B-decoder models on over 350B tokens each, at compression ratios of 1:4, 1:8, and 1:16. We introduce Latent Context Language Models (LCLMs), a family of compressors that improve the Pareto frontier across general-task performance, compression speed, and peak memory usage. We demonstrate that LCLMs serve as efficient backbones for long-horizon agents, letting the agent skim through a compressed long context and adaptively expand relevant segments on demand.
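The memory bottleneck named in the abstract is simple arithmetic: a standard transformer KV cache stores one key and one value vector per layer, per KV head, per position, so it grows linearly with context length. The sketch below compares a full-context cache with a cache over the compressed latent sequence at the three ratios. The model dimensions and context length are hypothetical placeholders, not the configuration of the paper's 4B decoder. The comparison also assumes the decoder caches one position per latent and ignores the encoder's own memory.

```python
import math

def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Keys plus values for every layer, KV head, and position (factor 2 = K and V)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical decoder shape and context length, for illustration only.
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)
context_len = 131_072

full = kv_cache_bytes(context_len, **cfg)
sizes = {}
for ratio in (4, 8, 16):
    latent_len = math.ceil(context_len / ratio)
    sizes[ratio] = kv_cache_bytes(latent_len, **cfg)
    print(f"1:{ratio}  {latent_len} latents  {sizes[ratio] / 2**30:.2f} GiB "
          f"(full context: {full / 2**30:.2f} GiB)")
```

Under these assumptions the decoder-side cache shrinks by the compression ratio itself, which is why peak memory is one of the three axes the paper reports.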
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that encoder-decoder context compressors, after architecture search and continual pre-training of 0.6B-encoder / 4B-decoder pairs on over 350B tokens each at 1:4, 1:8, and 1:16 ratios, yield Latent Context Language Models (LCLMs) that improve the Pareto frontier versus KV-cache baselines on general-task performance, compression speed, and peak memory. The models are further shown to support long-horizon agents that skim compressed contexts and expand segments on demand.
Significance. If the empirical gains hold with rigorous controls, the result would be significant: it supplies a scalable, production-compatible end-to-end alternative to KV-cache compression that does not require the full prompt to fit in the decoder window and demonstrates downstream utility for agentic workflows.
major comments (2)
- [Experimental results / Evaluation protocols] The central Pareto-frontier claim rests on the assertion that latents from the 0.6B encoder allow the 4B decoder to recover sufficient task-relevant information without the original tokens. The manuscript must supply quantitative evidence (accuracy deltas, baselines, error bars, and held-out task protocols) that this recovery generalizes beyond the 350B-token pre-training distribution, because this skeptical concern directly tests the load-bearing premise behind the reported gains.
- [Continual pre-training and agent experiments] The architecture-search and continual-pre-training sections need to clarify whether the reported speed/memory advantages are measured under identical inference-engine constraints and whether any degradation on long-horizon agent tasks is statistically significant relative to uncompressed or KV-cache baselines.
minor comments (2)
- [Introduction / Methods] Notation for compression ratios (1:4 etc.) and the precise definition of 'latent embeddings' consumed by the decoder should be stated explicitly in the methods to avoid ambiguity.
- [Figures and tables] Figure captions and tables reporting Pareto curves should include the exact number of runs, random seeds, and confidence intervals.
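A minimal sketch of the per-seed reporting requested in the last comment, using only the Python standard library. The scores are hypothetical, and the interval uses a normal approximation; with only a handful of seeds a Student-t interval would be wider and more appropriate.

```python
from statistics import NormalDist, mean, stdev

def mean_with_ci(scores: list[float], confidence: float = 0.95) -> tuple[float, float]:
    """Return (mean, half-width) of a normal-approximation confidence interval."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * stdev(scores) / len(scores) ** 0.5
    return mean(scores), half_width

# Hypothetical accuracies from five seeds of one configuration.
runs = [71.2, 70.8, 71.9, 70.5, 71.4]
m, hw = mean_with_ci(runs)
print(f"{m:.2f} ± {hw:.2f} (n={len(runs)} seeds)")
```

Reporting the seed count next to the interval lets a reader judge how much weight the interval can bear.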
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. We address each major comment below with clarifications drawn from the manuscript and commit to revisions that strengthen the presentation of our results without altering the core claims.
- Referee: [Experimental results / Evaluation protocols] The central Pareto-frontier claim rests on the assertion that latents from the 0.6B encoder allow the 4B decoder to recover sufficient task-relevant information without the original tokens. The manuscript must supply quantitative evidence (accuracy deltas, baselines, error bars, and held-out task protocols) that this recovery generalizes beyond the 350B-token pre-training distribution, because this skeptical concern directly tests the load-bearing premise behind the reported gains.
  Authors: The manuscript already reports accuracy on a suite of general tasks (including those disjoint from the pre-training corpus) with explicit deltas versus KV-cache and other compression baselines at matched compression ratios. To directly address the request for rigor, we will expand the evaluation section to include error bars over multiple random seeds, a table of held-out task protocols with explicit train/test splits, and confirmation that no task overlap exists with the 350B-token pre-training data. These additions will be incorporated in the revised manuscript.
  Revision: yes
- Referee: [Continual pre-training and agent experiments] The architecture-search and continual-pre-training sections need to clarify whether the reported speed/memory advantages are measured under identical inference-engine constraints and whether any degradation on long-horizon agent tasks is statistically significant relative to uncompressed or KV-cache baselines.
  Authors: The speed and memory measurements were obtained under identical inference-engine settings (same batching, same hardware, and same engine configuration) for LCLM and KV-cache baselines; we will add an explicit paragraph in the experimental setup subsection to document this. For the agent experiments, the manuscript reports mean performance across long-horizon tasks but does not include formal statistical tests. We will add confidence intervals and note whether observed differences reach significance; if the existing run data permit, we will include these in a revised table. Where data are insufficient for new tests, we will state the limitation clearly.
  Revision: partial
Circularity Check
No circularity: purely empirical architecture search and pre-training
The paper contains no equations, derivations, or mathematical claims. All results follow from training 0.6B-encoder/4B-decoder models on over 350B tokens each after architecture search and evaluating them on downstream tasks. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear. The central claims rest on external benchmarks and held-out evaluation rather than any reduction to the paper's own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Transformer encoder and decoder blocks can be trained end-to-end to map long sequences to shorter latent sequences while preserving task-relevant information.
invented entities (1)
- Latent Context Language Models (LCLMs): no independent evidence