pith. machine review for the scientific record.

arxiv: 2605.09100 · v2 · submitted 2026-05-09 · 💻 cs.CL

Recognition: no theorem link

GRC: Unifying Reasoning-Driven Generation, Retrieval and Compression

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 06:36 UTC · model grok-4.3

classification 💻 cs.CL
keywords unified LLM training · meta latent tokens · context compression · retrieval-augmented generation · latent memory-augmented generation · single forward pass · modular inference

The pith

A single training framework lets large language models handle generation, retrieval, and context compression together in one forward pass.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models today require separate training for generative tasks like text production and for embedding tasks used in retrieval, which multiplies costs and deployment work. Context compression adds another separate challenge, especially for systems that need long contexts or continual learning. The paper introduces a framework that brings these three capabilities into one model by using special meta latent tokens along with a combined tuning process that covers generation, representation, and compression at once. The resulting model completes all three objectives during a single forward pass yet still allows flexible, modular use at inference time. This setup cuts down on separate model deployments and improves how training data is used across the tasks.

Core claim

Through meta latent tokens and a unified generative, representative and compressive tuning approach, the GRC framework trains models to accomplish reasoning-driven generation, reasoning-enhanced text representation, and context compression in a single forward pass while keeping modular, LEGO-style flexibility during inference.

What carries the argument

Meta latent tokens paired with unified generative, representative, and compressive tuning, which let one model switch between the three behaviors inside the same forward computation.
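A minimal sketch of how meta latent tokens could let one forward pass serve all three objectives. The class, dimensions, pooling choice, and absence of causal masking below are illustrative assumptions (a PyTorch toy), not the paper's implementation:

import torch
import torch.nn as nn

class UnifiedGRCToy(nn.Module):
    """Toy unified model: text tokens plus learned meta latent tokens in one pass."""
    def __init__(self, vocab=32000, d=256, n_meta=8, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.meta = nn.Parameter(torch.randn(n_meta, d) * 0.02)   # meta latent tokens
        layer = nn.TransformerEncoderLayer(d, n_heads, 4 * d, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d, vocab, bias=False)

    def forward(self, input_ids):
        b, t = input_ids.shape
        x = self.embed(input_ids)                                  # (b, t, d)
        meta = self.meta.unsqueeze(0).expand(b, -1, -1)            # (b, n_meta, d)
        h = self.backbone(torch.cat([x, meta], dim=1))             # single forward pass
        text_h, meta_h = h[:, :t], h[:, t:]
        return {
            "gen_logits": self.lm_head(text_h),   # generative objective (next-token loss)
            "embedding": meta_h.mean(dim=1),      # representative objective (contrastive loss)
            "compressed": meta_h,                 # compressive objective: fixed-length summary
        }

out = UnifiedGRCToy()(torch.randint(0, 32000, (2, 16)))
print(out["gen_logits"].shape, out["embedding"].shape, out["compressed"].shape)

The point of the sketch is only the routing: the same hidden states from one pass are read three ways, which is what makes the "three objectives, one forward computation" claim possible.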

If this is right

  • Retrieval-augmented generation systems can be deployed with far less separate infrastructure because one model covers the needed capabilities.
  • Training data is used three times more efficiently since the same examples serve generation, retrieval, and compression objectives.
  • Text embedding shifts to a self-reason-latent style where the model reasons internally before producing the embedding vector.
  • Generation can draw on latent memory-augmented setups that treat compressed and internalized KV cache as updatable memory of constant length (see the sketch after this list).
  • Inference speed improves through the addition of hybrid paged attention that works with the compressed representations.
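
One way to read the latent-memory idea as code, reusing the toy model from the sketch above: each document is compressed into a constant-length latent block, and a bounded set of these blocks acts as the updatable memory that later generation would condition on. The slot count and the replacement rule are assumptions, not the paper's design:

import torch

def update_memory(memory_slots, new_summary, max_slots=4):
    """Keep at most max_slots constant-length summaries; drop the oldest first."""
    memory_slots = memory_slots + [new_summary.detach()]
    return memory_slots[-max_slots:]

model = UnifiedGRCToy()          # toy class from the sketch above
memory = []
for doc_ids in [torch.randint(0, 32000, (1, 64)) for _ in range(6)]:
    summary = model(doc_ids)["compressed"]     # (1, n_meta, d), independent of document length
    memory = update_memory(memory, summary)

# At generation time the bounded memory (not the raw documents) would be fed back
# to the model, e.g. as prefix states, keeping per-turn cost roughly constant.
prefix = torch.cat(memory, dim=1)              # (1, <= max_slots * n_meta, d)
print(prefix.shape)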

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same meta-token approach could be extended to other multi-task settings such as translation combined with summarization without new training runs for each pair.
  • Agentic systems that switch between planning, tool use, and memory management might become simpler if they inherit the single-pass modularity shown here.
  • Future scaling laws for unified models would need to account for the shared internal representations rather than assuming additive costs for each added capability.

Load-bearing premise

That adding meta latent tokens and unified tuning will let the model keep strong performance on all three tasks at once without forcing trade-offs or needing separate optimization paths.

What would settle it

A head-to-head test where the unified GRC model is run on standard generation, retrieval, and compression benchmarks and compared directly against three separate models each optimized for only one of those tasks.
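
In outline, that comparison is just one unified checkpoint and three specialists run through the same scoring functions. The task names and evaluate callables below are placeholders, not the paper's evaluation harness:

def compare(unified_model, specialist_models, benchmarks):
    """benchmarks maps a task name ('generation', 'retrieval', 'compression') to a scoring function."""
    results = {}
    for task, evaluate in benchmarks.items():
        results[task] = {
            "unified": evaluate(unified_model),
            "specialist": evaluate(specialist_models[task]),
        }
    return results

Per-task scoring functions could wrap standard harnesses (for example, retrieval metrics on BEIR or BRIGHT), so the only variable is whether one model or three is being scored.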

Figures

Figures reproduced from arXiv: 2605.09100 by Qiyu Wu, Yoshimasa Tsuruoka, Zhongtao Miao.

Figure 1: GRC training framework. One forward pass of a single model can fulfill three objectives.

Figure 2: Four diverse generation patterns: (1) regular generation; (2) self-reason-latent-based query embedding; (3) document embedding; and (4) latent memory-augmented generation.

Figure 3: Comparison of document KV cache storage size.

Figure 4: KV cache management with the proposed hybrid paged attention for speeding up inference.

Figure 5: NDCG@10 scores of GRC-1.7B with different max new tokens (128, 256, 512, 1024, 2048, …).

Figure 6: Reconstruction scores of GRC models on Wikipedia markdown document compression.

Figure 7: Comparison of latency (average consumed time in seconds per user query).

Figure 8: Naive inference implementation.
read the original abstract

Text embedding and generative tasks are usually trained separately based on large language models (LLMs) nowadays. This causes a large amount of training cost and deployment effort. Context compression is also a challenging and pressing task, which is vital to reasoning-driven generation, and agentic tasks requiring long context and continual learning. In this paper, we explore how to unify reasoning-driven generation, reasoning-enhanced text representation and context compression tasks in one forward pass for LLMs. Through meta latent tokens and a unified generative, representative and compressive tuning approach, we propose a training framework named GRC that bridges the three tasks. The trained models can accomplish three objectives in a single forward pass while maintaining modular, LEGO-style flexibility during inference. This design greatly reduces the deployment effort for retrieval-augmented generation (RAG) and achieves efficient inference and three times data utilization during training. Furthermore, this framework design enables a new paradigm for text embedding: self-reason-latent embeds, and a new generation paradigm, latent memory-augmented generation, where compressed and internalized KV cache with O(1) length is used as the updatable memory. We also propose hybrid paged attention to speed up the inference of our models. Extensive experiments on reasoning-intensive retrieval benchmarks, generative tasks, document compression, latency evaluation, and RAG settings demonstrate the effectiveness of our method and may shed light on the truly unified model that can handle reasoning-driven generation, embedding and compression tasks seamlessly.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper presents GRC, a training framework that unifies reasoning-driven generation, reasoning-enhanced text representation, and context compression for large language models. By leveraging meta latent tokens and a unified tuning approach encompassing generative, representative, and compressive objectives, the model can perform all three tasks in a single forward pass while offering modular flexibility during inference. This design aims to minimize training costs and deployment efforts for RAG systems, achieve efficient inference, and triple data utilization in training. New paradigms are introduced, including self-reason-latent embeds and latent memory-augmented generation using internalized compressed KV caches of constant length. Hybrid paged attention is proposed to accelerate inference. The effectiveness is supported by extensive experiments on reasoning-intensive retrieval benchmarks, generative tasks, document compression, latency, and RAG settings.

Significance. Should the central claims be validated by the experiments, this work would be significant in advancing unified LLM architectures that handle multiple related tasks without the overhead of separate training and deployment. It offers a path to more efficient handling of long-context and reasoning tasks in agentic applications. The modular design and new paradigms for generation and embedding are noteworthy contributions. Credit is given for the broad experimental coverage and the introduction of hybrid paged attention for practical inference improvements.

minor comments (2)
  1. [Abstract] The abstract summarizes the claims effectively but would benefit from including one or two key quantitative results (e.g., performance deltas on main benchmarks) to immediately convey the scale of the reported effectiveness.
  2. [§3] The term 'LEGO-style flexibility' is evocative but would be clearer with a concrete example or diagram illustrating how modules can be combined or swapped at inference time.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The provided abstract and description contain no equations, derivations, or load-bearing steps that reduce a claimed prediction or first-principles result to its own inputs by construction. Claims about meta latent tokens and unified GRC tuning are presented as empirical outcomes of the proposed framework rather than mathematical reductions. No self-citations, fitted parameters renamed as predictions, or ansatzes smuggled via prior work are visible. The derivation chain is therefore self-contained against external benchmarks and receives the default non-finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; all details are deferred to the full paper which was not available.

pith-pipeline@v0.9.0 · 5562 in / 1039 out tokens · 92767 ms · 2026-05-13T06:36:06.446756+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · 9 internal anchors

  1. [1]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  2. [2]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  3. [3]

    SimCSE: Simple Contrastive Learning of Sentence Embeddings

    Tianyu Gao, Xingcheng Yao, and Danqi Chen. SimCSE: Simple contrastive learning of sentence embeddings. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors,Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6894–6910, Online and Punta Cana, Dominican Republic, November 2021. Assoc...

  4. [4]

    LLM2vec: Large language models are secretly powerful text encoders

    Parishad BehnamGhader, Vaibhav Adlakha, Marius Mosbach, Dzmitry Bahdanau, Nicolas Chapados, and Siva Reddy. LLM2vec: Large language models are secretly powerful text encoders. InFirst Conference on Language Modeling, 2024. URL https://openreview. net/forum?id=IW1PR7vEBf

  5. [5]

    Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

    Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, et al. Qwen3 embedding: Advancing text embedding and reranking through foundation models.arXiv preprint arXiv:2506.05176, 2025

  6. [6]

    Language- agnostic BERT sentence embedding

    Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. Language- agnostic BERT sentence embedding. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors,Proceedings of the 60th Annual Meeting of the Association for Com- putational Linguistics (Volume 1: Long Papers), pages 878–891, Dublin, Ireland, May

  7. [7]

    doi: 10.18653/v1/2022.acl-long.62

    Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.62. URL https://aclanthology.org/2022.acl-long.62/

  8. [8]

    Enhancing cross-lingual sentence embedding for low-resource languages with word alignment

    Zhongtao Miao, Qiyu Wu, Kaiyan Zhao, Zilong Wu, and Yoshimasa Tsuruoka. Enhancing cross-lingual sentence embedding for low-resource languages with word alignment. In Kevin Duh, Helena Gomez, and Steven Bethard, editors,Findings of the Association for Computational Linguistics: NAACL 2024, pages 3225–3236, Mexico City, Mexico, June 2024. Association for Co...

  9. [9]

    MTEB: Massive Text Embedding Benchmark

    Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. MTEB: Massive text embedding benchmark. In Andreas Vlachos and Isabelle Augenstein, editors,Proceedings of the 17th Conference of the European Chapter of the Association for Computational Lin- guistics, pages 2014–2037, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics....

  10. [10]

    Adapting Language Models to Compress Contexts

    Alexis Chevalier, Alexander Wettig, Anirudh Ajith, and Danqi Chen. Adapting language models to compress contexts. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3829–3846, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v...

  11. [11]

    In-context autoencoder for context compression in a large language model

    Tao Ge, Hu Jing, Lei Wang, Xun Wang, Si-Qing Chen, and Furu Wei. In-context autoencoder for context compression in a large language model. InThe Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=uREj4ZuGJE

  12. [12]

    Agentic context engineering: Evolving contexts for self-improving language models

    Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish Thakker, James Zou, and Kunle Olukotun. Agentic context engineering: Evolving contexts for self-improving language models. InThe Fourteenth International Conference on Learning Representations,

  13. [13]

    URL https://openreview.net/forum?id=eC4ygDs02R

  14. [14]

    Retrieval-augmented generation for knowledge-intensive nlp tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors,Advances in Neural Informa...

  15. [15]

    Towards reasoning in large language models: A survey

    Jie Huang and Kevin Chen-Chuan Chang. Towards reasoning in large language models: A survey. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors,Findings of the Association for Computational Linguistics: ACL 2023, pages 1049–1065, Toronto, Canada, July

  16. [16]

    Towards reasoning in large language models: A survey

    Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-acl.67. URL https://aclanthology.org/2023.findings-acl.67/

  17. [17]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

  18. [18]

    Toward Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models

    Fengli Xu, Qianyue Hao, Chenyang Shao, Zefang Zong, Yu Li, Jingwei Wang, Yunke Zhang, Jingyi Wang, Xiaochong Lan, Jiahui Gong, Tianjian Ouyang, Fanjin Meng, Yuwei Yan, Qing- long Yang, Yiwen Song, Sijian Ren, Xinyuan Hu, Jie Feng, Chen Gao, and Yong Li. Toward large reasoning models: A survey of reinforced reasoning with large language models.Patterns, 6(...

  19. [19]

    s1: Simple Test-Time Scaling

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candes, and Tatsunori Hashimoto. s1: Simple test-time scaling. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Lan...

  20. [20]

    URL https://aclanthology.org/2025.emnlp-main.1025/

  21. [21]

    S*: Test Time Scaling for Code Generation

    Dacheng Li, Shiyi Cao, Chengkun Cao, Xiuyu Li, Shangyin Tan, Kurt Keutzer, Jiarong Xing, Joseph E. Gonzalez, and Ion Stoica. S*: Test time scaling for code generation. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Findings of the Association for Computational Linguistics: EMNLP 2025, pages 15964–15978, Suzhou, ...

  22. [22]

    URL https://aclanthology.org/2025

    doi: 10.18653/v1/2025.findings-emnlp.865. URL https://aclanthology.org/2025. findings-emnlp.865/. 11

  23. [23]

    Revisiting the Test-Time Scaling of o1-like Models: Do They Truly Possess Test-Time Scaling Capabilities?

    Zhiyuan Zeng, Qinyuan Cheng, Zhangyue Yin, Yunhua Zhou, and Xipeng Qiu. Revisiting the test-time scaling of o1-like models: Do they truly possess test-time scaling capabilities? In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (V...

  24. [24]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe Twelfth International Conference on Learning Representations, 2024. URL https: //openreview.net/forum?id=v8L0pN6EOi

  25. [25]

    Improving Arithmetic Reasoning Ability of Large Language Models through Relation Tuples, Verification and Dynamic Feedback

    Zhongtao Miao, Kaiyan Zhao, and Yoshimasa Tsuruoka. Improving arithmetic reasoning ability of large language models through relation tuples, verification and dynamic feedback.arXiv preprint arXiv:2406.17873, 2024

  26. [26]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/ forum?...

  27. [27]

    NeoAMT: Neologism-Aware Agentic Machine Translation with Reinforcement Learning

    Zhongtao Miao, Kaiyan Zhao, Masaaki Nagata, and Yoshimasa Tsuruoka. Neoamt: Neologism-aware agentic machine translation with reinforcement learning.arXiv preprint arXiv:2601.03790, 2026

  28. [28]

    DeepSeek-R1 Incentivizes Reasoning in LLMs through Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

  29. [29]

    BRIGHT: A realistic and challenging benchmark for reasoning- intensive retrieval

    Hongjin SU, Howard Yen, Mengzhou Xia, Weijia Shi, Niklas Muennighoff, Han yu Wang, Liu Haisu, Quan Shi, Zachary S Siegel, Michael Tang, Ruoxi Sun, Jinsung Yoon, Sercan O Arik, Danqi Chen, and Tao Yu. BRIGHT: A realistic and challenging benchmark for reasoning- intensive retrieval. InThe Thirteenth International Conference on Learning Representations,

  30. [30]

    URL https://openreview.net/forum?id=ykuc5q381b

  31. [31]

    ReasonIR: Training retrievers for reasoning tasks

    Rulin Shao, Rui Qiao, Varsha Kishore, Niklas Muennighoff, Xi Victoria Lin, Daniela Rus, Bryan Kian Hsiang Low, Sewon Min, Wen tau Yih, Pang Wei Koh, and Luke Zettlemoyer. ReasonIR: Training retrievers for reasoning tasks. InSecond Conference on Language Modeling,

  32. [32]

    URL https://openreview.net/forum?id=kkBCNLMbGj

  33. [33]

    Generative representational instruction tuning

    Niklas Muennighoff, Hongjin SU, Liang Wang, Nan Yang, Furu Wei, Tao Yu, Amanpreet Singh, and Douwe Kiela. Generative representational instruction tuning. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview. net/forum?id=BC4lIvfSzv

  34. [34]

    Large reasoning embedding models: Towards next-generation dense retrieval paradigm

    Jianting Tang, Dongshuai Li, Tao Wen, Fuyu Lv, Dan Ou, and Linli Xu. Large reasoning embedding models: Towards next-generation dense retrieval paradigm. InProceedings of the ACM Web Conference 2026, WWW ’26, page 8115–8126, New York, NY , USA, 2026. Association for Computing Machinery. ISBN 9798400723070. doi: 10.1145/3774904.3792826. URLhttps://doi.org/1...

  35. [35]

    UME-r1: Exploring reasoning-driven generative multimodal embeddings

    Zhibin Lan, Liqiang Niu, Fandong Meng, Jie Zhou, and Jinsong Su. UME-r1: Exploring reasoning-driven generative multimodal embeddings. InThe Fourteenth International Con- ference on Learning Representations, 2026. URL https://openreview.net/forum?id= 2ius36JQUJ

  36. [36]

    Efficient Memory Management for Large Language Model Serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th Symposium on Operating Systems Principles, SOSP ’23, page 611–626, New York, NY , USA, 2023. Association for Computing Machin...

  37. [37]

    Learning to compress prompts with gist tokens

    Jesse Mu, Xiang Lisa Li, and Noah Goodman. Learning to compress prompts with gist tokens. InThirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=2DtxPCL3T5

  38. [38]

    A simple framework for contrastive learning of visual representations

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 ofPro- ceedings of Machine Learning Research, pages 1597–1607. PMLR, 13–18 Jul 2020. URL https:...

  39. [39]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

  40. [40]

    Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

    Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, and Jason Wei. Challenging BIG-bench tasks and whether chain-of-thought can solve them. In Anna Rogers, Jordan Boyd- Graber, and Naoaki Okazaki, editors,Findings of the Association for Computational Linguistics: AC...

  41. [41]

    Natural Questions: A Benchmark for Question Answering Research

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research.Transact...

  42. [42]

    BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models

    Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. URLhttps://openreview.net/forum?id=wCu6T5xFjeJ

  43. [43]

    Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models

    Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexan- der W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Ama...

  44. [44]

    Multilingual E5 Text Embeddings: A Technical Report

    Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. Multilingual e5 text embeddings: A technical report.arXiv preprint arXiv:2402.05672, 2024

  45. [45]

    Dense Text Retrieval Based on Pretrained Language Models: A Survey

    Wayne Xin Zhao, Jing Liu, Ruiyang Ren, and Ji-Rong Wen. Dense text retrieval based on pretrained language models: A survey.ACM Trans. Inf. Syst., 42(4), February 2024. ISSN 1046-8188. doi: 10.1145/3637870. URLhttps://doi.org/10.1145/3637870

  46. [46]

    Dense Passage Retrieval for Open-Domain Question Answering

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors,Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769– 6781, Online...

  47. [47]

    Retrieval augmented language model pre-training

    Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented language model pre-training. In Hal Daumé III and Aarti Singh, editors,Proceedings of the 37th International Conference on Machine Learning, volume 119 ofProceedings of Machine Learning Research, pages 3929–3938. PMLR, 13–18 Jul 2020. URL https:// proceedings.mlr.p...

  48. [48]

    Compressing Context to Enhance Inference Efficiency of Large Language Models

    Yucheng Li, Bo Dong, Frank Guerin, and Chenghua Lin. Compressing context to enhance inference efficiency of large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6342–6353, Singapore, December 2023. Association for Computational Linguistics...

  49. [49]

    Prompt compression for large language models: A survey

    Zongqian Li, Yinhong Liu, Yixuan Su, and Nigel Collier. Prompt compression for large language models: A survey. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors,Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 7182–7195,...

  50. [50]

    LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models

    Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. LLMLingua: Com- pressing prompts for accelerated inference of large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Meth- ods in Natural Language Processing, pages 13358–13376, Singapore, December 2023. As- sociation ...

  51. [51]

    LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression

    Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Menglin Xia, Xufang Luo, Jue Zhang, Qingwei Lin, Victor Rühle, Yuqing Yang, Chin-Yew Lin, H. Vicky Zhao, Lili Qiu, and Dongmei Zhang. LLMLingua-2: Data distillation for efficient and faithful task-agnostic prompt compression. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Findings of the Association for ...

  52. [52]

    xrag: Extreme context compression for retrieval-augmented generation with one token

    Xin Cheng, Xun Wang, Xingxing Zhang, Tao Ge, Si-Qing Chen, Furu Wei, Huishuai Zhang, and Dongyan Zhao. xrag: Extreme context compression for retrieval-augmented generation with one token. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Pa- quet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Sys- tems, volume 37, pages 109...

  53. [53]

    Context Cascade Compression: Exploring the Upper Limits of Text Compression

    Fanfan Liu and Haibo Qiu. Context cascade compression: Exploring the upper limits of text compression.arXiv preprint arXiv:2511.15244, 2025

  54. [54]

    DeepSeek-OCR: Contexts Optical Compression

    Haoran Wei, Yaofeng Sun, and Yukun Li. Deepseek-ocr: Contexts optical compression.arXiv preprint arXiv:2510.18234, 2025. 15

  55. [55]

    Learning to compress prompts with gist tokens

    Jesse Mu, Xiang Li, and Noah Goodman. Learning to compress prompts with gist tokens. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 19327–19352. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/ 3d77c6dcc7f143aa...

  56. [56]

    PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

    Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Pritam Damania, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Mathews, and Shen Li. Pytorch fsdp: Experiences on scaling fully sharded data parallel.Proc. VLDB Endow., 16(12):3848–3860, August 20...

  57. [57]

    Scaling deep contrastive learning batch size under memory limited setup

    Luyu Gao, Yunyi Zhang, Jiawei Han, and Jamie Callan. Scaling deep contrastive learning batch size under memory limited setup. In Anna Rogers, Iacer Calixto, Ivan Vuli´c, Naomi Saphra, Nora Kassner, Oana-Maria Camburu, Trapit Bansal, and Vered Shwartz, editors,Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021), pages 316–321...

  58. [58]

    Transformers: State-of-the-art natural language processing

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. Transformers: State-of-the-art na...

  59. [59]

    1.4 Million Open-Source Distilled Reasoning Dataset to Empower Large Language Model Training

    Han Zhao, Haotian Wang, Yiping Peng, Sitong Zhao, Xiaoyu Tian, Shuaiting Chen, Yunjie Ji, and Xiangang Li. 1.4 million open-source distilled reasoning dataset to empower large language model training.arXiv preprint arXiv:2503.19633, 2025

  60. [60]

    Aimo-2 winning solution: Building state-of-the-art mathematical reasoning models with openmathreasoning dataset

    Ivan Moshkov, Darragh Hanley, Ivan Sorokin, Shubham Toshniwal, Christof Henkel, Benedikt Schifferer, Wei Du, and Igor Gitman. Aimo-2 winning solution: Building state-of-the-art math- ematical reasoning models with openmathreasoning dataset.arXiv preprint arXiv:2504.16891, 2025

  61. [61]

    ReasonEmbed: Enhanced Text Embeddings for Reasoning-Intensive Document Retrieval

    Jianlyu Chen, Junwei Lan, Chaofan Li, Defu Lian, and Zheng Liu. Reasonembed: Enhanced text embeddings for reasoning-intensive document retrieval.arXiv preprint arXiv:2510.08252, 2025

  62. [62]

    Dr Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research

    Rulin Shao, Akari Asai, Shannon Zejiang Shen, Hamish Ivison, Varsha Kishore, Jingming Zhuo, Xinran Zhao, Molly Park, Samuel G Finlayson, David Sontag, et al. Dr tulu: Reinforcement learning with evolving rubrics for deep research.arXiv preprint arXiv:2511.19399, 2025

  63. [63]

    Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-Tuning of LLM Agents

    Yueqi Song, Ketan Ramaneti, Zaid Sheikh, Ziru Chen, Boyu Gou, Tianbao Xie, Yiheng Xu, Danyang Zhang, Apurva Gandhi, Fan Yang, et al. Agent data protocol: Unifying datasets for diverse, effective fine-tuning of llm agents.arXiv preprint arXiv:2510.24702, 2025

  64. [64]

    Language Models Are Unsupervised Multitask Learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners.OpenAI blog, 1(8):9,
