pith. machine review for the scientific record.

arxiv: 2605.09100 · v2 · submitted 2026-05-09 · 💻 cs.CL

Recognition: no theorem link

GRC: Unifying Reasoning-Driven Generation, Retrieval and Compression

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 06:36 UTC · model grok-4.3

classification 💻 cs.CL
keywords unified LLM training · meta latent tokens · context compression · retrieval-augmented generation · latent memory-augmented generation · single forward pass · modular inference

The pith

A single training framework lets large language models handle generation, retrieval, and context compression together in one forward pass.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models today require separate training for generative tasks like text production and for embedding tasks used in retrieval, which multiplies costs and deployment work. Context compression adds another separate challenge, especially for systems that need long contexts or continual learning. The paper introduces a framework that brings these three capabilities into one model by using special meta latent tokens along with a combined tuning process that covers generation, representation, and compression at once. The resulting model completes all three objectives during a single forward pass yet still allows flexible, modular use at inference time. This setup cuts down on separate model deployments and improves how training data is used across the tasks.

Core claim

Through meta latent tokens and a unified generative, representative and compressive tuning approach, the GRC framework trains models to accomplish reasoning-driven generation, reasoning-enhanced text representation, and context compression in a single forward pass while keeping modular, LEGO-style flexibility during inference.

What carries the argument

Meta latent tokens paired with unified generative, representative, and compressive tuning, which let one model switch between the three behaviors inside the same forward computation.
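A minimal sketch of how meta latent tokens could let one forward pass serve all three objectives. The class, dimensions, pooling choice, and absence of causal masking below are illustrative assumptions (a PyTorch toy), not the paper's implementation:

import torch
import torch.nn as nn

class UnifiedGRCToy(nn.Module):
    """Toy unified model: text tokens plus learned meta latent tokens in one pass."""
    def __init__(self, vocab=32000, d=256, n_meta=8, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.meta = nn.Parameter(torch.randn(n_meta, d) * 0.02)   # meta latent tokens
        layer = nn.TransformerEncoderLayer(d, n_heads, 4 * d, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d, vocab, bias=False)

    def forward(self, input_ids):
        b, t = input_ids.shape
        x = self.embed(input_ids)                                  # (b, t, d)
        meta = self.meta.unsqueeze(0).expand(b, -1, -1)            # (b, n_meta, d)
        h = self.backbone(torch.cat([x, meta], dim=1))             # single forward pass
        text_h, meta_h = h[:, :t], h[:, t:]
        return {
            "gen_logits": self.lm_head(text_h),   # generative objective (next-token loss)
            "embedding": meta_h.mean(dim=1),      # representative objective (contrastive loss)
            "compressed": meta_h,                 # compressive objective: fixed-length summary
        }

out = UnifiedGRCToy()(torch.randint(0, 32000, (2, 16)))
print(out["gen_logits"].shape, out["embedding"].shape, out["compressed"].shape)

The point of the sketch is only the routing: the same hidden states from one pass are read three ways, which is what makes the "three objectives, one forward computation" claim possible.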

If this is right

  • Retrieval-augmented generation systems can be deployed with far less separate infrastructure because one model covers the needed capabilities.
  • Training data is used three times more efficiently since the same examples serve generation, retrieval, and compression objectives.
  • Text embedding shifts to a self-reason-latent style where the model reasons internally before producing the embedding vector.
  • Generation can draw on latent memory-augmented setups that treat compressed and internalized KV cache as updatable memory of constant length (see the sketch after this list).
  • Inference speed improves through the addition of hybrid paged attention that works with the compressed representations.
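
One way to read the latent-memory idea as code, reusing the toy model from the sketch above: each document is compressed into a constant-length latent block, and a bounded set of these blocks acts as the updatable memory that later generation would condition on. The slot count and the replacement rule are assumptions, not the paper's design:

import torch

def update_memory(memory_slots, new_summary, max_slots=4):
    """Keep at most max_slots constant-length summaries; drop the oldest first."""
    memory_slots = memory_slots + [new_summary.detach()]
    return memory_slots[-max_slots:]

model = UnifiedGRCToy()          # toy class from the sketch above
memory = []
for doc_ids in [torch.randint(0, 32000, (1, 64)) for _ in range(6)]:
    summary = model(doc_ids)["compressed"]     # (1, n_meta, d), independent of document length
    memory = update_memory(memory, summary)

# At generation time the bounded memory (not the raw documents) would be fed back
# to the model, e.g. as prefix states, keeping per-turn cost roughly constant.
prefix = torch.cat(memory, dim=1)              # (1, <= max_slots * n_meta, d)
print(prefix.shape)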

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same meta-token approach could be extended to other multi-task settings such as translation combined with summarization without new training runs for each pair.
  • Agentic systems that switch between planning, tool use, and memory management might become simpler if they inherit the single-pass modularity shown here.
  • Future scaling laws for unified models would need to account for the shared internal representations rather than assuming additive costs for each added capability.

Load-bearing premise

That adding meta latent tokens and unified tuning will let the model keep strong performance on all three tasks at once without forcing trade-offs or needing separate optimization paths.

What would settle it

A head-to-head test where the unified GRC model is run on standard generation, retrieval, and compression benchmarks and compared directly against three separate models each optimized for only one of those tasks.
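
In outline, that comparison is just one unified checkpoint and three specialists run through the same scoring functions. The task names and evaluate callables below are placeholders, not the paper's evaluation harness:

def compare(unified_model, specialist_models, benchmarks):
    """benchmarks maps a task name ('generation', 'retrieval', 'compression') to a scoring function."""
    results = {}
    for task, evaluate in benchmarks.items():
        results[task] = {
            "unified": evaluate(unified_model),
            "specialist": evaluate(specialist_models[task]),
        }
    return results

Per-task scoring functions could wrap standard harnesses (for example, retrieval metrics on BEIR or BRIGHT), so the only variable is whether one model or three is being scored.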

Figures

Figures reproduced from arXiv: 2605.09100 by Qiyu Wu, Yoshimasa Tsuruoka, Zhongtao Miao.

Figure 1: GRC training framework. One forward pass of a single model can fulfill three objectives.

Figure 2: Four diverse generation patterns: (1) regular generation; (2) self-reason-latent-based query embedding; (3) document embedding; and (4) latent memory-augmented generation.

Figure 3: Comparison of document KV cache storage size.

Figure 4: KV cache management with the proposed hybrid paged attention for speeding up inference.

Figure 5: NDCG@10 scores of GRC-1.7B with different max new tokens (128, 256, 512, 1024, 2048, …).

Figure 6: Reconstruction scores of GRC models on Wikipedia markdown document compression.

Figure 7: Comparison of latency (average consumed time in seconds per user query).

Figure 8: Naive inference implementation.
read the original abstract

Text embedding and generative tasks are usually trained separately based on large language models (LLMs) nowadays. This causes a large amount of training cost and deployment effort. Context compression is also a challenging and pressing task, which is vital to reasoning-driven generation, and agentic tasks requiring long context and continual learning. In this paper, we explore how to unify reasoning-driven generation, reasoning-enhanced text representation and context compression tasks in one forward pass for LLMs. Through meta latent tokens and a unified generative, representative and compressive tuning approach, we propose a training framework named GRC that bridges the three tasks. The trained models can accomplish three objectives in a single forward pass while maintaining modular, LEGO-style flexibility during inference. This design greatly reduces the deployment effort for retrieval-augmented generation (RAG) and achieves efficient inference and three times data utilization during training. Furthermore, this framework design enables a new paradigm for text embedding: self-reason-latent embeds, and a new generation paradigm, latent memory-augmented generation, where compressed and internalized KV cache with O(1) length is used as the updatable memory. We also propose hybrid paged attention to speed up the inference of our models. Extensive experiments on reasoning-intensive retrieval benchmarks, generative tasks, document compression, latency evaluation, and RAG settings demonstrate the effectiveness of our method and may shed light on the truly unified model that can handle reasoning-driven generation, embedding and compression tasks seamlessly.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper presents GRC, a training framework that unifies reasoning-driven generation, reasoning-enhanced text representation, and context compression for large language models. By leveraging meta latent tokens and a unified tuning approach encompassing generative, representative, and compressive objectives, the model can perform all three tasks in a single forward pass while offering modular flexibility during inference. This design aims to minimize training costs and deployment efforts for RAG systems, achieve efficient inference, and triple data utilization in training. New paradigms are introduced, including self-reason-latent embeds and latent memory-augmented generation using internalized compressed KV caches of constant length. Hybrid paged attention is proposed to accelerate inference. The effectiveness is supported by extensive experiments on reasoning-intensive retrieval benchmarks, generative tasks, document compression, latency, and RAG settings.

Significance. Should the central claims be validated by the experiments, this work would be significant in advancing unified LLM architectures that handle multiple related tasks without the overhead of separate training and deployment. It offers a path to more efficient handling of long-context and reasoning tasks in agentic applications. The modular design and new paradigms for generation and embedding are noteworthy contributions. Credit is given for the broad experimental coverage and the introduction of hybrid paged attention for practical inference improvements.

minor comments (2)
  1. [Abstract] The abstract summarizes the claims effectively but would benefit from including one or two key quantitative results (e.g., performance deltas on main benchmarks) to immediately convey the scale of the reported effectiveness.
  2. [§3] The term 'LEGO-style flexibility' is evocative but would be clearer with a concrete example or diagram illustrating how modules can be combined or swapped at inference time.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The provided abstract and description contain no equations, derivations, or load-bearing steps that reduce a claimed prediction or first-principles result to its own inputs by construction. Claims about meta latent tokens and unified GRC tuning are presented as empirical outcomes of the proposed framework rather than mathematical reductions. No self-citations, fitted parameters renamed as predictions, or ansatzes smuggled via prior work are visible. The derivation chain is therefore self-contained against external benchmarks and receives the default non-finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; all details are deferred to the full paper which was not available.

pith-pipeline@v0.9.0 · 5562 in / 1039 out tokens · 92767 ms · 2026-05-13T06:36:06.446756+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · 9 internal anchors

  1. [1]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  2. [2]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  3. [3]

    SimCSE: Simple Contrastive Learning of Sentence Embeddings

    Tianyu Gao, Xingcheng Yao, and Danqi Chen. SimCSE: Simple contrastive learning of sentence embeddings. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors,Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6894–6910, Online and Punta Cana, Dominican Republic, November 2021. Assoc...

  4. [4]

    LLM2vec: Large language models are secretly powerful text encoders

    Parishad BehnamGhader, Vaibhav Adlakha, Marius Mosbach, Dzmitry Bahdanau, Nicolas Chapados, and Siva Reddy. LLM2vec: Large language models are secretly powerful text encoders. InFirst Conference on Language Modeling, 2024. URL https://openreview. net/forum?id=IW1PR7vEBf

  5. [5]

    Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

    Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, et al. Qwen3 embedding: Advancing text embedding and reranking through foundation models.arXiv preprint arXiv:2506.05176, 2025

  6. [6]

    Language- agnostic BERT sentence embedding

    Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. Language- agnostic BERT sentence embedding. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors,Proceedings of the 60th Annual Meeting of the Association for Com- putational Linguistics (Volume 1: Long Papers), pages 878–891, Dublin, Ireland, May

  7. [7]

    doi: 10.18653/v1/2022.acl-long.62

    Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.62. URL https://aclanthology.org/2022.acl-long.62/

  8. [8]

    Enhancing cross-lingual sentence embedding for low-resource languages with word alignment

    Zhongtao Miao, Qiyu Wu, Kaiyan Zhao, Zilong Wu, and Yoshimasa Tsuruoka. Enhancing cross-lingual sentence embedding for low-resource languages with word alignment. In Kevin Duh, Helena Gomez, and Steven Bethard, editors,Findings of the Association for Computational Linguistics: NAACL 2024, pages 3225–3236, Mexico City, Mexico, June 2024. Association for Co...

  9. [9]

    MTEB: Massive Text Embedding Benchmark

    Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. MTEB: Massive text embedding benchmark. In Andreas Vlachos and Isabelle Augenstein, editors,Proceedings of the 17th Conference of the European Chapter of the Association for Computational Lin- guistics, pages 2014–2037, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics....

  10. [10]

    Adapting Language Models to Compress Contexts

    Alexis Chevalier, Alexander Wettig, Anirudh Ajith, and Danqi Chen. Adapting language models to compress contexts. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3829–3846, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v...

  11. [11]

    In-context autoencoder for context compression in a large language model

    Tao Ge, Hu Jing, Lei Wang, Xun Wang, Si-Qing Chen, and Furu Wei. In-context autoencoder for context compression in a large language model. InThe Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=uREj4ZuGJE

  12. [12]

    Agentic context engineering: Evolving contexts for self-improving language models

    Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish Thakker, James Zou, and Kunle Olukotun. Agentic context engineering: Evolving contexts for self-improving language models. InThe Fourteenth International Conference on Learning Representations,

  13. [13]

    URL https://openreview.net/forum?id=eC4ygDs02R

  14. [14]

    Retrieval-augmented generation for knowledge-intensive nlp tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors,Advances in Neural Informa...

  15. [15]

    Towards reasoning in large language models: A survey

    Jie Huang and Kevin Chen-Chuan Chang. Towards reasoning in large language models: A survey. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors,Findings of the Association for Computational Linguistics: ACL 2023, pages 1049–1065, Toronto, Canada, July

  16. [16]

    Towards reasoning in large language models: A survey

    Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-acl.67. URL https://aclanthology.org/2023.findings-acl.67/

  17. [17]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

  18. [18]

    Toward Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models

    Fengli Xu, Qianyue Hao, Chenyang Shao, Zefang Zong, Yu Li, Jingwei Wang, Yunke Zhang, Jingyi Wang, Xiaochong Lan, Jiahui Gong, Tianjian Ouyang, Fanjin Meng, Yuwei Yan, Qing- long Yang, Yiwen Song, Sijian Ren, Xinyuan Hu, Jie Feng, Chen Gao, and Yong Li. Toward large reasoning models: A survey of reinforced reasoning with large language models.Patterns, 6(...

  19. [19]

    s1: Simple Test-Time Scaling

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candes, and Tatsunori Hashimoto. s1: Simple test-time scaling. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Lan...

  20. [20]

    URL https://aclanthology.org/2025.emnlp-main.1025/

  21. [21]

    S*: Test Time Scaling for Code Generation

    Dacheng Li, Shiyi Cao, Chengkun Cao, Xiuyu Li, Shangyin Tan, Kurt Keutzer, Jiarong Xing, Joseph E. Gonzalez, and Ion Stoica. S*: Test time scaling for code generation. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Findings of the Association for Computational Linguistics: EMNLP 2025, pages 15964–15978, Suzhou, ...

  22. [22]

    URL https://aclanthology.org/2025

    doi: 10.18653/v1/2025.findings-emnlp.865. URL https://aclanthology.org/2025. findings-emnlp.865/. 11

  23. [23]

    Revisiting the Test-Time Scaling of o1-like Models: Do They Truly Possess Test-Time Scaling Capabilities?

    Zhiyuan Zeng, Qinyuan Cheng, Zhangyue Yin, Yunhua Zhou, and Xipeng Qiu. Revisiting the test-time scaling of o1-like models: Do they truly possess test-time scaling capabilities? In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (V...

  24. [24]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe Twelfth International Conference on Learning Representations, 2024. URL https: //openreview.net/forum?id=v8L0pN6EOi

  25. [25]

    Improving Arithmetic Reasoning Ability of Large Language Models through Relation Tuples, Verification and Dynamic Feedback

    Zhongtao Miao, Kaiyan Zhao, and Yoshimasa Tsuruoka. Improving arithmetic reasoning ability of large language models through relation tuples, verification and dynamic feedback.arXiv preprint arXiv:2406.17873, 2024

  26. [26]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/ forum?...

  27. [27]

    NeoAMT: Neologism-Aware Agentic Machine Translation with Reinforcement Learning

    Zhongtao Miao, Kaiyan Zhao, Masaaki Nagata, and Yoshimasa Tsuruoka. Neoamt: Neologism-aware agentic machine translation with reinforcement learning.arXiv preprint arXiv:2601.03790, 2026

  28. [28]

    DeepSeek-R1 Incentivizes Reasoning in LLMs through Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

  29. [29]

    BRIGHT: A realistic and challenging benchmark for reasoning- intensive retrieval

    Hongjin SU, Howard Yen, Mengzhou Xia, Weijia Shi, Niklas Muennighoff, Han yu Wang, Liu Haisu, Quan Shi, Zachary S Siegel, Michael Tang, Ruoxi Sun, Jinsung Yoon, Sercan O Arik, Danqi Chen, and Tao Yu. BRIGHT: A realistic and challenging benchmark for reasoning- intensive retrieval. InThe Thirteenth International Conference on Learning Representations,

  30. [30]

    URL https://openreview.net/forum?id=ykuc5q381b

  31. [31]

    ReasonIR: Training retrievers for reasoning tasks

    Rulin Shao, Rui Qiao, Varsha Kishore, Niklas Muennighoff, Xi Victoria Lin, Daniela Rus, Bryan Kian Hsiang Low, Sewon Min, Wen tau Yih, Pang Wei Koh, and Luke Zettlemoyer. ReasonIR: Training retrievers for reasoning tasks. InSecond Conference on Language Modeling,

  32. [32]

    URL https://openreview.net/forum?id=kkBCNLMbGj

  33. [33]

    Generative representational instruction tuning

    Niklas Muennighoff, Hongjin SU, Liang Wang, Nan Yang, Furu Wei, Tao Yu, Amanpreet Singh, and Douwe Kiela. Generative representational instruction tuning. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview. net/forum?id=BC4lIvfSzv

  34. [34]

    Large reasoning embedding models: Towards next-generation dense retrieval paradigm

    Jianting Tang, Dongshuai Li, Tao Wen, Fuyu Lv, Dan Ou, and Linli Xu. Large reasoning embedding models: Towards next-generation dense retrieval paradigm. InProceedings of the ACM Web Conference 2026, WWW ’26, page 8115–8126, New York, NY , USA, 2026. Association for Computing Machinery. ISBN 9798400723070. doi: 10.1145/3774904.3792826. URLhttps://doi.org/1...

  35. [35]

    UME-r1: Exploring reasoning-driven generative multimodal embeddings

    Zhibin Lan, Liqiang Niu, Fandong Meng, Jie Zhou, and Jinsong Su. UME-r1: Exploring reasoning-driven generative multimodal embeddings. InThe Fourteenth International Con- ference on Learning Representations, 2026. URL https://openreview.net/forum?id= 2ius36JQUJ

  36. [36]

    Efficient Memory Management for Large Language Model Serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th Symposium on Operating Systems Principles, SOSP ’23, page 611–626, New York, NY , USA, 2023. Association for Computing Machin...

  37. [37]

    Learning to compress prompts with gist tokens

    Jesse Mu, Xiang Lisa Li, and Noah Goodman. Learning to compress prompts with gist tokens. InThirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=2DtxPCL3T5

  38. [38]

    A simple framework for contrastive learning of visual representations

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 ofPro- ceedings of Machine Learning Research, pages 1597–1607. PMLR, 13–18 Jul 2020. URL https:...

  39. [39]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

  40. [40]

    Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

    Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, and Jason Wei. Challenging BIG-bench tasks and whether chain-of-thought can solve them. In Anna Rogers, Jordan Boyd- Graber, and Naoaki Okazaki, editors,Findings of the Association for Computational Linguistics: AC...

  41. [41]

    Natural Questions: A Benchmark for Question Answering Research

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research.Transact...

  42. [42]

    BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models

    Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. URLhttps://openreview.net/forum?id=wCu6T5xFjeJ

  43. [43]

    Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models

    Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexan- der W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Ama...

  44. [44]

    Multilingual E5 Text Embeddings: A Technical Report

    Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. Multilingual e5 text embeddings: A technical report.arXiv preprint arXiv:2402.05672, 2024

  45. [45]

    Dense Text Retrieval Based on Pretrained Language Models: A Survey

    Wayne Xin Zhao, Jing Liu, Ruiyang Ren, and Ji-Rong Wen. Dense text retrieval based on pretrained language models: A survey.ACM Trans. Inf. Syst., 42(4), February 2024. ISSN 1046-8188. doi: 10.1145/3637870. URLhttps://doi.org/10.1145/3637870

  46. [46]

    Dense Passage Retrieval for Open-Domain Question Answering

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors,Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769– 6781, Online...

  47. [47]

    Retrieval augmented language model pre-training

    Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented language model pre-training. In Hal Daumé III and Aarti Singh, editors,Proceedings of the 37th International Conference on Machine Learning, volume 119 ofProceedings of Machine Learning Research, pages 3929–3938. PMLR, 13–18 Jul 2020. URL https:// proceedings.mlr.p...

  48. [48]

    Compressing Context to Enhance Inference Efficiency of Large Language Models

    Yucheng Li, Bo Dong, Frank Guerin, and Chenghua Lin. Compressing context to enhance inference efficiency of large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6342–6353, Singapore, December 2023. Association for Computational Linguistics...

  49. [49]

    Prompt compression for large language models: A survey

    Zongqian Li, Yinhong Liu, Yixuan Su, and Nigel Collier. Prompt compression for large language models: A survey. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors,Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 7182–7195,...

  50. [50]

    LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models

    Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. LLMLingua: Com- pressing prompts for accelerated inference of large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Meth- ods in Natural Language Processing, pages 13358–13376, Singapore, December 2023. As- sociation ...

  51. [51]

    LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression

    Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Menglin Xia, Xufang Luo, Jue Zhang, Qingwei Lin, Victor Rühle, Yuqing Yang, Chin-Yew Lin, H. Vicky Zhao, Lili Qiu, and Dongmei Zhang. LLMLingua-2: Data distillation for efficient and faithful task-agnostic prompt compression. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Findings of the Association for ...

  52. [52]

    xrag: Extreme context compression for retrieval-augmented generation with one token

    Xin Cheng, Xun Wang, Xingxing Zhang, Tao Ge, Si-Qing Chen, Furu Wei, Huishuai Zhang, and Dongyan Zhao. xrag: Extreme context compression for retrieval-augmented generation with one token. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Pa- quet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Sys- tems, volume 37, pages 109...

  53. [53]

    Context Cascade Compression: Exploring the Upper Limits of Text Compression

    Fanfan Liu and Haibo Qiu. Context cascade compression: Exploring the upper limits of text compression.arXiv preprint arXiv:2511.15244, 2025

  54. [54]

    DeepSeek-OCR: Contexts Optical Compression

    Haoran Wei, Yaofeng Sun, and Yukun Li. Deepseek-ocr: Contexts optical compression.arXiv preprint arXiv:2510.18234, 2025. 15

  55. [55]

    Learning to compress prompts with gist tokens

    Jesse Mu, Xiang Li, and Noah Goodman. Learning to compress prompts with gist tokens. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 19327–19352. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/ 3d77c6dcc7f143aa...

  56. [56]

    PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

    Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Pritam Damania, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Mathews, and Shen Li. Pytorch fsdp: Experiences on scaling fully sharded data parallel.Proc. VLDB Endow., 16(12):3848–3860, August 20...

  57. [57]

    Scaling deep contrastive learning batch size under memory limited setup

    Luyu Gao, Yunyi Zhang, Jiawei Han, and Jamie Callan. Scaling deep contrastive learning batch size under memory limited setup. In Anna Rogers, Iacer Calixto, Ivan Vuli´c, Naomi Saphra, Nora Kassner, Oana-Maria Camburu, Trapit Bansal, and Vered Shwartz, editors,Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021), pages 316–321...

  58. [58]

    Transformers: State-of-the-art natural language processing

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. Transformers: State-of-the-art na...

  59. [59]

    1.4 Million Open-Source Distilled Reasoning Dataset to Empower Large Language Model Training

    Han Zhao, Haotian Wang, Yiping Peng, Sitong Zhao, Xiaoyu Tian, Shuaiting Chen, Yunjie Ji, and Xiangang Li. 1.4 million open-source distilled reasoning dataset to empower large language model training.arXiv preprint arXiv:2503.19633, 2025

  60. [60]

    Aimo-2 winning solution: Building state-of-the-art mathematical reasoning models with openmathreasoning dataset

    Ivan Moshkov, Darragh Hanley, Ivan Sorokin, Shubham Toshniwal, Christof Henkel, Benedikt Schifferer, Wei Du, and Igor Gitman. Aimo-2 winning solution: Building state-of-the-art math- ematical reasoning models with openmathreasoning dataset.arXiv preprint arXiv:2504.16891, 2025

  61. [61]

    ReasonEmbed: Enhanced Text Embeddings for Reasoning-Intensive Document Retrieval

    Jianlyu Chen, Junwei Lan, Chaofan Li, Defu Lian, and Zheng Liu. Reasonembed: Enhanced text embeddings for reasoning-intensive document retrieval.arXiv preprint arXiv:2510.08252, 2025

  62. [62]

    Dr Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research

    Rulin Shao, Akari Asai, Shannon Zejiang Shen, Hamish Ivison, Varsha Kishore, Jingming Zhuo, Xinran Zhao, Molly Park, Samuel G Finlayson, David Sontag, et al. Dr tulu: Reinforcement learning with evolving rubrics for deep research.arXiv preprint arXiv:2511.19399, 2025

  63. [63]

    Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-Tuning of LLM Agents

    Yueqi Song, Ketan Ramaneti, Zaid Sheikh, Ziru Chen, Boyu Gou, Tianbao Xie, Yiheng Xu, Danyang Zhang, Apurva Gandhi, Fan Yang, et al. Agent data protocol: Unifying datasets for diverse, effective fine-tuning of llm agents.arXiv preprint arXiv:2510.24702, 2025

  64. [64]

    Language Models Are Unsupervised Multitask Learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners.OpenAI blog, 1(8):9,
