A Unified Model and Document Representation for On-Device Retrieval-Augmented Generation
Pith reviewed 2026-05-10 11:51 UTC · model grok-4.3
The pith
A single model and representation can handle both document retrieval and context compression for on-device RAG while matching full-context performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors introduce a unified model that produces a shared document representation usable for both retrieving relevant passages and compressing the retrieved content into a short context for the generator. On standard benchmarks this yields performance on par with conventional RAG pipelines while using an average of 1/10 the context size and without raising storage costs above those of a multi-vector retrieval model.
What carries the argument
The unified model whose single learned document representation supports both retrieval scoring and context compression for generation.
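To make the dual use concrete, here is a minimal sketch, assuming a ColBERT-style multi-vector encoder whose stored per-document vectors are scored with MaxSim for retrieval and projected into the generator's embedding space as a short soft prompt. Every name here (encode, maxsim_score, projector) is hypothetical, not the authors' API.

```python
import torch

def encode(texts, encoder):
    # Hypothetical shared encoder: one forward pass per document yields a
    # small matrix of vectors, (batch, n_vecs, d_model), stored once on disk.
    return torch.nn.functional.normalize(encoder(texts), dim=-1)

def maxsim_score(query_vecs, doc_vecs):
    """ColBERT-style late interaction: each query vector takes its best
    match among the document vectors; the sum is the relevance score."""
    sim = query_vecs @ doc_vecs.T        # (n_q, n_d) cosine similarities
    return sim.max(dim=1).values.sum()

def compressed_context(doc_vecs, projector):
    """Reuse the same stored vectors as the generator's context: project
    them into its embedding space instead of re-feeding the full text."""
    return projector(doc_vecs)           # (n_vecs, d_generator)

# The generator then consumes ~n_vecs embedding slots rather than the full
# passage, e.g. via model(inputs_embeds=...) in Hugging Face Transformers.
```

The design point this sketch illustrates: nothing extra is stored, because the vectors written at indexing time are exactly what retrieval scores and what generation consumes.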
If this is right
- On-device pipelines become practical for personal data without internet or external servers.
- KV cache and attention memory demands on the generative model drop sharply because far less context is supplied (see the arithmetic sketch after this list).
- Disk space stays equivalent to a multi-vector retriever since no extra embeddings are stored.
- The same representation can replace two separate components in future on-device systems.
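Illustrative arithmetic behind the memory and storage bullets above, a sketch using assumed, generic decoder and retriever shapes rather than numbers from the paper:

```python
# KV cache for a decoder: 2 tensors (K and V) per layer, each of
# kv_heads x head_dim x context_len elements.
layers, kv_heads, head_dim, bytes_fp16 = 28, 8, 128, 2

def kv_bytes(context_len):
    return 2 * layers * kv_heads * head_dim * context_len * bytes_fp16

full, tenth = kv_bytes(4096), kv_bytes(410)      # ~1/10 of the context
print(f"{full / 2**20:.0f} MiB vs {tenth / 2**20:.0f} MiB")  # 448 vs 45

# Multi-vector storage per document: n_vecs x dim x bytes. A shared
# representation pays this once; separate retrieval and compression
# embeddings would roughly double it.
n_vecs, dim = 16, 128
print(n_vecs * dim * bytes_fp16, "bytes per document")       # 4096
```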
Where Pith is reading between the lines
- The unification approach could be extended to merge additional on-device tasks such as summarization into the same representation.
- Real-world tests on mobile hardware would reveal actual latency and energy savings beyond benchmark numbers.
- Similar shared-representation designs might apply to other resource-constrained settings where retrieval and generation compete for memory.
Load-bearing premise
A single learned representation can simultaneously support high-quality retrieval and effective context compression without substantial quality loss under tight on-device memory limits.
What would settle it
A side-by-side evaluation on a standard RAG benchmark comparing the unified model at roughly 1/10 of the context against a traditional full-context RAG reader: a measurable drop in the unified model's generation accuracy would refute the parity claim, while matching accuracy would confirm it.
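A hedged sketch of that settling experiment, assuming exact-match scoring and two answer functions standing in for the full-context reader and the unified model; none of these names come from the paper.

```python
def exact_match(pred: str, gold: str) -> int:
    return int(pred.strip().lower() == gold.strip().lower())

def side_by_side(benchmark, answer_full, answer_unified):
    """Collect per-question EM for both systems so the gap can be
    tested pairwise, not just compared as two aggregate numbers."""
    paired = []
    for q in benchmark:  # each q: {"question", "answer", "docs"}
        em_f = exact_match(answer_full(q["question"], q["docs"]), q["answer"])
        em_u = exact_match(answer_unified(q["question"], q["docs"]), q["answer"])
        paired.append((em_f, em_u))
    return paired
```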
Original abstract
Traditional Retrieval-Augmented Generation (RAG) approaches generally assume that retrieval and generation occur on powerful servers removed from the end user. While this reduces local hardware constraints, it introduces significant drawbacks: privacy concerns regarding data access, recurring maintenance and storage costs, increased latency, and the necessity of an internet connection. On-device RAG addresses these challenges by executing the entire pipeline locally, making it ideal for querying sensitive personal information such as financial documents, contact details, and medical history. However, on-device deployment necessitates a delicate balance between limited memory and disk space. Specifically, the context size provided to the generative model must be restricted to manage KV cache and attention memory usage, while the size of stored embeddings must be minimized to preserve disk space. In this work, we propose a unified model that compresses the RAG context and utilizes the same representations for retrieval. This approach minimizes disk utilization compared to using separate representations, while significantly reducing the context size required for generation. With an average of 1/10 of the context, our model matches the performance of a traditional RAG reader without increasing storage requirements compared to a multi-vector retrieval model. This approach represents the first model to unify retrieval and context compression using a shared model and representation. We believe this work will inspire further consolidation of distinct models to optimize on-device performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a unified model for on-device Retrieval-Augmented Generation (RAG) that employs a single learned representation to handle both document retrieval and context compression. The central empirical claim is that this shared approach matches the performance of a traditional RAG reader while using an average of only 1/10 of the context length, without increasing storage requirements relative to multi-vector retrieval baselines, and that it is the first model to unify these functions.
Significance. If the reported performance equivalence holds under rigorous evaluation, the work offers a practical advance for privacy-preserving, low-latency on-device RAG by reducing both KV-cache memory during generation and disk usage for embeddings. The consolidation of retrieval and compression into one model and representation is a clear strength and could guide further efficiency work in resource-constrained settings.
Major comments (1)
- [§5] §5 (Experiments): the claim that performance matches a traditional RAG reader at 1/10 context size is central yet presented without reported baselines, datasets, ablation controls, or error bars in the abstract; the full experimental section must supply these details (including exact multi-vector comparators and statistical tests) to substantiate that the shared representation incurs no hidden quality or efficiency cost.
Minor comments (1)
- Abstract: the phrasing 'without increasing storage requirements compared to a multi-vector retrieval model' would be clearer if it explicitly stated the storage metric (e.g., bytes per document or embedding dimension) used for the comparison.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and positive assessment of the significance of our unified model for on-device RAG. We appreciate the recognition that consolidating retrieval and context compression into a single representation offers a practical advance. We address the major comment on the experimental section below and will revise the manuscript accordingly.
Point-by-point responses
-
Referee: [§5] §5 (Experiments): the claim that performance matches a traditional RAG reader at 1/10 context size is central yet presented without reported baselines, datasets, ablation controls, or error bars in the abstract; the full experimental section must supply these details (including exact multi-vector comparators and statistical tests) to substantiate that the shared representation incurs no hidden quality or efficiency cost.
Authors: We agree that rigorous experimental details are essential to substantiate the central claim. The abstract provides only a high-level summary, as is standard. The full §5 of the manuscript already describes the evaluation datasets, ablation controls on the shared representation, and comparisons to multi-vector retrieval baselines while reporting storage usage. To further strengthen the evidence that the unified approach incurs no hidden quality or efficiency costs, we will revise the section to include error bars on all metrics, precise specifications of the multi-vector comparators (including their exact configurations), and statistical significance tests (e.g., paired t-tests) for the performance equivalence at reduced context size. These additions will be incorporated in the revised manuscript.
Revision: yes
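A minimal version of the promised significance test, assuming paired per-question scores like those sketched earlier. scipy.stats.ttest_rel is a real function, but the equivalence margin is an assumed editorial choice, not a value from the paper.

```python
from scipy.stats import ttest_rel

def parity_check(paired, margin=0.01):
    """Paired t-test on per-question scores; 'parity' here means any
    accuracy drop is both statistically and practically insignificant."""
    full = [f for f, _ in paired]
    unified = [u for _, u in paired]
    result = ttest_rel(full, unified)
    gap = sum(full) / len(full) - sum(unified) / len(unified)
    return {"mean_gap": gap,
            "p_value": result.pvalue,
            "within_margin": gap <= margin}
```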
Circularity Check
No significant circularity: the paper is an empirical model proposal with no self-referential derivations.
Full rationale
The paper proposes a unified model for on-device RAG that shares representations between retrieval and context compression, claiming empirical performance parity at reduced context size. No equations, derivations, or first-principles predictions appear in the abstract or the described claims. All load-bearing assertions (e.g., matching traditional RAG with 1/10 of the context and no extra storage) are framed as experimental outcomes rather than as reductions to fitted inputs or self-citations. The claims are grounded in reported evaluations against external benchmarks, with no self-definitional loops, no fitted predictions renamed as results, and no uniqueness theorems imported from the authors' prior work.