A Unified Model and Document Representation for On-Device Retrieval-Augmented Generation
Pith reviewed 2026-05-10 11:51 UTC · model grok-4.3
The pith
A single model and representation can handle both document retrieval and context compression for on-device RAG while matching full-context performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors introduce a unified model that produces a shared document representation usable for both retrieving relevant passages and compressing the retrieved content into a short context for the generator. On standard benchmarks this yields performance on par with conventional RAG pipelines while using an average of 1/10 the context size and without raising storage costs above those of a multi-vector retrieval model.
What carries the argument
The unified model whose single learned document representation supports both retrieval scoring and context compression for generation.
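To make the dual use concrete, here is a minimal sketch, assuming a ColBERT-style multi-vector encoder whose stored per-document vectors are scored with MaxSim for retrieval and projected into the generator's embedding space as a short soft prompt. Every name here (encode, maxsim_score, projector) is hypothetical, not the authors' API.

```python
import torch

def encode(texts, encoder):
    # Hypothetical shared encoder: one forward pass per document yields a
    # small matrix of vectors, (batch, n_vecs, d_model), stored once on disk.
    return torch.nn.functional.normalize(encoder(texts), dim=-1)

def maxsim_score(query_vecs, doc_vecs):
    """ColBERT-style late interaction: each query vector takes its best
    match among the document vectors; the sum is the relevance score."""
    sim = query_vecs @ doc_vecs.T        # (n_q, n_d) cosine similarities
    return sim.max(dim=1).values.sum()

def compressed_context(doc_vecs, projector):
    """Reuse the same stored vectors as the generator's context: project
    them into its embedding space instead of re-feeding the full text."""
    return projector(doc_vecs)           # (n_vecs, d_generator)

# The generator then consumes ~n_vecs embedding slots rather than the full
# passage, e.g. via model(inputs_embeds=...) in Hugging Face Transformers.
```

The design point this sketch illustrates: nothing extra is stored, because the vectors written at indexing time are exactly what retrieval scores and what generation consumes.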
If this is right
- On-device pipelines become practical for personal data without internet or external servers.
- KV cache and attention memory demands on the generative model drop sharply because far less context is supplied (see the arithmetic sketch after this list).
- Disk space stays equivalent to a multi-vector retriever since no extra embeddings are stored.
- The same representation can replace two separate components in future on-device systems.
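Illustrative arithmetic behind the memory and storage bullets above, a sketch using assumed, generic decoder and retriever shapes rather than numbers from the paper:

```python
# KV cache for a decoder: 2 tensors (K and V) per layer, each of
# kv_heads x head_dim x context_len elements.
layers, kv_heads, head_dim, bytes_fp16 = 28, 8, 128, 2

def kv_bytes(context_len):
    return 2 * layers * kv_heads * head_dim * context_len * bytes_fp16

full, tenth = kv_bytes(4096), kv_bytes(410)      # ~1/10 of the context
print(f"{full / 2**20:.0f} MiB vs {tenth / 2**20:.0f} MiB")  # 448 vs 45

# Multi-vector storage per document: n_vecs x dim x bytes. A shared
# representation pays this once; separate retrieval and compression
# embeddings would roughly double it.
n_vecs, dim = 16, 128
print(n_vecs * dim * bytes_fp16, "bytes per document")       # 4096
```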
Where Pith is reading between the lines
- The unification approach could be extended to merge additional on-device tasks such as summarization into the same representation.
- Real-world tests on mobile hardware would reveal actual latency and energy savings beyond benchmark numbers.
- Similar shared-representation designs might apply to other resource-constrained settings where retrieval and generation compete for memory.
Load-bearing premise
A single learned representation can simultaneously support high-quality retrieval and effective context compression without substantial quality loss under tight on-device memory limits.
What would settle it
A side-by-side evaluation on a standard RAG benchmark comparing the unified model at roughly 1/10 of the context against a traditional full-context RAG reader: a measurable drop in the unified model's generation accuracy would refute the parity claim, while matching accuracy would confirm it.
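A hedged sketch of that settling experiment, assuming exact-match scoring and two answer functions standing in for the full-context reader and the unified model; none of these names come from the paper.

```python
def exact_match(pred: str, gold: str) -> int:
    return int(pred.strip().lower() == gold.strip().lower())

def side_by_side(benchmark, answer_full, answer_unified):
    """Collect per-question EM for both systems so the gap can be
    tested pairwise, not just compared as two aggregate numbers."""
    paired = []
    for q in benchmark:  # each q: {"question", "answer", "docs"}
        em_f = exact_match(answer_full(q["question"], q["docs"]), q["answer"])
        em_u = exact_match(answer_unified(q["question"], q["docs"]), q["answer"])
        paired.append((em_f, em_u))
    return paired
```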
Original abstract
Traditional Retrieval-Augmented Generation (RAG) approaches generally assume that retrieval and generation occur on powerful servers removed from the end user. While this reduces local hardware constraints, it introduces significant drawbacks: privacy concerns regarding data access, recurring maintenance and storage costs, increased latency, and the necessity of an internet connection. On-device RAG addresses these challenges by executing the entire pipeline locally, making it ideal for querying sensitive personal information such as financial documents, contact details, and medical history. However, on-device deployment necessitates a delicate balance between limited memory and disk space. Specifically, the context size provided to the generative model must be restricted to manage KV cache and attention memory usage, while the size of stored embeddings must be minimized to preserve disk space. In this work, we propose a unified model that compresses the RAG context and utilizes the same representations for retrieval. This approach minimizes disk utilization compared to using separate representations, while significantly reducing the context size required for generation. With an average of 1/10 of the context, our model matches the performance of a traditional RAG reader without increasing storage requirements compared to a multi-vector retrieval model. This approach represents the first model to unify retrieval and context compression using a shared model and representation. We believe this work will inspire further consolidation of distinct models to optimize on-device performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a unified model for on-device Retrieval-Augmented Generation (RAG) that employs a single learned representation to handle both document retrieval and context compression. The central empirical claim is that this shared approach matches the performance of a traditional RAG reader while using an average of only 1/10 of the context length, without increasing storage requirements relative to multi-vector retrieval baselines, and that it is the first model to unify these functions.
Significance. If the reported performance equivalence holds under rigorous evaluation, the work offers a practical advance for privacy-preserving, low-latency on-device RAG by reducing both KV-cache memory during generation and disk usage for embeddings. The consolidation of retrieval and compression into one model and representation is a clear strength and could guide further efficiency work in resource-constrained settings.
Major comments (1)
- [§5] §5 (Experiments): the claim that performance matches a traditional RAG reader at 1/10 context size is central yet presented without reported baselines, datasets, ablation controls, or error bars in the abstract; the full experimental section must supply these details (including exact multi-vector comparators and statistical tests) to substantiate that the shared representation incurs no hidden quality or efficiency cost.
Minor comments (1)
- Abstract: the phrasing 'without increasing storage requirements compared to a multi-vector retrieval model' would be clearer if it explicitly stated the storage metric (e.g., bytes per document or embedding dimension) used for the comparison.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and positive assessment of the significance of our unified model for on-device RAG. We appreciate the recognition that consolidating retrieval and context compression into a single representation offers a practical advance. We address the major comment on the experimental section below and will revise the manuscript accordingly.
Point-by-point responses
-
Referee: [§5] §5 (Experiments): the claim that performance matches a traditional RAG reader at 1/10 context size is central yet presented without reported baselines, datasets, ablation controls, or error bars in the abstract; the full experimental section must supply these details (including exact multi-vector comparators and statistical tests) to substantiate that the shared representation incurs no hidden quality or efficiency cost.
Authors: We agree that rigorous experimental details are essential to substantiate the central claim. The abstract provides only a high-level summary, as is standard. The full §5 of the manuscript already describes the evaluation datasets, ablation controls on the shared representation, and comparisons to multi-vector retrieval baselines while reporting storage usage. To further strengthen the evidence that the unified approach incurs no hidden quality or efficiency costs, we will revise the section to include error bars on all metrics, precise specifications of the multi-vector comparators (including their exact configurations), and statistical significance tests (e.g., paired t-tests) for the performance equivalence at reduced context size. These additions will be incorporated in the revised manuscript.
Revision: yes
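A minimal version of the promised significance test, assuming paired per-question scores like those sketched earlier. scipy.stats.ttest_rel is a real function, but the equivalence margin is an assumed editorial choice, not a value from the paper.

```python
from scipy.stats import ttest_rel

def parity_check(paired, margin=0.01):
    """Paired t-test on per-question scores; 'parity' here means any
    accuracy drop is both statistically and practically insignificant."""
    full = [f for f, _ in paired]
    unified = [u for _, u in paired]
    result = ttest_rel(full, unified)
    gap = sum(full) / len(full) - sum(unified) / len(unified)
    return {"mean_gap": gap,
            "p_value": result.pvalue,
            "within_margin": gap <= margin}
```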
Circularity Check
No significant circularity: the paper is an empirical model proposal with no self-referential derivations.
Full rationale
The paper proposes a unified model for on-device RAG that shares representations between retrieval and context compression, claiming empirical performance parity at reduced context size. No equations, derivations, or first-principles predictions appear in the abstract or the described claims. All load-bearing assertions (e.g., matching traditional RAG with 1/10 of the context and no extra storage) are framed as experimental outcomes rather than as reductions to fitted inputs or self-citations. The claims are grounded in reported evaluations against external benchmarks, with no self-definitional loops, no fitted predictions renamed as results, and no uniqueness theorems imported from the authors' prior work.