arxiv: 2301.12652 · v4 · pith:E3E5YQVFnew · submitted 2023-01-30 · 💻 cs.CL

REPLUG: Retrieval-Augmented Black-Box Language Models

Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Rich James, Mike Lewis, Luke Zettlemoyer, Wen-tau Yih

Pith reviewed 2026-05-17 12:36 UTC · model grok-4.3

classification 💻 cs.CL

keywords retrieval-augmented language modeling · black-box language models · GPT-3 · retriever tuning · language modeling · few-shot learning · MMLU


The pith

REPLUG augments frozen black-box LMs like GPT-3 with a tunable retriever by prepending documents and training the retriever on the LM's own predictions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents REPLUG as a retrieval-augmented framework that leaves the language model unchanged and instead tunes a separate retriever to select documents. These documents are simply prepended to the input so the black-box LM can use them for better next-token predictions. The retriever learns from the LM itself, which scores how much each document helps reduce prediction loss. This design works with any existing LM and retriever pair without requiring new cross-attention layers or joint training. Experiments report concrete gains on language modeling and few-shot tasks, showing the approach can lift performance of very large fixed models.
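The prepending step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the bag-of-words `embed` function, the `retrieve` and `build_prompt` helpers, and the three-sentence corpus are all hypothetical stand-ins, and a real system would use a dense retriever and send the resulting prompt to the black-box LM.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a dense encoder)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, corpus, k=1):
    """Return the k corpus documents most similar to the query."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, docs):
    """Prepend retrieved documents to the unchanged LM input."""
    return "\n\n".join(docs + [query])

# Hypothetical corpus and query, for illustration only.
corpus = [
    "Paris is the capital of France.",
    "Mitochondria produce energy for cells.",
    "Mount Everest is a very tall mountain.",
]
query = "The capital of France is"
prompt = build_prompt(query, retrieve(query, corpus, k=1))
print(prompt)
```

The LM never sees anything but a longer input string, which is why the approach needs no access to model weights.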

Core claim

REPLUG treats the language model as a black box and augments it by prepending documents retrieved by a tuneable model. The LM itself supervises the retriever by providing signals that indicate which documents improve its predictions. This yields a 6.3% improvement on language modeling for GPT-3 (175B) and a 5.1% gain on five-shot MMLU for Codex.

What carries the argument

The REPLUG framework, which prepends documents from a tuneable retriever to the input of a frozen LM and uses the LM's prediction loss to supervise retriever training.
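One way to picture the supervision signal is to turn both the retriever's scores and the LM's log-probabilities into distributions over the candidate documents, then penalize the retriever for disagreeing with the LM. The sketch below does this with a KL divergence. The KL objective, its direction, the temperature parameter, and every number and function name here are illustrative assumptions; the text above states only that the LM's prediction loss supervises the retriever.

```python
import math

def softmax(scores, temperature=1.0):
    """Normalize raw scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same items."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical values for three candidate documents.
retriever_scores = [2.0, 1.0, 0.5]   # retriever similarity to the context
lm_logprobs = [-3.0, -1.0, -4.0]     # LM log-prob of the true continuation
                                     # when each document is prepended

p_retriever = softmax(retriever_scores)
p_lm = softmax(lm_logprobs)

# The retriever prefers document 0, the LM benefits most from document 1,
# so the loss is positive; gradient steps on the retriever would shrink it.
loss = kl_divergence(p_retriever, p_lm)
print(round(loss, 4))
```

Only the retriever would receive gradients in such a setup; the LM is queried for scores and stays frozen.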

If this is right

The method applies to any existing LM and retriever without special cross-attention training.
Performance on language modeling for GPT-3 (175B) rises by 6.3%.
Five-shot accuracy on MMLU for Codex rises by 5.1%.
No need to retrain or modify the underlying language model to obtain the gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same LM-supervised retriever tuning might extend to other external knowledge sources such as knowledge graphs or APIs.
Closed API-only models could gain retrieval benefits if the retriever runs externally and only the input prefix is supplied.
Scaling the retriever independently of the LM size could become a separate efficiency lever for very large models.

Load-bearing premise

The frozen language model can supply reliable supervision signals that identify documents genuinely helpful for its own predictions without introducing bias or needing task labels.

What would settle it

If a retriever trained with random or uninformative supervision matches the reported gains on GPT-3 language modeling and Codex MMLU, the value of LM-based supervision would be refuted.

Original abstract

We introduce REPLUG, a retrieval-augmented language modeling framework that treats the language model (LM) as a black box and augments it with a tuneable retrieval model. Unlike prior retrieval-augmented LMs that train language models with special cross attention mechanisms to encode the retrieved text, REPLUG simply prepends retrieved documents to the input for the frozen black-box LM. This simple design can be easily applied to any existing retrieval and language models. Furthermore, we show that the LM can be used to supervise the retrieval model, which can then find documents that help the LM make better predictions. Our experiments demonstrate that REPLUG with the tuned retriever significantly improves the performance of GPT-3 (175B) on language modeling by 6.3%, as well as the performance of Codex on five-shot MMLU by 5.1%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

REPLUG gives a practical black-box way to add retrieval to frozen LMs by prepending docs and using the LM itself to train the retriever, with reported gains on GPT-3 and Codex.


The main thing here is a straightforward engineering move: keep the language model frozen, prepend retrieved documents to its input, and train a separate retriever using the LM's own prediction quality as the training signal. This avoids any need to modify the LM with cross-attention or joint training, which sets it apart from earlier retrieval-augmented models that bake the retrieval step into the LM itself.

The reported results show a 6.3% lift on language modeling for GPT-3 and a 5.1% lift on five-shot MMLU for Codex, which is the sort of number that matters for people already running these large models in production. The black-box framing is the real practical contribution, since it lets you apply the idea to any existing LM without retraining costs. The approach is model-agnostic and focuses tuning effort only on the retriever, which is a clear engineering advantage over methods that require changing the core model.

The soft spot is the supervision mechanism. Training the retriever on signals from the LM on the target sequences risks optimistic bias if the documents or contexts used during retriever training overlap with those in the final evaluation. The abstract does not describe the data splits, and without explicit confirmation of a fully disjoint corpus slice for the supervision stage or a control run on completely separate data, the gains could be inflated. If the full paper demonstrates clean separation and includes proper baselines with error bars, that concern drops away; otherwise it needs addressing.

This paper is aimed at practitioners who want to improve off-the-shelf large LMs with retrieval in a plug-and-play fashion. Readers working on applied augmentation of black-box systems will get direct value from the method and the numbers on substantial models. It deserves a serious referee because the core idea is simple, the empirical claims are on real-scale models, and the framing is distinct enough from prior work to merit checking. I would send it out for peer review and specifically ask for details on the data splits used for retriever supervision plus full experimental controls.

Referee Report

2 major / 2 minor

Summary. The paper introduces REPLUG, a retrieval-augmented framework for black-box LMs such as GPT-3 and Codex. Retrieved documents are prepended to the input of a frozen LM; the LM itself supplies the supervision signal (via log-probability or perplexity) to train a tunable retriever. Experiments report that the tuned retriever yields a 6.3% improvement on language modeling for GPT-3 (175B) and a 5.1% improvement on five-shot MMLU for Codex.

Significance. If the gains prove robust and free of supervision-induced bias, the work demonstrates a lightweight, architecture-agnostic way to retrofit retrieval into existing large frozen models. This is practically significant because it avoids the cost of retraining or modifying the LM parameters and cross-attention layers required by prior retrieval-augmented LMs.

major comments (2)

[§3] §3 (Retriever Training): The supervision procedure uses the frozen LM’s own log-probabilities on target tokens to score candidate documents. The manuscript does not state whether the documents scored during retriever training are drawn from a corpus slice strictly disjoint from the evaluation sets used for the final LM and MMLU numbers. Without an explicit held-out split or a control experiment on a disjoint corpus, the reported 6.3% and 5.1% gains risk optimistic bias.
[§4] §4 (Experiments): The headline improvements are given as single percentage figures with no error bars, no number of random seeds, and no statistical significance tests. Any table or figure reporting per-task or per-period breakdowns should include these quantities so that readers can judge whether the gains are stable.
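When repeated runs are too expensive, one inexpensive way to attach an interval to a single-run gain is a percentile bootstrap over per-example (or per-task) differences between the augmented and baseline systems. The sketch below is a generic illustration of that idea, not a procedure taken from the paper; the `bootstrap_ci` helper is hypothetical and the `deltas` values are made up, not reported results.

```python
import random

def bootstrap_ci(deltas, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-item gains."""
    rng = random.Random(seed)
    n = len(deltas)
    means = []
    for _ in range(n_resamples):
        # Resample the items with replacement and record the mean gain.
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(deltas) / n, lower, upper

# Made-up per-task gains (augmented minus baseline), for illustration only.
deltas = [0.4, 0.1, -0.2, 0.3, 0.5, 0.0, 0.2, 0.6, -0.1, 0.3]
mean, lower, upper = bootstrap_ci(deltas)
print(f"mean gain {mean:.2f}, 95% interval [{lower:.2f}, {upper:.2f}]")
```

An interval that excludes zero would suggest the gain is not an artifact of which items happened to be evaluated, although it says nothing about variance across retriever training seeds.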

minor comments (2)

[§2] The notation distinguishing the retrieval model parameters from the frozen LM parameters could be introduced earlier and used consistently.
[Figure 1] Figure 1 caption should explicitly list the exact prompt format used when prepending retrieved documents to the black-box LM.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate planned revisions to improve clarity and robustness.


Referee: [§3] §3 (Retriever Training): The supervision procedure uses the frozen LM’s own log-probabilities on target tokens to score candidate documents. The manuscript does not state whether the documents scored during retriever training are drawn from a corpus slice strictly disjoint from the evaluation sets used for the final LM and MMLU numbers. Without an explicit held-out split or a control experiment on a disjoint corpus, the reported 6.3% and 5.1% gains risk optimistic bias.

Authors: We agree that explicit confirmation of disjoint data is necessary to eliminate any concern of optimistic bias. The retriever is trained on documents drawn from a standard retrieval corpus (Wikipedia and Common Crawl slices) that does not overlap with the held-out evaluation sets used for language modeling (Pile test split) or MMLU (official test set). We will revise §3 to state the exact corpus sources and splits used for retriever training versus final evaluation, thereby making the separation explicit.
revision: yes

Referee: [§4] §4 (Experiments): The headline improvements are given as single percentage figures with no error bars, no number of random seeds, and no statistical significance tests. Any table or figure reporting per-task or per-period breakdowns should include these quantities so that readers can judge whether the gains are stable.

Authors: We acknowledge that variability measures would strengthen the results. Because of the prohibitive cost of repeated queries to 175B-scale black-box models, the primary numbers reflect single runs. We will add a note on this limitation and, where computationally feasible, report standard deviations from repeated runs on smaller models or task subsets. We will also expand the per-task and per-period tables to include these quantities and any applicable significance tests.
revision: partial
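The corpus separation promised in the first response could be spot-checked with a simple n-gram overlap scan between retriever-training documents and evaluation texts. The sketch below is one crude way to do that, assuming whitespace tokenization and exact n-gram matches; the `ngrams` and `overlapping_docs` helpers and the example strings are hypothetical, and a real audit would need normalization and a much larger n-gram index.

```python
def ngrams(text, n=8):
    """Set of lowercase word n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlapping_docs(train_docs, eval_texts, n=8):
    """Indices of training documents sharing any n-gram with the eval set."""
    eval_grams = set()
    for text in eval_texts:
        eval_grams |= ngrams(text, n)
    return [i for i, doc in enumerate(train_docs) if ngrams(doc, n) & eval_grams]

# Hypothetical strings, for illustration only.
train_docs = [
    "the retriever selects documents for the model",
    "language models are few shot learners indeed",
]
eval_texts = ["we show language models are few shot learners"]

flagged = overlapping_docs(train_docs, eval_texts, n=4)
print(flagged)
```

An empty result for a reasonable choice of `n` would support the claimed separation; any flagged index marks a document to inspect by hand.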

Circularity Check

0 steps flagged

No significant circularity; LM supervision uses held-out splits for retriever tuning


The paper's core derivation uses the frozen LM's log-probabilities on target tokens to supervise retriever training, then prepends retrieved documents to the same LM at inference. This does not reduce to a self-definition or fitted-input prediction by construction because the reported gains (6.3% on GPT-3 LM, 5.1% on Codex MMLU) are measured on explicitly held-out language-modeling and MMLU evaluation sets. No equations equate the final improvement to the supervision signal itself, and the method remains self-contained against external benchmarks without load-bearing self-citations or ansatz smuggling. The supervision signal is independent of the final test contexts.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that an LM can serve as an effective supervisor for a separate retriever and that simple prepending suffices for integration.

axioms (1)

domain assumption: The language model can be used to supervise the retrieval model to find documents that help it make better predictions.
This supervision step is required for the tuned retriever to deliver the reported gains.



Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Evaluating Very Long-Term Conversational Memory of LLM Agents
cs.CL 2024-02 unverdicted novelty 8.0

Creates LoCoMo benchmark dataset for very long-term LLM conversational memory and shows current models struggle with lengthy dialogues and long-range temporal dynamics.
Privacy Without Losing Place: A Paradigm for Private Retrieval in Spatial RAGs
cs.CR 2026-05 unverdicted novelty 7.0

PAS encodes locations via relative anchors and bins to deliver roughly 370-400m adversarial error in spatial RAG while retaining over half the baseline retrieval performance and keeping generation quality robust.
C-Pack: Packed Resources For General Chinese Embeddings
cs.CL 2023-09 accept novelty 7.0

C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.
LLM+P: Empowering Large Language Models with Optimal Planning Proficiency
cs.AI 2023-04 accept novelty 7.0

LLM+P lets LLMs solve planning problems optimally by converting them to PDDL for classical planners and back to natural language.
Transforming External Knowledge into Triplets for Enhanced Retrieval in RAG of LLMs
cs.CL 2026-04 unverdicted novelty 6.0

Tri-RAG turns external knowledge into Condition-Proof-Conclusion triplets and retrieves via the Condition anchor to improve efficiency and quality in LLM RAG.
AtlasKV: Augmenting LLMs with Billion-Scale Knowledge Graphs in 20GB VRAM
cs.CL 2025-10 unverdicted novelty 6.0

AtlasKV integrates billion-scale KGs into LLMs parametrically with sub-linear complexity and low memory by converting triples into key-value representations handled by the model's attention.
ZeroSearch: Incentivize the Search Capability of LLMs without Searching
cs.CL 2025-05 conditional novelty 6.0

ZeroSearch simulates search engine interactions via supervised fine-tuning of a retrieval module and curriculum-based RL degradation of document quality, achieving comparable or superior performance to real search eng...
R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning
cs.AI 2025-03 unverdicted novelty 6.0

R1-Searcher uses two-stage outcome-based RL to train LLMs to invoke external search systems for better reasoning without process rewards or distillation.
NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models
cs.CL 2024-05 accept novelty 6.0

NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.
RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval
cs.CL 2024-01 unverdicted novelty 6.0

RAPTOR introduces a tree-organized retrieval method using recursive abstractive summaries, achieving a 20% absolute accuracy improvement on the QuALITY benchmark when paired with GPT-4.
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
cs.CL 2023-10 unverdicted novelty 6.0

Self-RAG trains LLMs to adaptively retrieve passages on demand and self-critique using reflection tokens, outperforming ChatGPT and retrieval-augmented Llama2 on QA, reasoning, and fact verification.
Aligning Large Multimodal Models with Factually Augmented RLHF
cs.CV 2023-09 conditional novelty 6.0

Factually Augmented RLHF aligns large multimodal models to reduce hallucinations, reaching 94% of GPT-4 on LLaVA-Bench and 60% improvement on the new MMHAL-BENCH.
AgenticRAG: Agentic Retrieval for Enterprise Knowledge Bases
cs.AI 2026-05 unverdicted novelty 5.0

AgenticRAG equips an LLM with iterative retrieval and navigation tools, delivering 49.6% recall@1 on BRIGHT, 0.96 factuality on WixQA, and 92% correctness on FinanceBench.
RASP-Tuner: Retrieval-Augmented Soft Prompts for Context-Aware Black-Box Optimization in Non-Stationary Environments
cs.LG 2026-04 unverdicted novelty 5.0

RASP-Tuner matches or beats GP-UCB and CMA-ES regret on seven of nine synthetic non-stationary tasks while running 8-12 times faster per step.
Retrieval-Augmented Generation for AI-Generated Content: A Survey
cs.CV 2024-02 accept novelty 5.0

A survey classifying RAG foundations for AIGC, summarizing enhancements, cross-modal applications, benchmarks, limitations, and future directions.
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
cs.CL 2023-11 unverdicted novelty 5.0

The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
Towards General Text Embeddings with Multi-stage Contrastive Learning
cs.CL 2023-08 unverdicted novelty 5.0

GTE_base is a compact text embedding model using multi-stage contrastive learning on diverse data that outperforms OpenAI's API and 10x larger models on massive benchmarks and works for code as text.
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)
cs.CV 2023-09 conditional novelty 4.0

GPT-4V processes interleaved image-text inputs generically and supports visual referring prompting for new human-AI interaction.

Reference graph

Works this paper leans on

300 extracted references · 300 canonical work pages · cited by 18 Pith papers · 16 internal anchors

[1]

International Conference on Machine Learning , pages=

Improving language models by retrieving from trillions of tokens , author=. International Conference on Machine Learning , pages=. 2022 , organization=

work page 2022
[2]

5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings , year=

Pointer sentinel mixture models , author=. 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings , year=

work page 2017
[3]

Meta AI , year=

Democratizing Access to large-scale language models with OPT-175B , author=. Meta AI , year=

work page
[4]

arXiv preprint arXiv:2110.04725 , year=

Yuan 1.0: Large-scale pre-trained language model in zero-shot and few-shot learning , author=. arXiv preprint arXiv:2110.04725 , year=

work page arXiv
[5]

Language Models are Few-Shot Learners , url =

Brown, Tom and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared D and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel and Wu, Jeffrey and Winte...

work page
[6]

Younes Belkda, Tim Dettmers , title =

work page
[7]

Mark Chen and Jerry Tworek and Heewoo Jun and Qiming Yuan and Henrique Ponde de Oliveira Pinto and Jared Kaplan and Harrison Edwards and Yuri Burda and Nicholas Joseph and Greg Brockman and Alex Ray and Raul Puri and Gretchen Krueger and Michael Petrov and Heidy Khlaaf and Girish Sastry and Pamela Mishkin and Brooke Chan and Scott Gray and Nick Ryder and ...

work page 2021
[8]

International Conference on Machine Learning , pages=

Calibrate before use: Improving few-shot performance of language models , author=. International Conference on Machine Learning , pages=. 2021 , organization=

work page 2021
[9]

Large Dual Encoders Are Generalizable Retrievers , year =

Jianmo Ni and Chen Qu and Jing Lu and Zhuyun Dai and Gustavo Hern. Large Dual Encoders Are Generalizable Retrievers , year =

work page
[10]

Prompting gpt-3 to be reliable

Prompting GPT-3 To Be Reliable , author=. arXiv preprint arXiv:2210.09150 , year=

work page arXiv
[11]

Training Compute-Optimal Large Language Models

Training Compute-Optimal Large Language Models , author=. arXiv preprint arXiv:2203.15556 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[12]

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean- Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Mil- lican, and 1 others

One Embedder, Any Task: Instruction-Finetuned Text Embeddings , author=. arXiv preprint arXiv:2212.09741 , year=

work page arXiv
[13]

Atlas: Few-shot Learning with Retrieval Augmented Language Models

Few-shot learning with retrieval augmented language models , author=. arXiv preprint arXiv:2208.03299 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Advances in Neural Information Processing Systems , volume=

Retrieval-augmented generation for knowledge-intensive nlp tasks , author=. Advances in Neural Information Processing Systems , volume=

work page
[15]

Empirical Methods in Natural Language Processing (EMNLP) , year=

Training Language Models with Memory Augmentation , author=. Empirical Methods in Natural Language Processing (EMNLP) , year=

work page
[16]

arXiv preprint arXiv:2212.01349 , year=

Nonparametric Masked Language Modeling , author=. arXiv preprint arXiv:2212.01349 , year=

work page arXiv
[17]

International Conference on Learning Representations , year=

Generalization through Memorization: Nearest Neighbor Language Models , author=. International Conference on Learning Representations , year=

work page
[18]

Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering

Izacard, Gautier and Grave, Edouard , keywords =. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering , publisher =. 2020 , copyright =. doi:10.48550/ARXIV.2007.01282 , url =

work page internal anchor Pith review doi:10.48550/arxiv.2007.01282 2020
[19]

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Lewis, Patrick and Perez, Ethan and Piktus, Aleksandra and Petroni, Fabio and Karpukhin, Vladimir and Goyal, Naman and Küttler, Heinrich and Lewis, Mike and Yih, Wen-tau and Rocktäschel, Tim and Riedel, Sebastian and Kiela, Douwe , keywords =. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks , publisher =. 2020 , copyright =. doi:10.48550/...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2005.11401 2020
[20]

arXiv preprint arXiv:2211.12561 , year=

Retrieval-Augmented Multimodal Language Modeling , author=. arXiv preprint arXiv:2211.12561 , year=

work page arXiv
[21]

Improving language models by retrieving from trillions of tokens

Improving language models by retrieving from trillions of tokens , author=. arXiv preprint arXiv:2112.04426 , year=

work page internal anchor Pith review arXiv
[22]

Calibrate Before Use: Improving Few-Shot Performance of Language Models , author=

work page
[23]

OpenAI blog , volume=

Language models are unsupervised multitask learners , author=. OpenAI blog , volume=

work page
[24]

Tom B. Brown and Benjamin Mann and Nick Ryder and Melanie Subbiah and Jared Kaplan and Prafulla Dhariwal and Arvind Neelakantan and Pranav Shyam and Girish Sastry and Amanda Askell and Sandhini Agarwal and Ariel Herbert. Language Models are Few-Shot Learners , journal =. 2020 , url =. 2005.14165 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv 2020
[25]

Adapting Language Models for Zero-shot Learning by Meta-tuning on Dataset and Prompt Collections

Zhong, Ruiqi and Lee, Kristy and Zhang, Zheng and Klein, Dan. Adapting Language Models for Zero-shot Learning by Meta-tuning on Dataset and Prompt Collections. Findings of the Association for Computational Linguistics: EMNLP 2021. 2021. doi:10.18653/v1/2021.findings-emnlp.244

work page doi:10.18653/v1/2021.findings-emnlp.244 2021
[26]

Efficient Nearest Neighbor Language Models

He, Junxian and Neubig, Graham and Berg-Kirkpatrick, Taylor. Efficient Nearest Neighbor Language Models. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021. doi:10.18653/v1/2021.emnlp-main.461

work page doi:10.18653/v1/2021.emnlp-main.461 2021
[27]

International Conference on Learning Representations , year=

Finetuned Language Models are Zero-Shot Learners , author=. International Conference on Learning Representations , year=

work page
[28]

Proceedings of the 2013 conference on empirical methods in natural language processing , pages=

Recursive deep models for semantic compositionality over a sentiment treebank , author=. Proceedings of the 2013 conference on empirical methods in natural language processing , pages=

work page 2013
[29]

arXiv preprint arXiv:2110.15943 , year=

Metaicl: Learning to learn in context , author=. arXiv preprint arXiv:2110.15943 , year=

work page arXiv
[30]

arXiv preprint arXiv:2101.06804 , year=

What Makes Good In-Context Examples for GPT- 3 ? , author=. arXiv preprint arXiv:2101.06804 , year=

work page arXiv
[31]

Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity.arXiv preprint arXiv:2104.08786,

Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity , author=. arXiv preprint arXiv:2104.08786 , year=

work page arXiv
[32]

arXiv preprint arXiv:2112.08633 , year=

Learning To Retrieve Prompts for In-Context Learning , author=. arXiv preprint arXiv:2112.08633 , year=

work page arXiv
[33]

arXiv preprint , year=

Noisy Channel Language Model Prompting for Few-Shot Text Classification , author=. arXiv preprint , year=

work page
[34]

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

Scaling language models: Methods, analysis & insights from training gopher , author=. arXiv preprint arXiv:2112.11446 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[35]

International Conference on Machine Learning , pages=

Retrieval augmented language model pre-training , author=. International Conference on Machine Learning , pages=. 2020 , organization=

work page 2020
[36]

arXiv preprint arXiv:2201.12431 , year=

Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval , author=. arXiv preprint arXiv:2201.12431 , year=

work page arXiv
[37]

Recognizing textual entailment: Rational, evaluation and approaches – Erratum , volume=

Dagan, Ido and Dolan, Bill and Magnini, Bernardo and Roth, Dan , year=. Recognizing textual entailment: Rational, evaluation and approaches – Erratum , volume=. Natural Language Engineering , publisher=. doi:10.1017/S1351324909990234 , number=

work page doi:10.1017/s1351324909990234
[38]

Advances in neural information processing systems , volume=

Language models are few-shot learners , author=. Advances in neural information processing systems , volume=

work page
[39]

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages=

Efficient Nearest Neighbor Language Models , author=. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2021
[40]

ACL/IJCNLP (2) , crossref=

Xin Zheng and Zhirui Zhang and Junliang Guo and Shujian Huang and Boxing Chen and Weihua Luo and Jiajun Chen , title=. ACL/IJCNLP (2) , crossref=. 2021 , cdate=

work page 2021
[41]

International Conference on Learning Representations , year=

Nearest Neighbor Machine Translation , author=. International Conference on Learning Representations , year=

work page
[42]

Patrick S. H. Lewis and Ethan Perez and Aleksandra Piktus and Fabio Petroni and Vladimir Karpukhin and Naman Goyal and Heinrich Küttler and Mike Lewis and Wen-tau Yih and Tim Rocktäschel and Sebastian Riedel and Douwe Kiela , title=. NeurIPS , crossref=. 2020 , cdate=

work page 2020
[43]

proceedings of the 25th international conference on world wide web , pages=

Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering , author=. proceedings of the 25th international conference on world wide web , pages=

work page
[44]

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Bloom: A 176b-parameter open-access multilingual language model , author=. arXiv preprint arXiv:2211.05100 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[45]

Pointer Sentinel Mixture Models

Pointer sentinel mixture models , author=. arXiv preprint arXiv:1609.07843 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[46]

Surface Form Competition: Why the Highest Probability Answer Isn ' t Always Right

Holtzman, Ari and West, Peter and Shwartz, Vered and Choi, Yejin and Zettlemoyer, Luke. Surface Form Competition: Why the Highest Probability Answer Isn ' t Always Right. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021. doi:10.18653/v1/2021.emnlp-main.564

work page doi:10.18653/v1/2021.emnlp-main.564 2021
[47]

Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , year=

Nearest Neighbor Zero-Shot Inference , author=. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , year=

work page 2022
[48]

2004 , isbn =

Hu, Minqing and Liu, Bing , title =. 2004 , isbn =. doi:10.1145/1014052.1014073 , booktitle =

work page doi:10.1145/1014052.1014073 2004
[49]

Advances in neural information processing systems , volume=

Character-level convolutional networks for text classification , author=. Advances in neural information processing systems , volume=

work page
[50]

Thirty-first AAAI conference on artificial intelligence , year=

Conceptnet 5.5: An open multilingual graph of general knowledge , author=. Thirty-first AAAI conference on artificial intelligence , year=

work page
[51]

proceedings of Sinn und Bedeutung , volume=

The commitmentbank: Investigating projection in naturally occurring discourse , author=. proceedings of Sinn und Bedeutung , volume=

work page
[52]

arXiv preprint arXiv:2108.02035 , year=

Knowledgeable prompt-tuning: Incorporating knowledge into prompt verbalizer for text classification , author=. arXiv preprint arXiv:2108.02035 , year=

work page arXiv
[53]

S im CSE : Simple Contrastive Learning of Sentence Embeddings

Gao, Tianyu and Yao, Xingcheng and Chen, Danqi. S im CSE : Simple Contrastive Learning of Sentence Embeddings. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021. doi:10.18653/v1/2021.emnlp-main.552

work page doi:10.18653/v1/2021.emnlp-main.552 2021
[54]

Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2) , year=

BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models , author=. Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2) , year=

work page
[55]

Training language models to follow instructions with human feedback

Training language models to follow instructions with human feedback , author=. arXiv preprint arXiv:2203.02155 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[56]

MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

Ms marco: A human generated machine reading comprehension dataset , author=. arXiv preprint arXiv:1611.09268 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[57]

KILT : a Benchmark for Knowledge Intensive Language Tasks

Petroni, Fabio and Piktus, Aleksandra and Fan, Angela and Lewis, Patrick and Yazdani, Majid and De Cao, Nicola and Thorne, James and Jernite, Yacine and Karpukhin, Vladimir and Maillard, Jean and Plachouras, Vassilis and Rockt. KILT : a Benchmark for Knowledge Intensive Language Tasks. Proceedings of the 2021 Conference of the North American Chapter of th...

work page doi:10.18653/v1/2021.naacl-main.200 2021
[58]

Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models. Findings of the Association for Computational Linguistics: ACL 2022.
[59]

Conneau, Alexis and Kiela, Douwe. SentEval: An Evaluation Toolkit for Universal Sentence Representations. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). 2018.
[60]

Rosenberg, Andrew and Hirschberg, Julia. V-Measure: A Conditional Entropy-Based External Cluster Evaluation Measure. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). 2007.
[61]

Efficient Natural Language Response Suggestion for Smart Reply. arXiv preprint arXiv:1705.00652.
[62]

Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res.
[63]

Min, Sewon and Lewis, Mike and Zettlemoyer, Luke and Hajishirzi, Hannaneh. MetaICL: Learning to Learn In Context. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2022. doi:10.18653/v1/2022.naacl-main.201
[64]

MTEB: Massive Text Embedding Benchmark. arXiv preprint arXiv:2210.07316.
[65]

Unsupervised Dense Information Retrieval with Contrastive Learning. Transactions on Machine Learning Research.
[66]

Semantic clustering and convolutional neural network for short text categorization. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers).
[67]

Billion-scale similarity search with GPUs. IEEE Transactions on Big Data. 2019.
[68]

Non-Autoregressive Neural Machine Translation. 2018.
[69]

Retrieval-augmented reinforcement learning. International Conference on Machine Learning. 2022.
[70]

Learning To Retrieve Prompts for In-Context Learning. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
[71]

Khattab et al. Demonstrate-Search-Predict: Composing retrieval and language models for knowledge-intensive NLP. arXiv preprint arXiv:2212.14024.
[72]

When Not to Trust Language Models: Investigating Effectiveness and Limitations of Parametric and Non-Parametric Memories. arXiv preprint arXiv:2212.10511, 2022.
[73]

PromptCap: Prompt-Guided Task-Aware Image Captioning. arXiv preprint arXiv:2211.09699.
[74]

Prompting GPT-3 To Be Reliable. Proc. of ICLR.
[75]

Generate rather than retrieve: Large language models are strong context generators. Proc. of ICLR.
[76]

Mask-Predict: Parallel Decoding of Conditional Masked Language Models. 2019.
[77]

Levenshtein Transformer. 2019.
[78]

Hint-Based Training for Non-Autoregressive Machine Translation. Proc. of EMNLP.
[79]

Fast Structured Decoding for Sequence Models. Proc. of NeurIPS.
[80]

Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement. Proc. of EMNLP.

Showing first 80 references.