pith. machine review for the scientific record. sign in

arxiv: 2301.12652 · v4 · pith:E3E5YQVFnew · submitted 2023-01-30 · 💻 cs.CL

REPLUG: Retrieval-Augmented Black-Box Language Models

Pith reviewed 2026-05-17 12:36 UTC · model grok-4.3

classification 💻 cs.CL
keywords retrieval-augmented language modelingblack-box language modelsGPT-3retriever tuninglanguage modelingfew-shot learningMMLU
0
0 comments X

The pith

REPLUG augments frozen black-box LMs like GPT-3 with a tunable retriever by prepending documents and training the retriever on the LM's own predictions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents REPLUG as a retrieval-augmented framework that leaves the language model unchanged and instead tunes a separate retriever to select documents. These documents are simply prepended to the input so the black-box LM can use them for better next-token predictions. The retriever learns from the LM itself, which scores how much each document helps reduce prediction loss. This design works with any existing LM and retriever pair without requiring new cross-attention layers or joint training. Experiments report concrete gains on language modeling and few-shot tasks, showing the approach can lift performance of very large fixed models.

Core claim

REPLUG treats the language model as a black box and augments it by prepending documents retrieved by a tuneable model. The LM itself supervises the retriever by providing signals that indicate which documents improve its predictions. This yields a 6.3% improvement on language modeling for GPT-3 (175B) and a 5.1% gain on five-shot MMLU for Codex.

What carries the argument

The REPLUG framework, which prepends documents from a tuneable retriever to the input of a frozen LM and uses the LM's prediction loss to supervise retriever training.

If this is right

  • The method applies to any existing LM and retriever without special cross-attention training.
  • Performance on language modeling for GPT-3 (175B) rises by 6.3%.
  • Five-shot accuracy on MMLU for Codex rises by 5.1%.
  • No need to retrain or modify the underlying language model to obtain the gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same LM-supervised retriever tuning might extend to other external knowledge sources such as knowledge graphs or APIs.
  • Closed API-only models could gain retrieval benefits if the retriever runs externally and only the input prefix is supplied.
  • Scaling the retriever independently of the LM size could become a separate efficiency lever for very large models.

Load-bearing premise

The frozen language model can supply reliable supervision signals that identify documents genuinely helpful for its own predictions without introducing bias or needing task labels.

What would settle it

If retraining the retriever on random or unhelpful documents eliminates the reported gains on GPT-3 language modeling and Codex MMLU, the value of LM-based supervision would be refuted.

read the original abstract

We introduce REPLUG, a retrieval-augmented language modeling framework that treats the language model (LM) as a black box and augments it with a tuneable retrieval model. Unlike prior retrieval-augmented LMs that train language models with special cross attention mechanisms to encode the retrieved text, REPLUG simply prepends retrieved documents to the input for the frozen black-box LM. This simple design can be easily applied to any existing retrieval and language models. Furthermore, we show that the LM can be used to supervise the retrieval model, which can then find documents that help the LM make better predictions. Our experiments demonstrate that REPLUG with the tuned retriever significantly improves the performance of GPT-3 (175B) on language modeling by 6.3%, as well as the performance of Codex on five-shot MMLU by 5.1%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces REPLUG, a retrieval-augmented framework for black-box LMs such as GPT-3 and Codex. Retrieved documents are prepended to the input of a frozen LM; the LM itself supplies the supervision signal (via log-probability or perplexity) to train a tunable retriever. Experiments report that the tuned retriever yields a 6.3% improvement on language modeling for GPT-3 (175B) and a 5.1% improvement on five-shot MMLU for Codex.

Significance. If the gains prove robust and free of supervision-induced bias, the work demonstrates a lightweight, architecture-agnostic way to retrofit retrieval into existing large frozen models. This is practically significant because it avoids the cost of retraining or modifying the LM parameters and cross-attention layers required by prior retrieval-augmented LMs.

major comments (2)
  1. [§3] §3 (Retriever Training): The supervision procedure uses the frozen LM’s own log-probabilities on target tokens to score candidate documents. The manuscript does not state whether the documents scored during retriever training are drawn from a corpus slice strictly disjoint from the evaluation sets used for the final LM and MMLU numbers. Without an explicit held-out split or a control experiment on a disjoint corpus, the reported 6.3% and 5.1% gains risk optimistic bias.
  2. [§4] §4 (Experiments): The headline improvements are given as single percentage figures with no error bars, no number of random seeds, and no statistical significance tests. Table or figure reporting the per-task or per-period breakdowns should include these quantities so that the reader can judge whether the gains are stable.
minor comments (2)
  1. [§2] The notation distinguishing the retrieval model parameters from the frozen LM parameters could be introduced earlier and used consistently.
  2. [Figure 1] Figure 1 caption should explicitly list the exact prompt format used when prepending retrieved documents to the black-box LM.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate planned revisions to improve clarity and robustness.

read point-by-point responses
  1. Referee: [§3] §3 (Retriever Training): The supervision procedure uses the frozen LM’s own log-probabilities on target tokens to score candidate documents. The manuscript does not state whether the documents scored during retriever training are drawn from a corpus slice strictly disjoint from the evaluation sets used for the final LM and MMLU numbers. Without an explicit held-out split or a control experiment on a disjoint corpus, the reported 6.3% and 5.1% gains risk optimistic bias.

    Authors: We agree that explicit confirmation of disjoint data is necessary to eliminate any concern of optimistic bias. The retriever is trained on documents drawn from a standard retrieval corpus (Wikipedia and Common Crawl slices) that does not overlap with the held-out evaluation sets used for language modeling (Pile test split) or MMLU (official test set). We will revise §3 to state the exact corpus sources and splits used for retriever training versus final evaluation, thereby making the separation explicit. revision: yes

  2. Referee: [§4] §4 (Experiments): The headline improvements are given as single percentage figures with no error bars, no number of random seeds, and no statistical significance tests. Table or figure reporting the per-task or per-period breakdowns should include these quantities so that the reader can judge whether the gains are stable.

    Authors: We acknowledge that variability measures would strengthen the results. Because of the prohibitive cost of repeated queries to 175B-scale black-box models, the primary numbers reflect single runs. We will add a note on this limitation and, where computationally feasible, report standard deviations from repeated runs on smaller models or task subsets. We will also expand the per-task and per-period tables to include these quantities and any applicable significance tests. revision: partial

Circularity Check

0 steps flagged

No significant circularity; LM supervision uses held-out splits for retriever tuning

full rationale

The paper's core derivation uses the frozen LM's log-probabilities on target tokens to supervise retriever training, then prepends retrieved documents to the same LM at inference. This does not reduce to a self-definition or fitted-input prediction by construction because the reported gains (6.3% on GPT-3 LM, 5.1% on Codex MMLU) are measured on explicitly held-out language-modeling and MMLU evaluation sets. No equations equate the final improvement to the supervision signal itself, and the method remains self-contained against external benchmarks without load-bearing self-citations or ansatz smuggling. The supervision signal is independent of the final test contexts.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that an LM can serve as an effective supervisor for a separate retriever and that simple prepending suffices for integration.

axioms (1)
  • domain assumption The language model can be used to supervise the retrieval model to find documents that help it make better predictions.
    This supervision step is required for the tuned retriever to deliver the reported gains.

pith-pipeline@v0.9.0 · 5469 in / 1155 out tokens · 50158 ms · 2026-05-17T12:36:41.264902+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Evaluating Very Long-Term Conversational Memory of LLM Agents

    cs.CL 2024-02 unverdicted novelty 8.0

    Creates LoCoMo benchmark dataset for very long-term LLM conversational memory and shows current models struggle with lengthy dialogues and long-range temporal dynamics.

  2. Privacy Without Losing Place: A Paradigm for Private Retrieval in Spatial RAGs

    cs.CR 2026-05 unverdicted novelty 7.0

    PAS encodes locations via relative anchors and bins to deliver roughly 370-400m adversarial error in spatial RAG while retaining over half the baseline retrieval performance and keeping generation quality robust.

  3. C-Pack: Packed Resources For General Chinese Embeddings

    cs.CL 2023-09 accept novelty 7.0

    C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.

  4. LLM+P: Empowering Large Language Models with Optimal Planning Proficiency

    cs.AI 2023-04 accept novelty 7.0

    LLM+P lets LLMs solve planning problems optimally by converting them to PDDL for classical planners and back to natural language.

  5. Transforming External Knowledge into Triplets for Enhanced Retrieval in RAG of LLMs

    cs.CL 2026-04 unverdicted novelty 6.0

    Tri-RAG turns external knowledge into Condition-Proof-Conclusion triplets and retrieves via the Condition anchor to improve efficiency and quality in LLM RAG.

  6. AtlasKV: Augmenting LLMs with Billion-Scale Knowledge Graphs in 20GB VRAM

    cs.CL 2025-10 unverdicted novelty 6.0

    AtlasKV integrates billion-scale KGs into LLMs parametrically with sub-linear complexity and low memory by converting triples into key-value representations handled by the model's attention.

  7. ZeroSearch: Incentivize the Search Capability of LLMs without Searching

    cs.CL 2025-05 conditional novelty 6.0

    ZeroSearch simulates search engine interactions via supervised fine-tuning of a retrieval module and curriculum-based RL degradation of document quality, achieving comparable or superior performance to real search eng...

  8. R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning

    cs.AI 2025-03 unverdicted novelty 6.0

    R1-Searcher uses two-stage outcome-based RL to train LLMs to invoke external search systems for better reasoning without process rewards or distillation.

  9. NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

    cs.CL 2024-05 accept novelty 6.0

    NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.

  10. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval

    cs.CL 2024-01 unverdicted novelty 6.0

    RAPTOR introduces a tree-organized retrieval method using recursive abstractive summaries, achieving a 20% absolute accuracy improvement on the QuALITY benchmark when paired with GPT-4.

  11. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

    cs.CL 2023-10 unverdicted novelty 6.0

    Self-RAG trains LLMs to adaptively retrieve passages on demand and self-critique using reflection tokens, outperforming ChatGPT and retrieval-augmented Llama2 on QA, reasoning, and fact verification.

  12. Aligning Large Multimodal Models with Factually Augmented RLHF

    cs.CV 2023-09 conditional novelty 6.0

    Factually Augmented RLHF aligns large multimodal models to reduce hallucinations, reaching 94% of GPT-4 on LLaVA-Bench and 60% improvement on the new MMHAL-BENCH.

  13. AgenticRAG: Agentic Retrieval for Enterprise Knowledge Bases

    cs.AI 2026-05 unverdicted novelty 5.0

    AgenticRAG equips an LLM with iterative retrieval and navigation tools, delivering 49.6% recall@1 on BRIGHT, 0.96 factuality on WixQA, and 92% correctness on FinanceBench.

  14. RASP-Tuner: Retrieval-Augmented Soft Prompts for Context-Aware Black-Box Optimization in Non-Stationary Environments

    cs.LG 2026-04 unverdicted novelty 5.0

    RASP-Tuner matches or beats GP-UCB and CMA-ES regret on seven of nine synthetic non-stationary tasks while running 8-12 times faster per step.

  15. Retrieval-Augmented Generation for AI-Generated Content: A Survey

    cs.CV 2024-02 accept novelty 5.0

    A survey classifying RAG foundations for AIGC, summarizing enhancements, cross-modal applications, benchmarks, limitations, and future directions.

  16. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

    cs.CL 2023-11 unverdicted novelty 5.0

    The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.

  17. Towards General Text Embeddings with Multi-stage Contrastive Learning

    cs.CL 2023-08 unverdicted novelty 5.0

    GTE_base is a compact text embedding model using multi-stage contrastive learning on diverse data that outperforms OpenAI's API and 10x larger models on massive benchmarks and works for code as text.

  18. The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)

    cs.CV 2023-09 conditional novelty 4.0

    GPT-4V processes interleaved image-text inputs generically and supports visual referring prompting for new human-AI interaction.

Reference graph

Works this paper leans on

300 extracted references · 300 canonical work pages · cited by 18 Pith papers · 16 internal anchors

  1. [1]

    International Conference on Machine Learning , pages=

    Improving language models by retrieving from trillions of tokens , author=. International Conference on Machine Learning , pages=. 2022 , organization=

  2. [2]

    5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings , year=

    Pointer sentinel mixture models , author=. 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings , year=

  3. [3]

    Meta AI , year=

    Democratizing Access to large-scale language models with OPT-175B , author=. Meta AI , year=

  4. [4]

    arXiv preprint arXiv:2110.04725 , year=

    Yuan 1.0: Large-scale pre-trained language model in zero-shot and few-shot learning , author=. arXiv preprint arXiv:2110.04725 , year=

  5. [5]

    Language Models are Few-Shot Learners , url =

    Brown, Tom and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared D and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel and Wu, Jeffrey and Winte...

  6. [6]

    Younes Belkda, Tim Dettmers , title =

  7. [7]

    Mark Chen and Jerry Tworek and Heewoo Jun and Qiming Yuan and Henrique Ponde de Oliveira Pinto and Jared Kaplan and Harrison Edwards and Yuri Burda and Nicholas Joseph and Greg Brockman and Alex Ray and Raul Puri and Gretchen Krueger and Michael Petrov and Heidy Khlaaf and Girish Sastry and Pamela Mishkin and Brooke Chan and Scott Gray and Nick Ryder and ...

  8. [8]

    International Conference on Machine Learning , pages=

    Calibrate before use: Improving few-shot performance of language models , author=. International Conference on Machine Learning , pages=. 2021 , organization=

  9. [9]

    Large Dual Encoders Are Generalizable Retrievers , year =

    Jianmo Ni and Chen Qu and Jing Lu and Zhuyun Dai and Gustavo Hern. Large Dual Encoders Are Generalizable Retrievers , year =

  10. [10]

    Prompting gpt-3 to be reliable

    Prompting GPT-3 To Be Reliable , author=. arXiv preprint arXiv:2210.09150 , year=

  11. [11]

    Training Compute-Optimal Large Language Models

    Training Compute-Optimal Large Language Models , author=. arXiv preprint arXiv:2203.15556 , year=

  12. [12]
  13. [13]

    Atlas: Few-shot Learning with Retrieval Augmented Language Models

    Few-shot learning with retrieval augmented language models , author=. arXiv preprint arXiv:2208.03299 , year=

  14. [14]

    Advances in Neural Information Processing Systems , volume=

    Retrieval-augmented generation for knowledge-intensive nlp tasks , author=. Advances in Neural Information Processing Systems , volume=

  15. [15]

    Empirical Methods in Natural Language Processing (EMNLP) , year=

    Training Language Models with Memory Augmentation , author=. Empirical Methods in Natural Language Processing (EMNLP) , year=

  16. [16]

    arXiv preprint arXiv:2212.01349 , year=

    Nonparametric Masked Language Modeling , author=. arXiv preprint arXiv:2212.01349 , year=

  17. [17]

    International Conference on Learning Representations , year=

    Generalization through Memorization: Nearest Neighbor Language Models , author=. International Conference on Learning Representations , year=

  18. [18]

    Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering

    Izacard, Gautier and Grave, Edouard , keywords =. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering , publisher =. 2020 , copyright =. doi:10.48550/ARXIV.2007.01282 , url =

  19. [19]

    Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

    Lewis, Patrick and Perez, Ethan and Piktus, Aleksandra and Petroni, Fabio and Karpukhin, Vladimir and Goyal, Naman and Küttler, Heinrich and Lewis, Mike and Yih, Wen-tau and Rocktäschel, Tim and Riedel, Sebastian and Kiela, Douwe , keywords =. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks , publisher =. 2020 , copyright =. doi:10.48550/...

  20. [20]

    arXiv preprint arXiv:2211.12561 , year=

    Retrieval-Augmented Multimodal Language Modeling , author=. arXiv preprint arXiv:2211.12561 , year=

  21. [21]

    Improving language models by retrieving from trillions of tokens

    Improving language models by retrieving from trillions of tokens , author=. arXiv preprint arXiv:2112.04426 , year=

  22. [22]

    Calibrate Before Use: Improving Few-Shot Performance of Language Models , author=

  23. [23]

    OpenAI blog , volume=

    Language models are unsupervised multitask learners , author=. OpenAI blog , volume=

  24. [24]

    Tom B. Brown and Benjamin Mann and Nick Ryder and Melanie Subbiah and Jared Kaplan and Prafulla Dhariwal and Arvind Neelakantan and Pranav Shyam and Girish Sastry and Amanda Askell and Sandhini Agarwal and Ariel Herbert. Language Models are Few-Shot Learners , journal =. 2020 , url =. 2005.14165 , timestamp =

  25. [25]

    Adapting Language Models for Zero-shot Learning by Meta-tuning on Dataset and Prompt Collections

    Zhong, Ruiqi and Lee, Kristy and Zhang, Zheng and Klein, Dan. Adapting Language Models for Zero-shot Learning by Meta-tuning on Dataset and Prompt Collections. Findings of the Association for Computational Linguistics: EMNLP 2021. 2021. doi:10.18653/v1/2021.findings-emnlp.244

  26. [26]

    Efficient Nearest Neighbor Language Models

    He, Junxian and Neubig, Graham and Berg-Kirkpatrick, Taylor. Efficient Nearest Neighbor Language Models. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021. doi:10.18653/v1/2021.emnlp-main.461

  27. [27]

    International Conference on Learning Representations , year=

    Finetuned Language Models are Zero-Shot Learners , author=. International Conference on Learning Representations , year=

  28. [28]

    Proceedings of the 2013 conference on empirical methods in natural language processing , pages=

    Recursive deep models for semantic compositionality over a sentiment treebank , author=. Proceedings of the 2013 conference on empirical methods in natural language processing , pages=

  29. [29]

    arXiv preprint arXiv:2110.15943 , year=

    Metaicl: Learning to learn in context , author=. arXiv preprint arXiv:2110.15943 , year=

  30. [30]

    arXiv preprint arXiv:2101.06804 , year=

    What Makes Good In-Context Examples for GPT- 3 ? , author=. arXiv preprint arXiv:2101.06804 , year=

  31. [31]

    Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity.arXiv preprint arXiv:2104.08786,

    Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity , author=. arXiv preprint arXiv:2104.08786 , year=

  32. [32]

    arXiv preprint arXiv:2112.08633 , year=

    Learning To Retrieve Prompts for In-Context Learning , author=. arXiv preprint arXiv:2112.08633 , year=

  33. [33]

    arXiv preprint , year=

    Noisy Channel Language Model Prompting for Few-Shot Text Classification , author=. arXiv preprint , year=

  34. [34]

    Scaling Language Models: Methods, Analysis & Insights from Training Gopher

    Scaling language models: Methods, analysis & insights from training gopher , author=. arXiv preprint arXiv:2112.11446 , year=

  35. [35]

    International Conference on Machine Learning , pages=

    Retrieval augmented language model pre-training , author=. International Conference on Machine Learning , pages=. 2020 , organization=

  36. [36]

    arXiv preprint arXiv:2201.12431 , year=

    Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval , author=. arXiv preprint arXiv:2201.12431 , year=

  37. [37]

    Recognizing textual entailment: Rational, evaluation and approaches – Erratum , volume=

    Dagan, Ido and Dolan, Bill and Magnini, Bernardo and Roth, Dan , year=. Recognizing textual entailment: Rational, evaluation and approaches – Erratum , volume=. Natural Language Engineering , publisher=. doi:10.1017/S1351324909990234 , number=

  38. [38]

    Advances in neural information processing systems , volume=

    Language models are few-shot learners , author=. Advances in neural information processing systems , volume=

  39. [39]

    Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages=

    Efficient Nearest Neighbor Language Models , author=. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages=

  40. [40]

    ACL/IJCNLP (2) , crossref=

    Xin Zheng and Zhirui Zhang and Junliang Guo and Shujian Huang and Boxing Chen and Weihua Luo and Jiajun Chen , title=. ACL/IJCNLP (2) , crossref=. 2021 , cdate=

  41. [41]

    International Conference on Learning Representations , year=

    Nearest Neighbor Machine Translation , author=. International Conference on Learning Representations , year=

  42. [42]

    Patrick S. H. Lewis and Ethan Perez and Aleksandra Piktus and Fabio Petroni and Vladimir Karpukhin and Naman Goyal and Heinrich Küttler and Mike Lewis and Wen-tau Yih and Tim Rocktäschel and Sebastian Riedel and Douwe Kiela , title=. NeurIPS , crossref=. 2020 , cdate=

  43. [43]

    proceedings of the 25th international conference on world wide web , pages=

    Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering , author=. proceedings of the 25th international conference on world wide web , pages=

  44. [44]

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    Bloom: A 176b-parameter open-access multilingual language model , author=. arXiv preprint arXiv:2211.05100 , year=

  45. [45]

    Pointer Sentinel Mixture Models

    Pointer sentinel mixture models , author=. arXiv preprint arXiv:1609.07843 , year=

  46. [46]

    Surface Form Competition: Why the Highest Probability Answer Isn ' t Always Right

    Holtzman, Ari and West, Peter and Shwartz, Vered and Choi, Yejin and Zettlemoyer, Luke. Surface Form Competition: Why the Highest Probability Answer Isn ' t Always Right. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021. doi:10.18653/v1/2021.emnlp-main.564

  47. [47]

    Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , year=

    Nearest Neighbor Zero-Shot Inference , author=. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , year=

  48. [48]

    2004 , isbn =

    Hu, Minqing and Liu, Bing , title =. 2004 , isbn =. doi:10.1145/1014052.1014073 , booktitle =

  49. [49]

    Advances in neural information processing systems , volume=

    Character-level convolutional networks for text classification , author=. Advances in neural information processing systems , volume=

  50. [50]

    Thirty-first AAAI conference on artificial intelligence , year=

    Conceptnet 5.5: An open multilingual graph of general knowledge , author=. Thirty-first AAAI conference on artificial intelligence , year=

  51. [51]

    proceedings of Sinn und Bedeutung , volume=

    The commitmentbank: Investigating projection in naturally occurring discourse , author=. proceedings of Sinn und Bedeutung , volume=

  52. [52]

    arXiv preprint arXiv:2108.02035 , year=

    Knowledgeable prompt-tuning: Incorporating knowledge into prompt verbalizer for text classification , author=. arXiv preprint arXiv:2108.02035 , year=

  53. [53]

    S im CSE : Simple Contrastive Learning of Sentence Embeddings

    Gao, Tianyu and Yao, Xingcheng and Chen, Danqi. S im CSE : Simple Contrastive Learning of Sentence Embeddings. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021. doi:10.18653/v1/2021.emnlp-main.552

  54. [54]

    Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2) , year=

    BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models , author=. Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2) , year=

  55. [55]

    Training language models to follow instructions with human feedback

    Training language models to follow instructions with human feedback , author=. arXiv preprint arXiv:2203.02155 , year=

  56. [56]

    MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

    Ms marco: A human generated machine reading comprehension dataset , author=. arXiv preprint arXiv:1611.09268 , year=

  57. [57]

    KILT : a Benchmark for Knowledge Intensive Language Tasks

    Petroni, Fabio and Piktus, Aleksandra and Fan, Angela and Lewis, Patrick and Yazdani, Majid and De Cao, Nicola and Thorne, James and Jernite, Yacine and Karpukhin, Vladimir and Maillard, Jean and Plachouras, Vassilis and Rockt. KILT : a Benchmark for Knowledge Intensive Language Tasks. Proceedings of the 2021 Conference of the North American Chapter of th...

  58. [58]

    Findings of the Association for Computational Linguistics: ACL 2022 , pages=

    Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models , author=. Findings of the Association for Computational Linguistics: ACL 2022 , pages=

  59. [59]

    S ent E val: An Evaluation Toolkit for Universal Sentence Representations

    Conneau, Alexis and Kiela, Douwe. S ent E val: An Evaluation Toolkit for Universal Sentence Representations. Proceedings of the Eleventh International Conference on Language Resources and Evaluation ( LREC 2018). 2018

  60. [60]

    V -Measure: A Conditional Entropy-Based External Cluster Evaluation Measure

    Rosenberg, Andrew and Hirschberg, Julia. V -Measure: A Conditional Entropy-Based External Cluster Evaluation Measure. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning ( EMNLP - C o NLL ). 2007

  61. [61]

    Efficient Natural Language Response Suggestion for Smart Reply

    Efficient natural language response suggestion for smart reply , author=. arXiv preprint arXiv:1705.00652 , year=

  62. [62]

    , author=

    Exploring the limits of transfer learning with a unified text-to-text transformer. , author=. J. Mach. Learn. Res. , volume=

  63. [63]

    M eta ICL : Learning to learn in context

    Min, Sewon and Lewis, Mike and Zettlemoyer, Luke and Hajishirzi, Hannaneh. M eta ICL : Learning to Learn In Context. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2022. doi:10.18653/v1/2022.naacl-main.201

  64. [64]

    MTEB: Massive Text Embedding Benchmark

    MTEB: Massive Text Embedding Benchmark , author=. arXiv preprint arXiv:2210.07316 , year=

  65. [65]

    Transactions on Machine Learning Research , year=

    Unsupervised Dense Information Retrieval with Contrastive Learning , author=. Transactions on Machine Learning Research , year=

  66. [66]

    Semantic clustering and convolutional neural network for short text categorization , author=. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers) , pages=

  67. [67]

    IEEE Transactions on Big Data , volume=

    Billion-scale similarity search with gpus , author=. IEEE Transactions on Big Data , volume=. 2019 , publisher=

  68. [68]

    2018 , url=

    Non-Autoregressive Neural Machine Translation , author=. 2018 , url=

  69. [69]

    International Conference on Machine Learning , pages=

    Retrieval-augmented reinforcement learning , author=. International Conference on Machine Learning , pages=. 2022 , organization=

  70. [70]

    Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , pages=

    Learning To Retrieve Prompts for In-Context Learning , author=. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , pages=

  71. [71]

    Khattab, K

    Demonstrate-Search-Predict: Composing retrieval and language models for knowledge-intensive NLP , author=. arXiv preprint arXiv:2212.14024 , year=

  72. [72]

    When not to trust language models: Inves- tigating effectiveness of parametric and non-parametric memories.arXiv preprint arXiv:2212.10511,

    When Not to Trust Language Models: Investigating Effectiveness and Limitations of Parametric and Non-Parametric Memories , author=. arXiv preprint arXiv:2212.10511 , year=

  73. [73]

    Promptcap: Prompt-guided task- aware image captioning

    PromptCap: Prompt-Guided Task-Aware Image Captioning , author=. arXiv preprint arXiv:2211.09699 , year=

  74. [74]

    Prompting GPT-3 To Be Reliable , author=. Proc. of ICLR , year=

  75. [75]

    Generate rather than retrieve: Large language models are strong context generators , author=. Proc. of ICLR , year=

  76. [76]

    , author=

    Mask-Predict: Parallel Decoding of Conditional Masked Language Models. , author=. 2019 , booktitle=

  77. [77]

    2019 , booktitle=

    Levenshtein Transformer , author=. 2019 , booktitle=

  78. [78]

    Proc.\ of EMNLP , year=

    Hint-Based Training for Non-Autoregressive Machine Translation , author=. Proc.\ of EMNLP , year=

  79. [79]

    Fast Structured Decoding for Sequence Models , author=. Proc. of NeurIPS , year=

  80. [80]

    Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement , author=. Proc. of EMNLP , year=

Showing first 80 references.