Dense Passage Retrieval for Open-Domain Question Answering
Pith reviewed 2026-05-15 21:21 UTC · model grok-4.3
The pith
Dense vector embeddings from a dual-encoder model outperform BM25 by 9-19 absolute points in top-20 passage retrieval accuracy for open-domain question answering.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Open-domain question answering relies on efficient passage retrieval, traditionally done with sparse models such as TF-IDF or BM25. We demonstrate that retrieval can instead be implemented using dense representations alone. These embeddings are learned from a small number of questions and passages using a simple dual-encoder framework. When tested on multiple open-domain QA datasets, the dense retriever outperforms a strong Lucene-BM25 system by 9 to 19 percent absolute in top-20 passage retrieval accuracy. This retrieval improvement allows our end-to-end QA system to reach new state-of-the-art performance on the benchmarks.
What carries the argument
A dual-encoder model that independently embeds questions and passages into a shared dense vector space for similarity-based retrieval.
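The retrieval mechanics can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the vectors below are fixed stand-ins for the outputs of the two trained BERT-style encoders, and the passage topics are invented for the example.

```python
import numpy as np

# Toy stand-ins for the outputs of the two encoders E_Q and E_P.
# In DPR these come from separate trained networks; here they are fixed.
passage_vecs = np.array([
    [0.9, 0.1, 0.0],   # passage 0: hypothetically about capitals
    [0.0, 0.8, 0.2],   # passage 1: hypothetically about rivers
    [0.1, 0.1, 0.9],   # passage 2: hypothetically about mountains
])

def retrieve(question_vec, passage_vecs, k=2):
    """Score every passage by inner product with the question embedding
    and return the indices of the k highest-scoring passages."""
    scores = passage_vecs @ question_vec
    return np.argsort(-scores)[:k]

q = np.array([0.85, 0.15, 0.05])   # hypothetical question embedding
print(retrieve(q, passage_vecs))   # → [0 2]
```

Because questions and passages are encoded independently, all passage vectors can be embedded and indexed offline; only the question is encoded at query time.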
If this is right
- Higher top-20 retrieval accuracy leads to more relevant contexts being available for the reader module in QA systems.
- The method can be integrated into existing QA pipelines to boost overall accuracy.
- It establishes new performance records on multiple standard open-domain QA benchmarks.
- Dense retrieval becomes a viable practical alternative to sparse indexing methods.
Where Pith is reading between the lines
- Neural dense retrieval may reduce dependence on exact term overlap, capturing semantic matches instead.
- The approach could be extended by combining dense and sparse signals for hybrid retrieval.
- Generalization from small training sets implies that the model learns robust semantic features applicable to unseen queries.
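The hybrid dense-plus-sparse idea mentioned above is often realized as a weighted sum of the two score lists. A minimal sketch, with min-max normalization and a hypothetical mixing weight `lam` (the score values are invented for illustration):

```python
import numpy as np

def hybrid_scores(dense, sparse, lam=0.5):
    """Combine dense (inner-product) and sparse (BM25) scores for the
    same candidate passages. Each list is min-max normalized first so
    the mixing weight lam is comparable across the two score scales."""
    def norm(x):
        x = np.asarray(x, dtype=float)
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)
    return norm(dense) + lam * norm(sparse)

dense = [0.78, 0.13, 0.145]   # dual-encoder inner products (toy values)
bm25  = [1.2, 7.5, 0.3]       # lexical BM25 scores (toy values)
ranked = np.argsort(-hybrid_scores(dense, bm25))
```

The weight `lam` would be tuned on a development set; the paper itself reports results for a simple combination of this kind.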
Load-bearing premise
Embeddings trained on a limited set of questions and passages will generalize well to the broader range of queries and documents seen during testing.
What would settle it
Observing no improvement or a decrease in top-20 passage retrieval accuracy for the dense model compared to BM25 on a standard open-domain QA test set would falsify the performance claim.
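The metric this test turns on is top-k retrieval accuracy: the fraction of questions for which at least one of the top k retrieved passages contains the answer. A minimal sketch (passage ids and gold sets below are invented):

```python
def top_k_accuracy(retrieved, gold, k=20):
    """retrieved: one ranked list of passage ids per question.
    gold: one set of answer-bearing passage ids per question.
    Returns the fraction of questions with a hit in the top k."""
    hits = sum(
        1 for ranked, answers in zip(retrieved, gold)
        if any(pid in answers for pid in ranked[:k])
    )
    return hits / len(retrieved)

# Two questions: the first has a gold passage at rank 2, the second
# has no gold passage among the retrieved candidates.
acc = top_k_accuracy([[5, 9, 2], [4, 8, 1]], [{9}, {7}], k=3)  # 0.5
```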
Original abstract
Open-domain question answering relies on efficient passage retrieval to select candidate contexts, where traditional sparse vector space models, such as TF-IDF or BM25, are the de facto method. In this work, we show that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dual-encoder framework. When evaluated on a wide range of open-domain QA datasets, our dense retriever outperforms a strong Lucene-BM25 system largely by 9%-19% absolute in terms of top-20 passage retrieval accuracy, and helps our end-to-end QA system establish new state-of-the-art on multiple open-domain QA benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a dense passage retrieval method for open-domain QA based on a dual-encoder framework that learns embeddings from a modest number of question-passage pairs. It reports that the resulting retriever outperforms a strong Lucene-BM25 baseline by 9-19% absolute in top-20 accuracy across several QA datasets and, when plugged into an end-to-end reader, yields new state-of-the-art results on multiple open-domain QA benchmarks.
Significance. If the empirical gains hold, the work provides a practical demonstration that supervised dense retrieval can substantially surpass classical sparse methods without requiring hand-crafted features or inverted indexes, thereby shifting the default retrieval component in open-domain QA pipelines toward learned embeddings.
major comments (1)
- [Section 3 (Training) and experimental setup] The training procedure (negative sampling strategy and construction of the training set) is not ablated; without these controls it remains possible that the reported 9-19% gains partly reflect dataset-specific selection effects rather than the dual-encoder architecture itself.
minor comments (2)
- [Abstract] The abstract states gains of '9%-19%' but does not report per-dataset numbers, standard deviations, or confidence intervals; a table with these statistics would make the strength of the improvement clearer.
- [Section 2] Notation for the dual-encoder scoring function and the contrastive loss should be introduced once in a single equation block rather than scattered across prose.
Simulated Author's Rebuttal
We thank the referee for the positive summary and recommendation for minor revision. We address the major comment below.
Point-by-point responses
Referee: [Section 3 (Training) and experimental setup] The training procedure (negative sampling strategy and construction of the training set) is not ablated; without these controls it remains possible that the reported 9-19% gains partly reflect dataset-specific selection effects rather than the dual-encoder architecture itself.
Authors: We agree that an explicit ablation of negative sampling and training-set construction would strengthen the claims. Our main experiments compare the trained dual-encoder against a strong unsupervised BM25 baseline on the same corpora, which already isolates the benefit of learned dense representations. Nevertheless, to directly address the concern, we will add a new ablation subsection in the revised manuscript that reports retrieval accuracy when training with (i) random negatives, (ii) BM25-retrieved hard negatives, and (iii) varying numbers of negatives per question. These additional controls will clarify how much of the observed 9-19% improvement is attributable to the dual-encoder architecture versus the particular negative-sampling procedure.
Revision: yes
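The training objective at issue in this exchange is a negative log-likelihood over inner-product similarities: the positive passage competes against sampled negatives in a softmax. A minimal sketch with plain lists standing in for encoder outputs (the vectors are toy values):

```python
import math

def nll_loss(q_vec, pos_vec, neg_vecs):
    """Negative log-likelihood of the positive passage under a softmax
    over inner-product similarities, as in DPR-style training. Vectors
    here are plain lists standing in for encoder outputs."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    pos = math.exp(dot(q_vec, pos_vec))
    negs = sum(math.exp(dot(q_vec, n)) for n in neg_vecs)
    return -math.log(pos / (pos + negs))

# A "hard" negative (closer to the question) produces a larger loss
# than an "easy" one, which is why the choice of negatives matters.
easy = nll_loss([1.0, 0.0], [0.9, 0.1], [[-0.9, 0.1]])
hard = nll_loss([1.0, 0.0], [0.9, 0.1], [[0.8, 0.2]])
```

This is what makes the referee's point substantive: with the same architecture, swapping random negatives for BM25-retrieved hard negatives changes the gradient signal and can shift retrieval accuracy noticeably.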
Circularity Check
No significant circularity identified
Full rationale
The paper trains a dual-encoder model via standard contrastive loss on QA pairs to produce dense passage embeddings, then measures top-k retrieval accuracy on held-out test portions of standard benchmarks (Natural Questions, TriviaQA, etc.). No equation or claim reduces the reported 9-19% gains to a fitted parameter by construction, nor does any load-bearing step rely on a self-citation chain that is itself unverified. The evaluation is ordinary supervised held-out testing; the derivation chain (indexing, retrieval, end-to-end QA) remains externally falsifiable and does not collapse into its inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Embeddings learned via contrastive loss on QA pairs will place relevant passages near their questions in vector space.
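For reference, this assumption can be stated with the paper's own similarity function and training loss, where $E_Q$ and $E_P$ are the question and passage encoders:

```latex
% Similarity: inner product of independently encoded question and passage
\mathrm{sim}(q, p) = E_Q(q)^{\top} E_P(p)

% Loss for question q_i with positive p_i^+ and negatives p_{i,1}^-, ..., p_{i,n}^-
L\bigl(q_i, p_i^+, p_{i,1}^-, \dots, p_{i,n}^-\bigr)
  = -\log \frac{e^{\mathrm{sim}(q_i,\, p_i^+)}}
               {e^{\mathrm{sim}(q_i,\, p_i^+)} + \sum_{j=1}^{n} e^{\mathrm{sim}(q_i,\, p_{i,j}^-)}}
```

The domain assumption is exactly that minimizing this loss on training pairs yields a geometry in which unseen relevant passages also score highly.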
Forward citations
Cited by 19 Pith papers
- BOOKMARKS: Efficient Active Storyline Memory for Role-playing
  BOOKMARKS introduces searchable bookmarks as reusable answers to storyline questions, enabling active initialization and passive synchronization for more consistent role-playing agent memory than recurrent summarization.
- Do We Still Need GraphRAG? Benchmarking RAG and GraphRAG for Agentic Search Systems
  Agentic search narrows the gap between dense RAG and GraphRAG but does not remove GraphRAG's advantage on complex multi-hop reasoning.
- Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models
  A panel of smaller diverse LLMs outperforms a single large model as an evaluator of generations, showing less intra-model bias and over 7x lower cost.
- C-Pack: Packed Resources For General Chinese Embeddings
  C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
  RAG models set new state-of-the-art results on open-domain QA by retrieving Wikipedia passages and conditioning a generative model on them, while also producing more factual text than parametric baselines.
- Task-Adaptive Embedding Refinement via Test-time LLM Guidance
  Test-time LLM feedback refines query embeddings to deliver up to 25% relative gains on zero-shot literature search, intent detection, and related benchmarks.
- From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction
  Schema-aware iterative extraction turns AI memory into a verified system of record, reaching 90-97% accuracy on extraction and end-to-end memory benchmarks where retrieval baselines score 80-87%.
- EHRAG: Bridging Semantic Gaps in Lightweight GraphRAG via Hybrid Hypergraph Construction and Retrieval
  EHRAG constructs structural hyperedges from sentence co-occurrence and semantic hyperedges from entity embedding clusters, then applies hybrid diffusion plus topic-aware PPR to retrieve top-k documents, outperforming ...
- Knowledge Is Not Static: Order-Aware Hypergraph RAG for Language Models
  OKH-RAG represents knowledge as ordered hyperedges and retrieves coherent interaction sequences via a learned transition model, outperforming permutation-invariant RAG baselines on order-sensitive QA tasks.
- NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models
  NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.
- Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
  Self-RAG trains LLMs to adaptively retrieve passages on demand and self-critique using reflection tokens, outperforming ChatGPT and retrieval-augmented Llama2 on QA, reasoning, and fact verification.
- MemGPT: Towards LLMs as Operating Systems
  MemGPT uses OS-inspired virtual context management to extend LLM context windows for large document analysis and long-term multi-session chat.
- LaMDA: Language Models for Dialog Applications
  LaMDA shows that fine-tuning on human-value annotations and consulting external knowledge sources significantly improves safety and factual grounding in large dialog models beyond what scaling alone achieves.
- Unsupervised Dense Information Retrieval with Contrastive Learning
  Contrastive learning trains unsupervised dense retrievers that beat BM25 on most BEIR datasets and support cross-lingual retrieval across scripts.
- How Much Knowledge Can You Pack Into the Parameters of a Language Model?
  Fine-tuned language models store knowledge in parameters to answer questions competitively with retrieval-based open-domain QA systems.
- Not All RAGs Are Created Equal: A Component-Wise Empirical Study for Software Engineering Tasks
  Retriever-side choices, particularly the retrieval algorithm, exert more influence on RAG performance than generator selection across code generation, summarization, and repair tasks.
- Learning from AVA: Early Lessons from a Curated and Trustworthy Generative AI for Policy and Development Research
  AVA is a specialized GenAI platform for development policy research that provides verifiable syntheses from World Bank reports and is associated with 2.4-3.9 hours of weekly time savings in a large-scale user evaluation.
- Reducing Redundancy in Retrieval-Augmented Generation through Chunk Filtering
  Entity-based chunk filtering reduces RAG vector index size by 25-36% with retrieval quality near baseline levels.
- Unified Supervision for Walmart's Sponsored Search Retrieval via Joint Semantic Relevance and Behavioral Engagement Modeling
  A hybrid supervision method for bi-encoder retrievers combines graded relevance from teacher models, production retrieval priors, and selective engagement to improve relevance and NDCG over Walmart's current sponsored...