pith. machine review for the scientific record.

arxiv: 2604.20417 · v1 · submitted 2026-04-22 · 💻 cs.IR · cs.AI

Semantic Recall for Vector Search

Pith reviewed 2026-05-09 23:28 UTC · model grok-4.3

classification 💻 cs.IR · cs.AI
keywords semantic recall · vector search · approximate nearest neighbor · retrieval quality · embedding datasets · tolerant recall · information retrieval

The pith

Semantic recall is a new metric for vector search that only counts retrieval of semantically relevant nearest neighbors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes semantic recall as a way to judge approximate nearest neighbor algorithms without penalizing them for missing objects that happen to be close in embedding space but irrelevant to the query. Standard recall treats every nearest neighbor as equally important, yet the authors find that many queries have few relevant results among their geometric neighbors, a pattern common in embedding datasets. By restricting evaluation to objects that exact search could retrieve and that carry semantic meaning, the metric gives a clearer signal of whether an algorithm is retrieving what users actually want. They also offer tolerant recall as a practical stand-in when full semantic labels are unavailable. If correct, this shifts how retrieval quality is measured and optimized, favoring algorithms that achieve good results at lower cost.
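The restriction the paper describes can be sketched in a few lines. The relevance labels below are hypothetical stand-ins for whatever labeling the authors use, and the handling of queries with no relevant exact neighbor is an assumption, since the abstract does not specify that edge case.

```python
def standard_recall(retrieved, exact_nn):
    """Fraction of the exact k nearest neighbors that the ANN search returned."""
    return len(set(retrieved) & set(exact_nn)) / len(exact_nn)

def semantic_recall(retrieved, exact_nn, relevant):
    """Recall computed only over exact neighbors that are semantically relevant.

    Missing an irrelevant geometric neighbor does not lower the score.
    Queries with no relevant exact neighbor score 1.0 here (an assumption;
    the paper may treat this case differently).
    """
    targets = set(exact_nn) & set(relevant)
    if not targets:
        return 1.0
    return len(set(retrieved) & targets) / len(targets)

# Toy query: 5 exact neighbors, of which only ids 3 and 1 are relevant.
exact = [3, 7, 1, 9, 4]
relevant = {3, 1}

# ANN misses the irrelevant id 7: standard recall drops, semantic does not.
print(standard_recall([3, 1, 9, 4], exact))            # 0.8
print(semantic_recall([3, 1, 9, 4], exact, relevant))  # 1.0

# ANN misses the relevant id 1: only then does semantic recall register a loss.
print(standard_recall([3, 7, 9, 4], exact))            # 0.8
print(semantic_recall([3, 7, 9, 4], exact, relevant))  # 0.5
```

The two cases make the paper's point concrete: standard recall scores both misses identically at 0.8, while semantic recall distinguishes the harmless miss from the harmful one.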

Core claim

We introduce Semantic Recall, a novel metric to assess the quality of approximate nearest neighbor search algorithms by considering only semantically relevant objects that are theoretically retrievable via exact nearest neighbor search. Unlike traditional recall, semantic recall does not penalize algorithms for failing to retrieve objects that are semantically irrelevant to the query, even if those objects are among their nearest neighbors. We demonstrate that semantic recall is particularly useful for assessing retrieval quality on queries that have few relevant results among their nearest neighbors, a scenario we uncover to be common within embedding datasets. Additionally, we introduce Tolerant Recall, a proxy metric that approximates semantic recall when semantically relevant objects cannot be identified.

What carries the argument

Semantic Recall, the metric that evaluates retrieval quality solely on semantically relevant objects reachable by exact nearest neighbor search.

If this is right

  • Algorithms can be tuned to retrieve fewer but more relevant neighbors, improving cost-quality tradeoffs without inflating recall scores.
  • Evaluation on embedding datasets will reveal that many current high-recall methods perform worse under semantic recall on queries with sparse relevant results.
  • Tolerant recall provides a usable approximation when semantic labels are absent, enabling immediate application of the idea.
  • Benchmarking practices in vector search shift toward metrics that separate geometric proximity from semantic utility.
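The abstract does not say how tolerant recall is computed. One plausible label-free proxy, shown here purely as an illustrative stand-in and not as the paper's definition, is a distance-ratio relaxation: a retrieved point counts as a hit if its distance to the query is within a factor (1 + eps) of the farthest exact neighbor's distance.

```python
import math

def tolerant_recall(query, retrieved, exact_nn, eps=0.05):
    """Hypothetical distance-based proxy (NOT the paper's definition):
    a retrieved point is a hit if it lies within (1 + eps) times the
    k-th exact neighbor's distance from the query."""
    threshold = (1.0 + eps) * max(math.dist(query, p) for p in exact_nn)
    hits = sum(1 for p in retrieved if math.dist(query, p) <= threshold)
    return min(1.0, hits / len(exact_nn))

# Two exact neighbors at distance 1.0; the ANN returns a near-miss at 1.02.
q = (0.0, 0.0)
exact = [(1.0, 0.0), (0.0, 1.0)]
print(tolerant_recall(q, [(1.02, 0.0), (0.0, 1.0)], exact, eps=0.05))  # 1.0
print(tolerant_recall(q, [(1.02, 0.0), (0.0, 1.0)], exact, eps=0.01))  # 0.5
```

The appeal of a proxy in this family is that it needs only distances, which every ANN benchmark already computes, so it can be applied immediately where semantic labels are absent.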

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Search system designers could de-emphasize exact embedding distance in favor of semantic filters, potentially changing index construction.
  • The same distinction between relevance and proximity may apply to other similarity-based tasks such as recommendation or clustering.
  • Future work could test whether training embeddings explicitly to increase the density of relevant neighbors raises semantic recall ceilings.

Load-bearing premise

Semantically relevant objects can be reliably identified or approximated for the queries in typical embedding datasets, and missing irrelevant neighbors should not count against performance.

What would settle it

An experiment that measures user satisfaction or task success on a set of real queries and shows that algorithms ranked higher by semantic recall do not produce better outcomes than those ranked higher by traditional recall.

Figures

Figures reproduced from arXiv: 2604.20417 by Albert Angel, Ioanna Tsakalidou, Jiří Iša, Leonardo Kuffo, Rastislav Lenhardt, Roberta De Viti.

Figure 1: Semantic recall only considers the semantically … [image omitted]
Figure 2: Distribution of semantically relevant neighbors per … [image omitted]
Figure 5: Recall vs cost: Cost rises sharply as recall increases. [image omitted]
Figure 4: Distribution of the error % between scores computed … [image omitted]
Figure 6: Distribution of the number of semantic neighbors … [image omitted]
Original abstract

We introduce Semantic Recall, a novel metric to assess the quality of approximate nearest neighbor search algorithms by considering only semantically relevant objects that are theoretically retrievable via exact nearest neighbor search. Unlike traditional recall, semantic recall does not penalize algorithms for failing to retrieve objects that are semantically irrelevant to the query, even if those objects are among their nearest neighbors. We demonstrate that semantic recall is particularly useful for assessing retrieval quality on queries that have few relevant results among their nearest neighbors, a scenario we uncover to be common within embedding datasets. Additionally, we introduce Tolerant Recall, a proxy metric that approximates semantic recall when semantically relevant objects cannot be identified. We empirically show that our metrics are more effective indicators of retrieval quality, and that optimizing search algorithms for these metrics can lead to improved cost-quality tradeoffs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Circularity Check

0 steps flagged

No circularity: metric is a direct definitional restriction with independent empirical support

Full rationale

The paper defines semantic recall explicitly as standard recall computed only over the subset of nearest neighbors that are semantically relevant to the query. This is a straightforward restriction rather than a reduction of any derived quantity back to fitted parameters or self-referential equations. The observation that queries with few relevant neighbors are common is presented as an empirical finding obtained via external labeling, not as a mathematical consequence derived from the metric itself. Tolerant Recall is introduced as a separate proxy approximation without any shown dependency that loops back to the primary metric's outputs. No self-citations, uniqueness theorems, or ansatzes are invoked in the abstract or description to justify core claims. The derivation chain remains self-contained against external benchmarks for relevance labeling.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on abstract only; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes semantic relevance labels or approximations are available or estimable.

axioms (1)
  • Domain assumption: Semantically relevant objects can be identified independently of the nearest-neighbor geometry.
    Required for the metric definition to be computable and for the claim that many nearest neighbors are irrelevant.

pith-pipeline@v0.9.0 · 5444 in / 1071 out tokens · 31138 ms · 2026-05-09T23:28:47.870342+00:00 · methodology


Reference graph

Works this paper leans on

35 extracted references · 13 canonical work pages · 6 internal anchors

  [1] Mihir Agarwal, Ankit Garg, Neeraj Kayal, Kirankumar Shiragur, et al. 2026. On Strengths and Limitations of Single-Vector Embeddings. arXiv preprint arXiv:2603.29519 (2026).
  [2] Martin Aumüller, Erik Bernhardsson, and Alexander Faithfull. 2020. ANN-Benchmarks: A benchmarking tool for approximate nearest neighbor algorithms. Information Systems 87 (2020), 101374.
  [3] Federico Cabitza, Andrea Campagner, and Valerio Basile. 2023. Toward a perspectivist turn in ground truthing for predictive computing. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37. 6860–6868.
  [4] Manos Chatzakis, Yannis Papakonstantinou, and Themis Palpanas. 2025. DARTH: Declarative Recall Through Early Termination for Approximate Nearest Neighbor Search. Proceedings of the ACM on Management of Data 3, 4 (2025), 1–26.
  [5] Tingyang Chen, Cong Fu, Jiahua Wu, Haotian Wu, Hua Fan, Xiangyu Ke, Yunjun Gao, Yabo Ni, and Anxiang Zeng. 2025. Reveal Hidden Pitfalls and Navigate Next Generation of Vector Similarity Search from Task-Centric Views. arXiv preprint arXiv:2512.12980 (2025).
  [6] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025).
  [7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 4171–4186.
  [8] Yifan Ding, Nicholas Botzer, and Tim Weninger. 2022. Posthoc verification and the fallibility of the ground truth. In Proceedings of the First Workshop on Dynamic Adversarial Data Collection. 23–29.
  [9] Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. 2025. The Faiss library. IEEE Transactions on Big Data (2025).
  [10] Jianyang Gao, Yutong Gou, Yuexuan Xu, Yongyi Yang, Cheng Long, and Raymond Chi-Wing Wong. 2025. Practical and asymptotically optimal quantization of high-dimensional vectors in Euclidean space for approximate nearest neighbor search. Proceedings of the ACM on Management of Data 3, 3 (2025), 1–26.
  [11] Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821 (2021).
  [12] Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, Greg Kochanski, John Karro, and D. Sculley. 2017. Google Vizier: A Service for Black-Box Optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1487–1495. doi:10.1145/3097983.3098043.
  [13] Ruiqi Guo, Philip Sun, Erik Lindgren, Quan Geng, David Simcha, Felix Chern, and Sanjiv Kumar. 2020. Accelerating large-scale inference with anisotropic vector quantization. In International Conference on Machine Learning. PMLR, 3887–3896.
  [14] Elias Jääsaari, Ville Hyvönen, Matteo Ceccarello, Teemu Roos, and Martin Aumüller. 2025. VIBE: Vector Index Benchmark for Embeddings. arXiv preprint arXiv:2505.17810 (2025).
  [15] Leonardo Kuffo, Elena Krippner, and Peter Boncz. 2025. PDX: A Data Layout for Vector Similarity Search. Proceedings of the ACM on Management of Data 3, 3 (2025), 1–26.
  [16] Leonardo Kuffo, Ioanna Tsakalidou, Roberta De Viti, Albert Angel, Jiří Iša, and Rastislav Lenhardt. 2026. Reproducibility: Semantic Recall for Vector Search. Google Colaboratory notebook. https://colab.research.google.com/drive/1cUnvdRP7CjeJvx5eaAzjA-J5d_d3oUk7.
  [17] Jinhyuk Lee, Feiyang Chen, Sahil Dua, Daniel Cer, Madhuri Shanbhogue, Iftekhar Naim, Gustavo Hernández Ábrego, Zhe Li, Kaifeng Chen, Henrique Schechter Vera, et al. 2025. Gemini embedding: Generalizable embeddings from Gemini. arXiv preprint arXiv:2503.07891 (2025).
  [18] Xianming Li, Aamir Shakir, Rui Huang, Julius Lipp, and Jing Li. 2025. ProRank: Prompt Warmup via Reinforcement Learning for Small Language Models Reranking. arXiv preprint arXiv:2506.03487 (2025).
  [19] Yu A. Malkov and Dmitry A. Yashunin. 2018. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence 42, 4 (2018), 824–836.
  [20] Jason Mohoney, Devesh Sarda, Mengze Tang, Shihabur Rahman Chowdhury, Anil Pacaci, Ihab F. Ilyas, Theodoros Rekatsinas, and Shivaram Venkataraman. 2025. Quake: Adaptive Indexing for Vector Search. arXiv preprint arXiv:2506.03437 (2025).
  [21] James Jie Pan, Jianguo Wang, and Guoliang Li. 2024. Survey of vector database management systems. The VLDB Journal 33, 5 (2024), 1591–1615.
  [22] Yannis Papakonstantinou, Alan Li, Ruiqi Guo, Sanjiv Kumar, and Phil Sun. 2024. ScaNN for AlloyDB. https://services.google.com/fh/files/misc/scann_for_alloydb_whitepaper.pdf.
  [23] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1532–1543.
  [24] Jeffrey Pound, Floris Chabert, Arjun Bhushan, Ankur Goswami, Anil Pacaci, and Shihabur Rahman Chowdhury. 2025. MicroNN: An On-device Disk-resident Updatable Vector Database. In Companion of the 2025 International Conference on Management of Data. 608–621.
  [25] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21, 140 (2020), 1–67.
  [26] Nils Reimers, Elliott Choi, Amr Kayid, Alekhya Nandula, Manoj Govindassamy, and Abdullah Elkady. 2023. Introducing Embed v3. Cohere Blog, November 2, 2023. https://cohere.com/blog/introducing-embed-v3.
  [27] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084 (2019).
  [28] Harsha Vardhan Simhadri, George Williams, Martin Aumüller, Matthijs Douze, Artem Babenko, Dmitry Baranchuk, Qi Chen, Lucas Hosseini, Ravishankar Krishnaswamy, Gopal Srinivasa, et al. 2022. Results of the NeurIPS'21 challenge on billion-scale approximate nearest neighbor search. In NeurIPS 2021 Competitions and Demonstrations Track. PMLR, 177–189.
  [29] Philip Sun, Ruiqi Guo, and Sanjiv Kumar. 2023. Automating Nearest Neighbor Search Configuration with Constrained Optimization. In The Eleventh International Conference on Learning Representations (ICLR 2023). OpenReview.net. https://openreview.net/forum?id=KfptQCEKVW4.
  [30] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, et al. 2023. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023).
  [31] Wenping Wang, Yunxi Guo, Chiyao Shen, Shuai Ding, Guangdeng Liao, Hao Fu, and Pramodh Karanth Prabhakar. 2023. Integrity and junkiness failure handling for embedding-based retrieval: A case study in social network search. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. 3250–3254.
  [32] Zikai Wang, Qianxi Zhang, Baotong Lu, Qi Chen, and Cheng Tan. 2025. Towards Robustness: A Critique of Current Vector Database Assessments. arXiv preprint arXiv:2507.00379 (2025).
  [33] Orion Weller, Michael Boratko, Iftekhar Naim, and Jinhyuk Lee. 2025. On the theoretical limitations of embedding-based retrieval. arXiv preprint arXiv:2508.21038 (2025).
  [34] Xinyu Zhang, Nandan Thakur, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Mehdi Rezagholizadeh, and Jimmy Lin. 2023. MIRACL: A multilingual retrieval dataset covering 18 diverse languages. Transactions of the Association for Computational Linguistics 11 (2023), 1114–1131.
  [35] Keneilwe Zuva and Tranos Zuva. 2012. Evaluation of information retrieval systems. International Journal of Computer Science & Information Technology 4, 3 (2012), 35.