pith. sign in

arxiv: 2606.00734 · v1 · pith:HGCOASDDnew · submitted 2026-05-30 · 💻 cs.DB

EMA: Approximate Nearest Neighbor Search with General Attribute Filtering and Dynamic Updates

Pith reviewed 2026-06-28 17:56 UTC · model grok-4.3

classification 💻 cs.DB
keywords filtering approximate nearest neighborvector databaseattribute filteringdynamic updatesgraph indexpredicate queriesmulti-predicate search
0
0 comments X

The pith

EMA attaches compact markers to graph edges to enable fast general attribute filtering in approximate nearest neighbor search with dynamic updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents EMA, a filtering ANN algorithm for multi-predicate queries over mixed numerical and categorical attributes in vector databases. It attaches markers as compact summaries to graph edges to give conservative predicate- and geometric-aware guidance that guarantees zero false negatives at the marker level. Query processing uses marker-augmented joint search plus bounded edge recovery to filter results while keeping the graph navigable, and the design also supports efficient dynamic updates. A sympathetic reader would care because the method targets practical workloads in recommendation systems, retrieval-augmented generation, and agent memory where prior approaches either limit predicate types or incur high overhead.

Core claim

EMA achieves 1.68x--12.25x speedup over state-of-the-art general filtering ANN methods across diverse workloads by introducing Markers as compact summaries attached to graph edges that provide conservative predicate- and geometric-aware guidance with zero false negatives at the Marker level, then performing Marker-augmented joint search with a bounded edge recovery mechanism that enables efficient filtering while preserving graph navigability and supporting dynamic updates.

What carries the argument

Markers attached to graph edges, compact summaries that supply conservative predicate- and geometric-aware guidance with zero false negatives at the marker level.

If this is right

  • EMA supports multi-predicate queries over mixed numerical and categorical attributes.
  • It enables efficient dynamic updates to the index.
  • It delivers 1.68x--12.25x speedup over prior general filtering ANN methods on diverse workloads.
  • Marker-augmented joint search with bounded recovery preserves graph navigability during filtering.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Vector databases could support richer real-time filtering for recommendation and retrieval-augmented generation without separate index structures.
  • The marker technique might apply to other proximity-graph indexes that currently lack general predicate support.
  • Lower construction and memory costs could make filtered ANN viable at larger scales than current predicate-agnostic methods allow.

Load-bearing premise

Markers attached to graph edges provide conservative predicate- and geometric-aware guidance with zero false negatives at the Marker level, enabling efficient filtering while preserving graph navigability.

What would settle it

An experiment on a workload where marker-guided search returns a different result set from exhaustive filtered search, or where measured speedups fall below the claimed range across the tested workloads.

Figures

Figures reproduced from arXiv: 2606.00734 by Baotong Lu, Chenhao Ma, James Cheng, Mocheng Li.

Figure 2
Figure 2. Figure 2: EMA workflow. Codebook Marker [0,3) [3,6) [6,9) {a,d} {b,e} {c,f} 1 0 0 1 0 1 Vector w 2 {a,c} Vector x 5 {d} Vector v 6 {c} 1 0 1 1 0 1 Gather Gather [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Illustration of Codebook-based Marker encoding [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: QPS for label+range multi-predicate query at 10%-100% selectivity with 95% recall@10. [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: QPS for label+range multi-predicate query at 1%-10% selectivity with 95% recall@10. [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: QPS for composed multi-predicate queries at 1%–10% selectivity with 95% recall@10. [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: QPS for range query at 95% recall@10 [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗
Figure 9
Figure 9. Figure 9: QPS of EMA under different update operations at 95% recall@10. [PITH_FULL_IMAGE:figures/full_fig_p012_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: QPS in OCQ at 95% recall@10 [PITH_FULL_IMAGE:figures/full_fig_p013_10.png] view at source ↗
read the original abstract

Filtering Approximate Nearest Neighbor (FANN) search is a critical and emerging task for strengthening the query capability of vector databases, supporting applications such as recommendation systems, retrieval-augmented generation (RAG), and agent memory. However, most existing methods are limited to range or label filtering, often incurring unacceptable index construction time and memory overhead. Predicate-agnostic approaches further struggle to handle a wide range of predicate selectivities effectively. In this paper, we propose EMA, a filtering ANN algorithm that supports multi-predicate queries over mixed numerical and categorical attributes, and efficient dynamic updates. EMA introduces Markers as compact summaries attached to graph edges, providing conservative predicate- and geometric-aware guidance with zero false negatives at the Marker level. During query processing, EMA performs Marker-augmented joint search with a bounded edge recovery mechanism, enabling efficient filtering while preserving graph navigability. Extensive experiments demonstrate that EMA achieves 1.68x--12.25x speedup over state-of-the-art general filtering ANN methods across diverse workloads.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes EMA, a filtering ANN algorithm supporting multi-predicate queries over mixed numerical and categorical attributes together with dynamic updates. It introduces Markers as compact summaries attached to graph edges that provide conservative predicate- and geometric-aware guidance with zero false negatives at the Marker level, combined with Marker-augmented joint search and bounded edge recovery to preserve navigability. The central claim is that EMA delivers 1.68x--12.25x speedup over state-of-the-art general filtering ANN methods across diverse workloads.

Significance. If the Marker construction and bounded-recovery mechanism can be shown to deliver the claimed zero-false-negative property while maintaining graph navigability, and if the reported speedups are reproducible, the work would meaningfully advance predicate-aware ANN search in vector databases for RAG, recommendation, and agent-memory workloads.

major comments (1)
  1. Abstract: the claim of 1.68x--12.25x speedup is presented without any description of baselines, workloads, error bars, or exclusion criteria, preventing assessment of whether the data supports the central performance claim.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive comment. We address it point-by-point below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: Abstract: the claim of 1.68x--12.25x speedup is presented without any description of baselines, workloads, error bars, or exclusion criteria, preventing assessment of whether the data supports the central performance claim.

    Authors: We agree that the abstract, as a concise summary, omits the specific experimental details needed for immediate assessment. The baselines are the state-of-the-art general filtering ANN methods (detailed in Section 2 and compared in Section 5), the workloads cover diverse predicate selectivities and attribute mixes on standard vector datasets, and results include error bars from repeated runs with no exclusion of outliers. To improve clarity and allow readers to evaluate the claim without reading the full paper, we will revise the abstract to briefly reference the evaluation setup (e.g., 'over state-of-the-art baselines on diverse workloads with reported error bars'). revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The provided abstract and description contain no equations, derivations, fitted parameters, or mathematical claims that could reduce to inputs by construction. The work describes an algorithmic construction (Markers attached to graph edges for conservative guidance) and reports empirical speedups, with no self-definitional steps, fitted-input predictions, or load-bearing self-citations visible in the text. The central claims rest on experimental validation rather than any closed-form reduction, making the derivation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Abstract only; no explicit free parameters, axioms, or invented entities beyond the high-level description of Markers can be identified.

invented entities (1)
  • Markers no independent evidence
    purpose: compact summaries attached to graph edges for predicate- and geometric-aware guidance
    New component introduced to enable filtering without false negatives at the Marker level

pith-pipeline@v0.9.1-grok · 5708 in / 1063 out tokens · 32753 ms · 2026-06-28T17:56:01.574016+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

58 extracted references · 13 canonical work pages · 5 internal anchors

  1. [1]

    Martin Aumüller, Erik Bernhardsson, and Alexander Faithfull. 2020. ANN- Benchmarks: A benchmarking tool for approximate nearest neighbor algorithms. Information Systems87 (2020), 101374

  2. [2]

    Jon Louis Bentley. 1975. Multidimensional binary search trees used for associative searching.Commun. ACM18, 9 (1975), 509–517

  3. [3]

    Angela Bonifati, Wim Martens, and Thomas Timm. 2017. An analytical study of large SPARQL query logs.arXiv preprint arXiv:1708.00363(2017)

  4. [4]

    Lawrence Cayton. 2008. Fast nearest neighbor retrieval for bregman divergences. InProceedings of the 25th international conference on Machine learning. 112–119

  5. [5]

    Karan Desai, Gaurav Kaul, Zubin Aysola, and Justin Johnson. 2021. Redcaps: Web-curated image-text data created by the people, for the people.arXiv preprint arXiv:2111.11431(2021)

  6. [6]

    Wei Dong. 2011. Kgraph: A library for approximate nearest neighbor search. Retrieved July12 (2011), 2020

  7. [7]

    Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. 2024. The faiss library.arXiv preprint arXiv:2401.08281(2024)

  8. [8]

    Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. 2024. From local to global: A graph rag approach to query-focused summarization.arXiv preprint arXiv:2404.16130(2024)

  9. [9]

    Joshua Engels, Benjamin Landrum, Shangdi Yu, Laxman Dhulipala, and Julian Shun. 2024. Approximate Nearest Neighbor Search with Window Filters.arXiv preprint arXiv:2402.00943(2024)

  10. [10]

    Mocheng Li et al. 2026. EMA: Appendix. https://github.com/lmccccc/EMA/blob/ main/EMA_appendix.pdf. Supplementary material

  11. [11]

    Xiyang Feng, Guodong Jin, Ziyi Chen, Chang Liu, and Semih Salihoğlu. 2023. Kùzu graph database management system. InThe Conference on Innovative Data Systems Research, Vol. 7. 25–35

  12. [12]

    Jerome H Friedman, Jon Louis Bentley, and Raphael Ari Finkel. 1977. An algo- rithm for finding best matches in logarithmic expected time.ACM Transactions on Mathematical Software (TOMS)3, 3 (1977), 209–226

  13. [13]

    Junhao Gan, Jianlin Feng, Qiong Fang, and Wilfred Ng. 2012. Locality-sensitive hashing scheme based on dynamic collision counting. InProceedings of the 2012 ACM SIGMOD international conference on management of data. 541–552

  14. [14]

    Tiezheng Ge, Kaiming He, Qifa Ke, and Jian Sun. 2013. Optimized product quantization.IEEE transactions on pattern analysis and machine intelligence36, 4 (2013), 744–755

  15. [15]

    Siddharth Gollapudi, Neel Karia, Varun Sivashankar, Ravishankar Krishnaswamy, Nikit Begwani, Swapnil Raz, Yiyong Lin, Yin Zhang, Neelam Mahapatro, Premku- mar Srinivasan, et al. 2023. Filtered-diskann: Graph algorithms for approximate nearest neighbor search with filters. InProceedings of the ACM Web Conference

  16. [16]

    Ruiqi Guo, Philip Sun, Erik Lindgren, Quan Geng, David Simcha, Felix Chern, and Sanjiv Kumar. 2020. Accelerating large-scale inference with anisotropic vector quantization. InInternational Conference on Machine Learning. PMLR, 3887–3896

  17. [17]

    Antonin Guttman. 1984. R-trees: A dynamic index structure for spatial searching. InProceedings of the 1984 ACM SIGMOD international conference on Management of data. 47–57

  18. [18]

    Ben Harwood and Tom Drummond. 2016. Fanng: Fast approximate nearest neighbour graphs. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5713–5722

  19. [19]

    Qiang Huang, Jianlin Feng, Yikai Zhang, Qiong Fang, and Wilfred Ng. 2015. Query-aware locality-sensitive hashing for approximate nearest neighbor search. Proceedings of the VLDB Endowment9, 1 (2015), 1–12

  20. [20]

    Piotr Indyk and Rajeev Motwani. 1998. Approximate nearest neighbors: towards removing the curse of dimensionality. InProceedings of the thirtieth annual ACM symposium on Theory of computing. 604–613

  21. [21]

    Suhas Jayaram Subramanya, Fnu Devvrit, Harsha Vardhan Simhadri, Ravishankar Krishnawamy, and Rohan Kadekodi. 2019. Diskann: Fast accurate billion-point nearest neighbor search on a single node.Advances in neural information pro- cessing Systems32 (2019)

  22. [22]

    Hervé Jégou, Romain Tavenard, Matthijs Douze, and Laurent Amsaleg. 2011. Searching in one billion vectors: re-rank with source coding. In2011 IEEE Inter- national Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 861–864

  23. [23]

    Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs.IEEE Transactions on Big Data7, 3 (2019), 535–547

  24. [24]

    Andrew Kane. n.d.. PGVector. https://github.com/pgvector/pgvector. 13

  25. [25]

    1968.Regionalization, Theory and alternative algorithms

    Philip M Lankford. 1968.Regionalization, Theory and alternative algorithms. Ph.D. Dissertation. The University of Chicago

  26. [26]

    Anqi Liang, Pengcheng Zhang, Bin Yao, Zhongpu Chen, Yitong Song, and Guangxu Cheng. 2024. UNIFY: Unified Index for Range Filtered Approximate Nearest Neighbors Search.arXiv preprint arXiv:2412.02448(2024)

  27. [27]

    Yury Malkov, Alexander Ponomarenko, Andrey Logvinov, and Vladimir Krylov

  28. [28]

    Approximate nearest neighbor algorithm based on navigable small world graphs.Information Systems45 (2014), 61–68

  29. [29]

    Yu A Malkov and Dmitry A Yashunin. 2018. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs.IEEE transactions on pattern analysis and machine intelligence42, 4 (2018), 824–836

  30. [30]

    Yusuke Matsui, Ryota Hinami, and Shin’ichi Satoh. 2018. Reconfigurable Inverted Index. InProceedings of the 26th ACM international conference on Multimedia. 1715–1723

  31. [31]

    Yitong Meng, Xinyan Dai, Xiao Yan, James Cheng, Weiwen Liu, Jun Guo, Benben Liao, and Guangyong Chen. 2020. Pmd: An optimal transportation-based user distance for recommender systems. InAdvances in Information Retrieval: 42nd European Conference on IR Research, ECIR 2020, Lisbon, Portugal, April 14–17, 2020, Proceedings, Part II 42. Springer, 272–280

  32. [32]

    Thomas Neumann and Guido Moerkotte. 2011. Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins. In2011 IEEE 27th International Conference on Data Engineering. IEEE, 984–994

  33. [33]

    Shumpei Okura, Yukihiro Tagami, Shingo Ono, and Akira Tajima. 2017. Embedding-based news recommendation for millions of users. InProceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining. 1933–1942

  34. [34]

    Liana Patel, Siddharth Jha, Carlos Guestrin, and Matei Zaharia. 2024. Lotus: Enabling semantic queries with llms over tables of unstructured and structured data.arXiv e-prints(2024), arXiv–2407

  35. [35]

    Liana Patel, Peter Kraft, Carlos Guestrin, and Matei Zaharia. 2024. Acorn: Per- formant and predicate-agnostic search over vector embeddings and structured data.Proceedings of the ACM on Management of Data2, 3 (2024), 1–27

  36. [36]

    Arkadiusz Paterek. 2007. Improving regularized singular value decomposition for collaborative filtering. InProceedings of KDD cup and workshop, Vol. 2007. 5–8

  37. [37]

    Zhencan Peng, Miao Qiao, Wenchao Zhou, Feifei Li, and Dong Deng. [n.d.]. Dynamic Range-Filtering Approximate Nearest Neighbor Search. ([n. d.])

  38. [38]

    Zhencan Peng, Miao Qiao, Wenchao Zhou, Feifei Li, and Dong Deng. 2025. Dynamic Range-Filtering Approximate Nearest Neighbor Search.Proceedings of the VLDB Endowment18, 10 (2025), 3256–3268

  39. [39]

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sand- hini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al

  40. [40]

    In International conference on machine learning

    Learning transferable visual models from natural language supervision. In International conference on machine learning. PmLR, 8748–8763

  41. [41]

    Gaurav Sehgal and Semih Salihoglu. 2025. NaviX: A Native Vector Index Design for Graph DBMSs With Robust Predicate-Agnostic Search Performance.arXiv preprint arXiv:2506.23397(2025)

  42. [42]

    Aditi Singh, Suhas Jayaram Subramanya, Ravishankar Krishnaswamy, and Har- sha Vardhan Simhadri. 2021. Freshdiskann: A fast and accurate graph-based ann index for streaming similarity search.arXiv preprint arXiv:2105.09613(2021)

  43. [43]

    Godfried T Toussaint. 1980. The relative neighbourhood graph of a finite planar set.Pattern recognition12, 4 (1980), 261–268

  44. [44]

    Jianguo Wang, Xiaomeng Yi, Rentong Guo, Hai Jin, Peng Xu, Shengjun Li, Xi- angyu Wang, Xiangzhou Guo, Chengming Li, Xiaohai Xu, et al. 2021. Milvus: A purpose-built vector data management system. InProceedings of the 2021 international conference on management of data. 2614–2627

  45. [45]

    Mengzhao Wang, Lingwei Lv, Xiaoliang Xu, Yuxiang Wang, Qiang Yue, and Jiongkang Ni. 2024. An efficient and robust framework for approximate near- est neighbor search with attribute constraint.Advances in Neural Information Processing Systems36 (2024)

  46. [46]

    Mengzhao Wang, Xiaoliang Xu, Qiang Yue, and Yuxiang Wang. 2021. A com- prehensive survey and experimental comparison of graph-based approximate nearest neighbor search.arXiv preprint arXiv:2101.12631(2021)

  47. [47]

    Shu Wang, Yixiang Fang, Yingli Zhou, Xilin Liu, and Yuchi Ma. 2025. ArchRAG: Attributed Community-based Hierarchical Retrieval-Augmented Generation. arXiv preprint arXiv:2502.09891(2025)

  48. [48]

    Weaviate. 2024. Weaviate. https://weaviate.io/

  49. [49]

    Chuangxian Wei, Bin Wu, Sheng Wang, Renjie Lou, Chaoqun Zhan, Feifei Li, and Yuanzhe Cai. 2020. AnalyticDB-V: a hybrid analytical engine towards query fusion for structured and unstructured data.Proceedings of the VLDB Endowment 13, 12 (2020), 3152–3165

  50. [50]

    Yuexuan Xu, Jianyang Gao, Yutong Gou, Cheng Long, and Christian S Jensen

  51. [51]

    irangegraph: Improvising range-dedicated graphs for range-filtering near- est neighbor search.Proceedings of the ACM on Management of Data2, 6 (2024), 1–26

  52. [52]

    Mingyu Yang, Wentao Li, Zhitao Shen, Chuan Xiao, and Wei Wang. 2025. ESG: Elastic Graphs for Range-Filtering Approximate k-Nearest Neighbor Search. arXiv preprint arXiv:2504.04018(2025)

  53. [53]

    Wen Yang, Tao Li, Gai Fang, and Hong Wei. 2020. Pase: Postgresql ultra-high- dimensional approximate nearest neighbor search extension. InProceedings of the 2020 ACM SIGMOD international conference on management of data. 2241–2253

  54. [54]

    Peter Yianilos. 1993. Data structures and algorithms for nearest neighbor search in general metric spaces. (1993)

  55. [55]

    Qianxi Zhang, Shuotao Xu, Qi Chen, Guoxin Sui, Jiadong Xie, Zhizhen Cai, Yaoqi Chen, Yinxuan He, Yuqing Yang, Fan Yang, et al . 2023. {VBASE}: Unifying online vector similarity search and relational queries via relaxed monotonicity. In17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23). 377–395

  56. [56]

    Weijie Zhao, Shulong Tan, and Ping Li. 2022. Constrained approximate similarity search on proximity graph.arXiv preprint arXiv:2210.14958(2022)

  57. [57]

    Yingli Zhou, Yaodong Su, Youran Sun, Shu Wang, Taotao Wang, Runyuan He, Yongwei Zhang, Sicong Liang, Xilin Liu, Yuchi Ma, et al. 2025. In-depth Analysis of Graph-based RAG in a Unified Framework.arXiv preprint arXiv:2503.04338 (2025)

  58. [58]

    Chaoji Zuo, Miao Qiao, Wenchao Zhou, Feifei Li, and Dong Deng. 2024. SeRF: seg- ment graph for range-filtering approximate nearest neighbor search.Proceedings of the ACM on Management of Data2, 1 (2024), 1–26. 14