pith. machine review for the scientific record. sign in

arxiv: 2604.24552 · v1 · submitted 2026-04-27 · 💻 cs.DB

Recognition: unknown

BoomHQ: Learning to Boost Multiple Hybrid Queries on Vector DBMSs

Ermu Qiu, Jun Gao, Tianyi Chen, Xing Wei, Yang Lin, Yaofeng Tu, Yinjun Han

Authors on Pith no claims yet

Pith reviewed 2026-05-07 17:06 UTC · model grok-4.3

classification 💻 cs.DB
keywords hybrid queriesvector databasesautoencodersquery optimizationscalar predicatesquery rewritingnearest neighbor search
0
0 comments X

The pith

An autoencoder learns vector-scalar correlations and neighborhood selectivities to rewrite hybrid queries for faster execution on vector databases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Hybrid queries that combine vector nearest-neighbor searches with scalar predicates are hard to run efficiently when queries grow complex or multiple vectors appear. The paper shows that modeling the hidden links between vector embeddings and scalar values with an autoencoder, plus estimating how selective the scalars tend to be near a query vector, supplies the signals needed to rewrite each query with better execution hints. A reader would care because the approach promises large reductions in running time for repeated hybrid workloads while still meeting fixed recall targets and tolerating data changes.

Core claim

BoomHQ first trains an autoencoder on the joint distribution of vector and scalar attributes to capture their correlations in an update-friendly way. It then estimates the selectivity of scalar predicates inside the neighborhood of each incoming query vector to reflect observed query patterns. These two learned features are used to predict execution hints that rewrite the original hybrid query into an optimized plan. On extended benchmarks that embed realistic correlations, the rewrites produce a 2x average and over 25x peak speedup for batches of hybrid queries at given recall levels, and the same gains hold across three representative vector database systems even after updates.

What carries the argument

An autoencoder that encodes correlations between vector and scalar attributes together with a neighborhood-based estimator of scalar predicate selectivity; the pair generates query-rewrite hints.

Load-bearing premise

The correlations and neighborhood selectivity patterns the autoencoder learns stay stable and predictive on data and query workloads outside the extended benchmarks used in the evaluation.

What would settle it

Apply the method to a new collection where vector and scalar attributes are deliberately uncorrelated or where neighborhood selectivity patterns differ sharply from the training data, then measure whether the speedups vanish or recall falls below the target.

read the original abstract

Hybrid queries, which combine vector nearest neighbor searches with scalar predicates, represent a fundamental challenge in managing vector databases. Existing methods often restrict the number of vector columns involved or the complexity of scalar predicates, thereby limiting their flexibility in handling diverse query patterns. Moreover, these approaches typically do not fully leverage the correlations between scalar and vector attributes, or the distributional patterns observed from query vector neighborhoods. To address these limitations, we introduce BoomHQ, a learning-based framework to boost multiple hybrid queries on vector DBMSs. First, BoomHQ models the correlation between vector and scalar attributes using an autoencoder-based architecture, which is also friendly to data updates. Second, BoomHQ captures prevailing query patterns, particularly using estimated selectivity of scalar predicates within the neighborhood of a query vector. Guided by these two key features, BoomHQ predicts the execution hints and rewrites the original query into an optimized version. Furthermore, we extend well-known benchmarks by introducing vector and scalar data with inherent correlations to better evaluate query execution. Experimental results demonstrate that for multiple hybrid queries at specified recall thresholds, our method achieves a 2x average and over 25x peak speedup compared to the state-of-the-art. Additionally, BoomHQ shows strong robustness against data updates and consistent optimization effectiveness across three representative vector database systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. BoomHQ is a learning-based framework for optimizing hybrid queries (vector nearest-neighbor search combined with scalar predicates) on vector DBMSs. It trains an autoencoder to capture correlations between vector and scalar attributes and estimates scalar-predicate selectivity inside the neighborhood of a query vector; these features are used to predict execution hints that rewrite the original query. The paper extends standard benchmarks by injecting vector-scalar correlations, evaluates on three representative vector databases, and reports a 2× average and >25× peak speedup at fixed recall thresholds together with robustness to data updates.

Significance. If the reported speedups and robustness hold under broader conditions, the work would be significant: hybrid queries are a core workload for vector databases, and a practical learning-based rewriting approach that exploits both attribute correlations and local selectivity could improve query performance without sacrificing recall. The multi-system evaluation and explicit handling of updates are positive features.

major comments (3)
  1. [§5] §5 (Experimental Evaluation): The central speedup claims rest on benchmarks that the authors themselves extended by deliberately introducing vector-scalar correlations. No cross-distribution transfer experiments, sensitivity analysis to correlation strength, or evaluation on unmodified real-world datasets are described; therefore the generalization of the autoencoder-learned features and neighborhood selectivity estimates to unseen data distributions remains unproven and directly affects the reliability of the 2×/25× claims.
  2. [§4.2] §4.2 (Query Rewriting and Hint Prediction): The paper states that the autoencoder is “friendly to data updates,” yet provides no quantitative evaluation of how often the model must be retrained, how update batches affect prediction accuracy, or whether the reported robustness holds when updates alter the learned correlations. This is load-bearing for the robustness claim.
  3. [Table 2] Table 2 / Figure 6 (Speedup results): The reported speedups lack error bars, confidence intervals, or statistical significance tests across the 22 query templates. Without these, it is impossible to judge whether the 2× average and 25× peak figures are stable or sensitive to particular query or data realizations.
minor comments (2)
  1. [§3.2] The notation for neighborhood selectivity (e.g., S(q, r)) is introduced without a clear formal definition or pseudocode; a small example would improve readability.
  2. [Figure 4] Figure 4 (autoencoder architecture) uses inconsistent font sizes and omits the exact loss function and training hyperparameters; these should be stated explicitly or referenced to an appendix.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be made to improve the manuscript.

read point-by-point responses
  1. Referee: [§5] §5 (Experimental Evaluation): The central speedup claims rest on benchmarks that the authors themselves extended by deliberately introducing vector-scalar correlations. No cross-distribution transfer experiments, sensitivity analysis to correlation strength, or evaluation on unmodified real-world datasets are described; therefore the generalization of the autoencoder-learned features and neighborhood selectivity estimates to unseen data distributions remains unproven and directly affects the reliability of the 2×/25× claims.

    Authors: We agree that our evaluation relies on extended benchmarks with deliberately injected correlations to isolate and demonstrate the value of modeling vector-scalar correlations and neighborhood selectivity. The autoencoder is explicitly designed to learn and exploit correlations present in the data; on uncorrelated data the method is expected to fall back to baseline performance, which aligns with its intended use case for hybrid workloads where such correlations commonly exist. To strengthen the generalization claims, we will add (1) a sensitivity analysis varying correlation strength and (2) cross-distribution transfer experiments (train on one synthetic distribution, test on another). For unmodified real-world datasets, publicly available vector collections with rich scalar attributes and documented correlations are scarce; we will evaluate on additional real-world-inspired datasets where possible and explicitly discuss this limitation and its implications in the revised paper. revision: partial

  2. Referee: [§4.2] §4.2 (Query Rewriting and Hint Prediction): The paper states that the autoencoder is “friendly to data updates,” yet provides no quantitative evaluation of how often the model must be retrained, how update batches affect prediction accuracy, or whether the reported robustness holds when updates alter the learned correlations. This is load-bearing for the robustness claim.

    Authors: We acknowledge that the current manuscript states the autoencoder is friendly to updates (via its architecture supporting periodic retraining) but does not provide quantitative measurements of retraining frequency, accuracy degradation under update batches, or behavior when updates change the underlying correlations. In the revision we will add dedicated experiments that measure prediction accuracy as a function of update batch size and frequency, and that evaluate robustness when updates deliberately alter learned correlations. These results will be reported alongside the existing robustness claims to make the evidence concrete. revision: yes

  3. Referee: [Table 2] Table 2 / Figure 6 (Speedup results): The reported speedups lack error bars, confidence intervals, or statistical significance tests across the 22 query templates. Without these, it is impossible to judge whether the 2× average and 25× peak figures are stable or sensitive to particular query or data realizations.

    Authors: We agree that the absence of error bars, confidence intervals, and statistical tests limits the ability to assess result stability. In the revised version we will augment Table 2 and Figure 6 with error bars (standard deviation across repeated runs or query instances) and include statistical significance tests (e.g., paired t-tests) comparing BoomHQ against baselines across the 22 query templates. This will allow readers to evaluate whether the reported 2× average and >25× peak speedups are robust. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the empirical learning-based framework.

full rationale

The paper describes an empirical, learning-based system that trains an autoencoder on data to capture vector-scalar correlations and neighborhood selectivity, then uses the trained model to generate execution hints for query rewriting. All performance claims (2x average speedup, 25x peak) are measured outcomes against external baselines on concrete vector DBMSs and author-extended benchmarks; no closed-form derivation, self-definitional equations, or fitted parameters renamed as predictions appear. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are referenced in the text. The approach is self-contained as a data-driven optimizer whose validity rests on experimental falsifiability rather than reduction to its own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that learned correlations and selectivity estimates transfer to unseen queries and remain robust to updates; no explicit free parameters or invented entities are named in the abstract, but the autoencoder training implicitly introduces hyperparameters and the benchmark extension introduces new data distributions.

free parameters (1)
  • autoencoder hyperparameters
    Training choices for the autoencoder that models vector-scalar correlations are not detailed but must be selected to achieve the reported performance.
axioms (2)
  • domain assumption Correlations between vector and scalar attributes exist and can be captured by an autoencoder in a manner stable to data updates.
    Invoked to justify the first modeling step and update-friendliness claim.
  • domain assumption Query vector neighborhoods provide reliable estimates of scalar predicate selectivity.
    Central to the second feature used for hint prediction.

pith-pipeline@v0.9.0 · 5540 in / 1486 out tokens · 47943 ms · 2026-05-07T17:06:41.025812+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

40 extracted references

  1. [1]

    In: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pp

    Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pp. 604–613 (1998)

  2. [2]

    Journal of the ACM (JACM)45(6), 891–923 (1998)

    Arya, S., Mount, D.M., Netanyahu, N.S., Silverman, R., Wu, A.Y.: An optimal algorithm for approximate nearest neighbor searching fixed dimensions. Journal of the ACM (JACM)45(6), 891–923 (1998)

  3. [3]

    Proceedings of the VLDB Endowment14(11), 1964–1978 (2021)

    Wang, M., Xu, X., Yue, Q., Wang, Y.: A comprehensive survey and experimental comparison of graph-based approximate nearest neighbor search. Proceedings of the VLDB Endowment14(11), 1964–1978 (2021)

  4. [4]

    Information Processing & Management63(2), 104459 (2026)

    Li, Y., Wang, S., Chen, Z., Peng, Z.: Efficient low-rank index routing for high- dimensional approximate nearest neighbor search. Information Processing & Management63(2), 104459 (2026)

  5. [5]

    World Wide Web29(1), 8 (2026)

    Zhu, S., Wang, Y., Jin, X., Zeng, J., Wang, S., Sun, Y., Lai, Y., Peng, Z.: Braveann: Robust approximate nearest neighbor search for billion-scale vectors. World Wide Web29(1), 8 (2026)

  6. [6]

    IEEE Transactions on Knowledge and Data Engineering32(8), 1475–1488 (2019)

    Li, W., Zhang, Y., Sun, Y., Wang, W., Li, M., Zhang, W., Lin, X.: Approximate nearest neighbor search on high dimensional data—experiments, analyses, and improvement. IEEE Transactions on Knowledge and Data Engineering32(8), 1475–1488 (2019)

  7. [7]

    IEEE transactions on pattern analysis and machine intelligence42(4), 824–836 (2018)

    Malkov, Y.A., Yashunin, D.A.: Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE transactions on pattern analysis and machine intelligence42(4), 824–836 (2018)

  8. [8]

    Information Systems87, 101374 (2020)

    Aum¨ uller, M., Bernhardsson, E., Faithfull, A.: Ann-benchmarks: A benchmark- ing tool for approximate nearest neighbor algorithms. Information Systems87, 101374 (2020)

  9. [9]

    In: Proceedings of the 2021 International Conference on Management of Data, pp

    Wang, J., Yi, X., Guo, R., Jin, H., Xu, P., Li, S., Wang, X., Guo, X., Li, C., Xu, X., et al.: Milvus: A purpose-built vector data management system. In: Proceedings of the 2021 International Conference on Management of Data, pp. 2614–2627 (2021) 24

  10. [10]

    https://weaviate.io/

  11. [11]

    https://www.pinecone.io/

  12. [12]

    https://github.com/pgvector/pgvector

  13. [13]

    https://opensearch.org/

  14. [14]

    https://www.elastic.co/elasticsearch/

  15. [15]

    Proceedings of the VLDB Endowment18(4), 1118–1130 (2024)

    Liang, A., Zhang, P., Yao, B., Chen, Z., Song, Y., Cheng, G.: Unify: Unified index for range filtered approximate nearest neighbors search. Proceedings of the VLDB Endowment18(4), 1118–1130 (2024)

  16. [16]

    In: Proceedings of the ACM Web Conference 2023, pp

    Gollapudi, S., Karia, N., Sivashankar, V., Krishnaswamy, R., Begwani, N., Raz, S., Lin, Y., Zhang, Y., Mahapatro, N., Srinivasan, P.,et al.: Filtered-diskann: Graph algorithms for approximate nearest neighbor search with filters. In: Proceedings of the ACM Web Conference 2023, pp. 3406–3416 (2023)

  17. [17]

    In: 2025 IEEE 41st International Conference on Data Engineering (ICDE), pp

    Qiu, E., Gao, J., Tu, Y., Yang, J.: Liftus: An adaptive multi-aspect column rep- resentation learning for table union search. In: 2025 IEEE 41st International Conference on Data Engineering (ICDE), pp. 2174–2187 (2025). IEEE

  18. [18]

    Proceedings of the ACM on Management of Data3(1), 1–28 (2025)

    Yin, Z., Gao, J., Balsebre, P., Cong, G., Long, C.: Deg: Efficient hybrid vector search using the dynamic edge navigation graph. Proceedings of the ACM on Management of Data3(1), 1–28 (2025)

  19. [19]

    In: 2025 IEEE 41st International Conference on Data Engineering (ICDE), pp

    Fu, Y., Chen, C., Chen, Y., Wong, W.-F., He, B.: Vista: Vector indexing and search for large-scale imbalanced datasets. In: 2025 IEEE 41st International Conference on Data Engineering (ICDE), pp. 543–556 (2025). IEEE

  20. [20]

    Proceedings of the VLDB Endowment18(12), 5488–5492 (2025)

    Chronis, Y., Caminal, H., Papakonstantinou, Y., ¨Ozcan, F., Ailamaki, A.: Fil- tered vector search: State-of-the-art and research opportunities. Proceedings of the VLDB Endowment18(12), 5488–5492 (2025)

  21. [21]

    Proceedings of the VLDB Endowment13(12), 3152–3165 (2020)

    Wei, C., Wu, B., Wang, S., Lou, R., Zhan, C., Li, F., Cai, Y.: Analyticdb-v: A hybrid analytical engine towards query fusion for structured and unstructured data. Proceedings of the VLDB Endowment13(12), 3152–3165 (2020)

  22. [23]

    In: 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23), pp

    Zhang, Q., Xu, S., Chen, Q., Sui, G., Xie, J., Cai, Z., Chen, Y., He, Y., Yang, Y., Yang, F.,et al.:{VBASE}: Unifying online vector similarity search and relational queries via relaxed monotonicity. In: 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23), pp. 377–395 (2023) 25

  23. [24]

    https://hybridqueriesbenchmark.github.io/index.html

  24. [25]

    IEEE transactions on pattern analysis and machine intelligence33(1), 117–128 (2010)

    Jegou, H., Douze, M., Schmid, C.: Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence33(1), 117–128 (2010)

  25. [26]

    In: International Conference on Similarity Search and Applications, pp

    Malkov, Y., Ponomarenko, A., Logvinov, A., Krylov, V.: Scalable distributed algorithm for approximate nearest neighbor search problem in high dimensional general metric spaces. In: International Conference on Similarity Search and Applications, pp. 132–147 (2012). Springer

  26. [27]

    Information Systems 45, 61–68 (2014)

    Malkov, Y., Ponomarenko, A., Logvinov, A., Krylov, V.: Approximate nearest neighbor algorithm based on navigable small world graphs. Information Systems 45, 61–68 (2014)

  27. [28]

    In: International Conference on Information and Communication Technologies & Applications, vol

    Ponomarenko, A., Malkov, Y., Logvinov, A., Krylov, V.: Approximate near- est neighbor search small world approach. In: International Conference on Information and Communication Technologies & Applications, vol. 17 (2011)

  28. [29]

    Advances in neural information processing Systems32(2019)

    Jayaram Subramanya, S., Devvrit, F., Simhadri, H.V., Krishnawamy, R., Kadekodi, R.: Diskann: Fast accurate billion-point nearest neighbor search on a single node. Advances in neural information processing Systems32(2019)

  29. [30]

    Advances in Neural Information Processing Systems 33, 10672–10684 (2020)

    Ren, J., Zhang, M., Li, D.: Hm-ann: Efficient billion-point nearest neighbor search on heterogeneous memory. Advances in Neural Information Processing Systems 33, 10672–10684 (2020)

  30. [31]

    Proceedings of the VLDB Endowment 12(5), 461–474 (2019)

    Fu, C., Xiang, C., Wang, C., Cai, D.: Fast approximate nearest neighbor search with the navigating spreading-out graph. Proceedings of the VLDB Endowment 12(5), 461–474 (2019)

  31. [32]

    Proceedings of the ACM on Management of Data2(1), 1–26 (2024)

    Zuo, C., Qiao, M., Zhou, W., Li, F., Deng, D.: Serf: segment graph for range- filtering approximate nearest neighbor search. Proceedings of the ACM on Management of Data2(1), 1–26 (2024)

  32. [33]

    Proceedings of the ACM on Management of Data2(6), 1–26 (2024)

    Xu, Y., Gao, J., Gou, Y., Long, C., Jensen, C.S.: irangegraph: Improvising range- dedicated graphs for range-filtering nearest neighbor search. Proceedings of the ACM on Management of Data2(6), 1–26 (2024)

  33. [34]

    Proceedings of the ACM on Management of Data2(6), 1–27 (2024)

    Cai, Y., Shi, J., Chen, Y., Zheng, W.: Navigating labels and vectors: A unified approach to filtered approximate nearest neighbor search. Proceedings of the ACM on Management of Data2(6), 1–27 (2024)

  34. [35]

    Proceedings of the ACM on Management of Data2(3), 1–27 (2024) 26

    Patel, L., Kraft, P., Guestrin, C., Zaharia, M.: Acorn: Performant and predicate- agnostic search over vector embeddings and structured data. Proceedings of the ACM on Management of Data2(3), 1–27 (2024) 26

  35. [36]

    In: Proceedings of the 2021 International Conference on Management of Data, pp

    Sun, J., Li, G., Tang, N.: Learned cardinality estimation for similarity queries. In: Proceedings of the 2021 International Conference on Management of Data, pp. 1745–1757 (2021)

  36. [37]

    In: 2023 IEEE 39th International Conference on Data Engineering (ICDE), pp

    Zheng, B., Yue, Z., Hu, Q., Yi, X., Luan, X., Xie, C., Zhou, X., Jensen, C.S.: Learned probing cardinality estimation for high-dimensional approximate nn search. In: 2023 IEEE 39th International Conference on Data Engineering (ICDE), pp. 3209–3221 (2023). IEEE

  37. [38]

    In: Proceedings of the ACM on Web Conference 2025, pp

    Zeng, X., Deng, L., Chen, P., Chen, X., Su, H., Zheng, K.: Lira: A learning-based query-aware partition framework for large-scale ann search. In: Proceedings of the ACM on Web Conference 2025, pp. 2729–2741 (2025)

  38. [39]

    In: Proceedings of the 41st International Conference on Machine Learning, pp

    Engels, J., Landrum, B., Yu, S., Dhulipala, L., Shun, J.: Approximate nearest neighbor search with window filters. In: Proceedings of the 41st International Conference on Machine Learning, pp. 12469–12490 (2024)

  39. [40]

    In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp

    Maas, A., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y., Potts, C.: Learning word vectors for sentiment analysis. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 142–150 (2011)

  40. [41]

    https://www.tpc.org/tpch/ 27