vstash: Local-First Hybrid Retrieval with Adaptive Fusion for LLM Agents
Pith reviewed 2026-05-10 09:44 UTC · model grok-4.3
The pith
Hybrid search disagreement supplies free training triples that let a 33M-parameter embedding model match or exceed larger systems on BEIR benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that disagreement between vector-heavy and FTS-heavy retrieval runs on the same queries generates high-quality self-supervised training triples; fine-tuning a small embedding model on these triples, together with adaptive-IDF RRF, produces competitive or superior NDCG scores across BEIR datasets while remaining fully local and low-latency.
What carries the argument
Self-supervised embedding refinement via top-10 disagreement triples between vector-heavy (vec=0.95) and FTS-heavy (vec=0.05) runs, used to fine-tune with MultipleNegativesRankingLoss, combined with per-query IDF-weighted reciprocal rank fusion inside a single-file SQLite substrate.
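The fusion machinery can be sketched concretely. The following is an illustrative reconstruction, not the paper's code: the rule mapping mean query-term IDF to the vector/FTS weight split, the normalization cap, and the function names are assumptions; only the reciprocal-rank-fusion form and the conventional constant (k=60) are standard.

```python
def adaptive_rrf(vec_ranking, fts_ranking, idf_by_term, query_terms,
                 rrf_k=60, max_idf=10.0):
    """Fuse two rankings with per-query IDF-weighted Reciprocal Rank Fusion.

    vec_ranking / fts_ranking: doc ids ordered best-first.
    idf_by_term: corpus IDF per term; rare terms favor keyword search.
    """
    # Assumed weighting rule: mean query-term IDF, normalized to [0, 1],
    # sets the FTS share of the fused score; the paper's rule may differ.
    mean_idf = sum(idf_by_term.get(t, 0.0) for t in query_terms) / max(len(query_terms), 1)
    w_fts = min(mean_idf / max_idf, 1.0)
    w_vec = 1.0 - w_fts

    scores = {}
    for weight, ranking in ((w_vec, vec_ranking), (w_fts, fts_ranking)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (rrf_k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The design intuition matches the paper's claim: queries dominated by rare (high-IDF) terms lean on exact keyword matching, while common-term queries lean on dense similarity.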
If this is right
- The fine-tuned 33M-parameter model improves NDCG@10 on every BEIR dataset tested and matches or exceeds published 110M-parameter ColBERTv2 results on three of five datasets under different preprocessing.
- Adaptive per-query IDF weighting inside RRF raises NDCG@10 over fixed-weight RRF on all five datasets, with a 21.4 percent gain on ArguAna.
- Post-RRF techniques such as frequency-plus-decay scoring, history-augmented recall, and cross-encoder reranking produce no NDCG improvement.
- Median search latency stays at 20.9 ms for 50K chunks while NDCG remains stable, and a distance-based relevance signal is validated on 50,425 judged queries.
- The entire pipeline, including the fine-tuned model, runs from a single SQLite file with integrity checks and schema versioning.
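The disagreement-mining step behind these claims can be sketched as follows. This is a hedged illustration: the exact positive/negative selection rule is not specified here, so taking the vector-heavy top hit as the positive and the first FTS-only document as the hard negative is an assumption, as are the function names.

```python
def mine_disagreement_triples(queries, search, top_k=10):
    """Mine (query, positive, negative) triples from hybrid-weighting disagreement.

    `search(query, vec_weight)` is assumed to return doc ids best-first;
    vec_weight=0.95 is the vector-heavy run, 0.05 the FTS-heavy run.
    """
    triples, disagreements = [], 0
    for q in queries:
        vec_top = search(q, vec_weight=0.95)[:top_k]
        fts_top = search(q, vec_weight=0.05)[:top_k]
        if set(vec_top) == set(fts_top):
            continue  # the runs agree: no free training signal for this query
        disagreements += 1
        # Assumed selection rule: vector-heavy top hit as positive,
        # first document unique to the FTS-heavy run as hard negative.
        positive = vec_top[0]
        negatives = [d for d in fts_top if d not in vec_top]
        if negatives:
            triples.append((q, positive, negatives[0]))
    rate = disagreements / max(len(queries), 1)
    return triples, rate
```

Under this reading, the reported 74.5% figure is `rate` over the 753 BEIR queries, and the 76K triples feed MultipleNegativesRankingLoss fine-tuning.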
Where Pith is reading between the lines
- The same disagreement-triple extraction could be applied to any hybrid retrieval system that supports independent vector and keyword runs, turning an existing source of ranking conflict into training data without new annotation.
- Local agents that maintain their own document memory could repeatedly mine disagreements from their usage logs to keep small embeddings current without cloud calls.
- If disagreement rates remain high across domains, the approach offers a route to competitive retrieval performance while avoiding the parameter count and latency of large dense models.
- The validated distance-based relevance signal could serve as an online diagnostic for when a retrieval index needs re-tuning or additional documents.
Load-bearing premise
The 74.5 percent top-10 disagreement rate between vector and full-text runs supplies high-quality, generalizable training triples rather than dataset-specific artifacts or search-method biases.
What would settle it
Repeating the disagreement-triple extraction and fine-tuning procedure on an entirely new retrieval benchmark and finding that NDCG@10 does not improve over the untuned base model would falsify the claim that the method produces generalizable gains.
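Operationally, that falsification test reduces to comparing a single number before and after fine-tuning. NDCG@10 itself is standard; a minimal sketch, using the common graded-gain convention (2^rel − 1 with a log2 rank discount) that BEIR-style tooling typically follows:

```python
import math

def ndcg_at_k(ranked_ids, rels, k=10):
    """NDCG@k for one query. `rels` maps doc id -> graded relevance."""
    dcg = sum((2 ** rels.get(doc, 0) - 1) / math.log2(i + 2)
              for i, doc in enumerate(ranked_ids[:k]))
    ideal = sorted(rels.values(), reverse=True)[:k]
    idcg = sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

Averaging this over a new benchmark's judged queries for the base and fine-tuned models is the comparison the falsification test calls for.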
Original abstract
We present **vstash**, a local-first document memory system that combines vector similarity search with full-text keyword matching via Reciprocal Rank Fusion (RRF) and adaptive per-query IDF weighting. All data resides in a single SQLite file using sqlite-vec for approximate nearest neighbor search and FTS5 for keyword matching. We make four primary contributions. **(1)** Self-supervised embedding refinement via hybrid retrieval disagreement: across 753 BEIR queries on SciFact, NFCorpus, and FiQA, 74.5% produce top-10 disagreement between vector-heavy (vec=0.95, fts=0.05) and FTS-heavy (vec=0.05, fts=0.95) search (per-dataset rates 63.4% / 73.4% / 86.7%, Section 5.2), providing a free training signal without human labels. Fine-tuning BGE-small (33M params) with MultipleNegativesRankingLoss on 76K disagreement triples improves NDCG@10 on all 5 BEIR datasets (up to +19.5% on NFCorpus vs. BGE-small base RRF, Table 6). On 3 of 5 datasets, under different preprocessing, the tuned 33M-parameter pipeline matches or exceeds published ColBERTv2 results (110M params) and an untrained BGE-base (110M); on FiQA and ArguAna it underperforms ColBERTv2 (Section 5.5). **(2)** Adaptive RRF with per-query IDF weighting improves NDCG@10 on all 5 BEIR datasets versus fixed weights (up to +21.4% on ArguAna), achieving 0.7263 on SciFact with BGE-small. **(3)** A negative result on post-RRF scoring: frequency+decay, history-augmented recall, and cross-encoder reranking all failed to improve NDCG. **(4)** A production-grade substrate with integrity checking, schema versioning, ranking diagnostics, and a distance-based relevance signal validated on 50,425 relevance-judged queries across the 5 BEIR datasets. Search latency remains 20.9 ms median at 50K chunks with stable NDCG. The fine-tuned model is published as `Stffens/bge-small-rrf-v2` on HuggingFace. All code, data, and experiments are open-source.
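As a concrete illustration of the single-file substrate, the keyword half of such a hybrid query can run against SQLite's built-in FTS5. The table name and schema below are hypothetical, and the vector half (which requires the sqlite-vec extension) is omitted:

```python
import sqlite3

def fts_search(conn, query, k=10):
    """BM25-ranked keyword search via FTS5 (lower bm25() is better in SQLite)."""
    rows = conn.execute(
        "SELECT rowid FROM chunks WHERE chunks MATCH ? ORDER BY bm25(chunks) LIMIT ?",
        (query, k),
    )
    return [r[0] for r in rows]

conn = sqlite3.connect(":memory:")  # a single .db path in production
conn.execute("CREATE VIRTUAL TABLE chunks USING fts5(content)")
conn.executemany("INSERT INTO chunks(content) VALUES (?)",
                 [("reciprocal rank fusion for hybrid retrieval",),
                  ("sqlite stores everything in one file",)])
```

The appeal of this design is that both retrieval legs, the embeddings, and all metadata travel in one file, which is what makes the integrity-checked, schema-versioned substrate possible.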
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents vstash, a local-first hybrid retrieval system for LLM agents that fuses vector similarity (via sqlite-vec) and full-text search (FTS5) using Reciprocal Rank Fusion (RRF) with adaptive per-query IDF weighting, all stored in a single SQLite file. It reports four contributions: (1) self-supervised fine-tuning of BGE-small (33M params) on 76K disagreement triples mined from 74.5% top-10 disagreements between vector-heavy and FTS-heavy RRF runs on SciFact/NFCorpus/FiQA, yielding NDCG@10 gains on all five BEIR datasets (up to +19.5% on NFCorpus vs. base RRF) and matching/exceeding ColBERTv2 on three datasets under varied preprocessing; (2) adaptive RRF improving NDCG@10 over fixed weights (up to +21.4% on ArguAna); (3) negative results on post-RRF ideas like frequency+decay and cross-encoder reranking; (4) a production substrate with integrity checks, diagnostics, and validation on 50,425 judged queries, with 20.9 ms median latency at 50K chunks. The fine-tuned model is released as Stffens/bge-small-rrf-v2.
Significance. If the disagreement-based training signal holds, the work provides a practical, label-free route to refine small embedding models for hybrid local retrieval, potentially closing the gap with larger models like ColBERTv2 while emphasizing efficiency and reproducibility. Credit is due for the open release of code/data/model, the explicit negative results on post-fusion methods, the latency and scale validation, and the focus on a self-contained SQLite substrate suitable for agent memory.
major comments (1)
- [Section 5.5 / Table 6] The headline claim that fine-tuning on the 76K disagreement triples supplies a 'free training signal without human labels' and drives the NDCG@10 lifts (including +19.5% on NFCorpus) is load-bearing for contribution (1), yet no ablation is reported comparing these triples against random negatives, BM25 hard negatives, or the original BGE training negatives. Likewise, no split is shown evaluating gains only on the two held-out datasets (ArguAna, SciDocs) versus the three source datasets used for triple mining. Without these checks, it remains possible that the observed improvements reflect dataset-specific interactions between BGE embeddings and FTS5 tokenization rather than generalizable relevance signals.
minor comments (2)
- [Abstract / Section 5.5] The comparisons to published ColBERTv2 results are qualified as 'under different preprocessing'; this should be explicitly defined (e.g., chunking strategy, query preprocessing) to allow direct replication of the 'matches or exceeds' claim.
- [Section 5.2] The per-dataset disagreement rates (63.4% / 73.4% / 86.7%) are given, but the total of 753 queries and the exact construction of the 76K triples (e.g., how many triples per query, how positives and negatives are selected) would benefit from a small table or pseudocode for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. We address the single major comment below and will strengthen the manuscript with the requested analyses.
Point-by-point responses
- Referee: [Section 5.5 / Table 6] The headline claim that fine-tuning on the 76K disagreement triples supplies a 'free training signal without human labels' and drives the NDCG@10 lifts (including +19.5% on NFCorpus) is load-bearing for contribution (1), yet no ablation is reported comparing these triples against random negatives, BM25 hard negatives, or the original BGE training negatives. Likewise, no split is shown evaluating gains only on the two held-out datasets (ArguAna, SciDocs) versus the three source datasets used for triple mining. Without these checks, it remains possible that the observed improvements reflect dataset-specific interactions between BGE embeddings and FTS5 tokenization rather than generalizable relevance signals.
- Authors: We agree that the current manuscript lacks these ablations and splits, which would provide stronger validation of the disagreement triples as a generalizable training signal. In the revised version, we will add comparisons of the 76K triples against random negatives, BM25 hard negatives, and the original BGE training negatives. We will also report NDCG@10 results broken down by the source datasets (SciFact, NFCorpus, FiQA) versus the held-out datasets (ArguAna, SciDocs) to demonstrate whether gains transfer beyond the mining distribution. These additions will clarify whether improvements such as the +19.5% on NFCorpus stem from hybrid disagreement signals rather than dataset-specific interactions with FTS5 tokenization.
Revision promised: yes
Circularity Check
No significant circularity; all gains are measured against external ground truth.
full rationale
The paper's derivation chain generates 76K disagreement triples from base BGE-small RRF runs (vec/fts weightings) on three BEIR datasets and fine-tunes with MultipleNegativesRankingLoss, then reports NDCG@10 lifts on all five BEIR datasets against their independent human relevance judgments. No central quantity (NDCG improvement, adaptive RRF weights, or disagreement rate) is defined in terms of a fitted parameter that is then re-used as evidence; the evaluation metrics and ColBERTv2 comparisons remain external to the training heuristic. No self-citation, ansatz smuggling, or self-definitional reduction appears in the claims or equations. The result is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- The vector-vs-FTS weight extremes (vec = 0.95 and vec = 0.05) used to define the disagreement runs
axioms (1)
- Domain assumption: Reciprocal Rank Fusion is a reliable way to combine vector and keyword rankings.
Reference graph
Works this paper leans on
- [1] Carbonell, J., & Goldstein, J. (1998). The use of MMR, diversity-based reranking for reordering documents and producing summaries. SIGIR 1998.
- [2] Chhikara, P., Khant, D., Aryan, S., & Singh, T. (2025). Mem0: Building production-ready AI agents with scalable long-term memory. arXiv:2504.19413.
- [3] Cormack, G. V., Clarke, C. L. A., & Büttcher, S. (2009). Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. SIGIR 2009.
- [4] Ebbinghaus, H. (1885). Über das Gedächtnis. Duncker & Humblot.
- [5] Ma, X., Wang, Y., & Lin, J. (2024). Is reciprocal rank fusion all you need for hybrid retrieval? arXiv preprint, 2024.
- [8] Muennighoff, N., Tazi, N., Magne, L., & Reimers, N. (2023). MTEB: Massive Text Embedding Benchmark. EACL 2023.
- [9] Packer, C., et al. (2023). MemGPT: Towards LLMs as operating systems. arXiv:2310.08560.
- [11] Santhanam, K., et al. (2022). ColBERTv2: Effective and efficient retrieval via lightweight late interaction. NAACL 2022.
- [12] Thakur, N., et al. (2021). BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. NeurIPS Datasets and Benchmarks 2021.
- [13] Rasmussen, P., Paliychuk, P., Beauvais, T., & Ryan, J. (2025). Zep: A temporal knowledge graph architecture for agent memory. arXiv:2501.13956.
- [14] Xu, W., Liang, Z., Mei, K., & Gao, H. (2025). A-MEM: Agentic memory for LLM agents. arXiv:2502.12110.
- [15] Karpukhin, V., et al. (2020). Dense passage retrieval for open-domain question answering. EMNLP 2020.
- [16] Xiong, L., et al. (2020). Approximate nearest neighbor negative contrastive learning for dense text retrieval. ICLR 2021.
- [17] Zhan, J., et al. (2021). Optimizing dense retrieval model training with hard negatives. SIGIR 2021.
- [19] Chen, J., et al. (2024). BGE-M3: Multi-functionality, multi-linguality, and multi-granularity text embeddings through self-knowledge distillation. arXiv:2402.03216.