pith. machine review for the scientific record.

arxiv: 2604.15484 · v1 · submitted 2026-04-16 · 💻 cs.IR


vstash: Local-First Hybrid Retrieval with Adaptive Fusion for LLM Agents

Jayson Steffens


Pith reviewed 2026-05-10 09:44 UTC · model grok-4.3

classification 💻 cs.IR
keywords hybrid retrieval · reciprocal rank fusion · self-supervised fine-tuning · embedding models · BEIR benchmark · vector search · full-text search · local-first systems

The pith

Hybrid search disagreement supplies free training triples that let a 33M-parameter embedding model match or exceed larger systems on BEIR benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces vstash as a local-first document store that runs vector similarity and full-text keyword search inside a single SQLite file. It observes that vector-heavy and keyword-heavy rankings disagree on the top-10 results for 74.5 percent of queries across three BEIR collections, creating unlabeled training triples at no extra cost. Fine-tuning the 33-million-parameter BGE-small model on 76,000 such triples using MultipleNegativesRankingLoss raises NDCG@10 on all five tested BEIR datasets, with the largest gain of 19.5 percent on NFCorpus. The same tuned pipeline equals or beats published ColBERTv2 results on three of the five datasets under different preprocessing. Adaptive per-query IDF weighting inside reciprocal rank fusion adds further gains, while several post-fusion reranking methods show no benefit.
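The single-file substrate can be exercised halfway with Python's standard library alone; this sketch covers only the FTS5 keyword side (sqlite-vec is a separate loadable extension and is omitted, and the table name `chunks` is illustrative, not vstash's actual schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")  # one file (or memory) holds everything
con.execute("CREATE VIRTUAL TABLE chunks USING fts5(body)")
con.executemany(
    "INSERT INTO chunks(body) VALUES (?)",
    [("reciprocal rank fusion of search lists",),
     ("vector similarity with small embeddings",),
     ("keyword matching via an inverted index",)],
)
# BM25-ranked keyword run; in vstash, sqlite-vec would supply the
# parallel vector run, and RRF would fuse the two ranked lists.
rows = con.execute(
    "SELECT rowid, body FROM chunks WHERE chunks MATCH ? "
    "ORDER BY bm25(chunks)",
    ("keyword",),
).fetchall()
```

This requires a Python build whose bundled SQLite includes FTS5, which is the common case on current releases.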

Core claim

The central claim is that disagreement between vector-heavy and FTS-heavy retrieval runs on the same queries generates high-quality self-supervised training triples; fine-tuning a small embedding model on these triples, together with adaptive-IDF RRF, produces competitive or superior NDCG scores across BEIR datasets while remaining fully local and low-latency.

What carries the argument

Self-supervised embedding refinement via top-10 disagreement triples between vector-heavy (vec=0.95) and FTS-heavy (vec=0.05) runs, used to fine-tune with MultipleNegativesRankingLoss, combined with per-query IDF-weighted reciprocal rank fusion inside a single-file SQLite substrate.
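The mining step can be sketched in a few lines; the extreme fusion weights (0.95/0.05) and the top-10 depth come from the paper, while the positive/negative selection policy shown here is an illustrative assumption, not vstash's documented recipe:

```python
def rrf_fuse(vec_ranking, fts_ranking, vec_w, fts_w, k=60):
    """Weighted reciprocal rank fusion of two ranked doc-id lists."""
    scores = {}
    for rank, doc in enumerate(vec_ranking, start=1):
        scores[doc] = scores.get(doc, 0.0) + vec_w / (k + rank)
    for rank, doc in enumerate(fts_ranking, start=1):
        scores[doc] = scores.get(doc, 0.0) + fts_w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def mine_disagreement_triples(query, vec_ranking, fts_ranking, depth=10):
    """Emit (query, positive, negative) triples when the vector-heavy
    and FTS-heavy top-`depth` lists disagree (vec=0.95 vs vec=0.05)."""
    vec_heavy = rrf_fuse(vec_ranking, fts_ranking, 0.95, 0.05)[:depth]
    fts_heavy = rrf_fuse(vec_ranking, fts_ranking, 0.05, 0.95)[:depth]
    if set(vec_heavy) == set(fts_heavy):
        return []  # no disagreement -> no training signal
    # One simple policy (an assumption, not the paper's exact recipe):
    # docs surfaced only by the vector-heavy run serve as positives,
    # docs surfaced only by the FTS-heavy run as hard negatives.
    only_vec = [d for d in vec_heavy if d not in fts_heavy]
    only_fts = [d for d in fts_heavy if d not in vec_heavy]
    return [(query, p, n) for p, n in zip(only_vec, only_fts)]
```

Run over every query in a collection, this yields the kind of label-free triple set the paper fine-tunes on.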

If this is right

  • The fine-tuned 33M-parameter model improves NDCG@10 on every BEIR dataset tested and matches or exceeds published 110M-parameter ColBERTv2 results on three of five datasets under different preprocessing.
  • Adaptive per-query IDF weighting inside RRF raises NDCG@10 over fixed-weight RRF on all five datasets, with a 21.4 percent gain on ArguAna.
  • Post-RRF techniques such as frequency-plus-decay scoring, history-augmented recall, and cross-encoder reranking produce no NDCG improvement.
  • Median search latency stays at 20.9 ms for 50K chunks while NDCG remains stable, and a distance-based relevance signal is validated on 50,425 judged queries.
  • The entire pipeline, including the fine-tuned model, runs from a single SQLite file with integrity checks and schema versioning.
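The paper's exact adaptive weighting formula is not reproduced here; one plausible reading, sketched below, scales the keyword list's RRF weight with the mean IDF of the query terms, so rare-term queries lean on full-text search and common-term queries lean on the vector run (the `idf_scale` constant and the BM25-style smoothing are assumptions):

```python
import math

def mean_idf(query_terms, doc_freqs, n_docs):
    """Mean smoothed IDF of the query terms (BM25-style)."""
    idfs = [math.log(1 + (n_docs - doc_freqs.get(t, 0) + 0.5)
                     / (doc_freqs.get(t, 0) + 0.5))
            for t in query_terms]
    return sum(idfs) / len(idfs)

def adaptive_rrf(vec_ranking, fts_ranking, query_terms,
                 doc_freqs, n_docs, k=60, idf_scale=5.0):
    """RRF where the keyword list's weight grows with how rare
    (high-IDF) the query's terms are; the vector list takes the rest."""
    fts_w = min(1.0, mean_idf(query_terms, doc_freqs, n_docs) / idf_scale)
    vec_w = 1.0 - fts_w
    scores = {}
    for rank, doc in enumerate(vec_ranking, start=1):
        scores[doc] = scores.get(doc, 0.0) + vec_w / (k + rank)
    for rank, doc in enumerate(fts_ranking, start=1):
        scores[doc] = scores.get(doc, 0.0) + fts_w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

With a rare term like "zymurgy" the FTS ranking dominates; with a stopword-like term the vector ranking does.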

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same disagreement-triple extraction could be applied to any hybrid retrieval system that supports independent vector and keyword runs, turning an existing source of ranking conflict into training data without new annotation.
  • Local agents that maintain their own document memory could repeatedly mine disagreements from their usage logs to keep small embeddings current without cloud calls.
  • If disagreement rates remain high across domains, the approach offers a route to competitive retrieval performance while avoiding the parameter count and latency of large dense models.
  • The validated distance-based relevance signal could serve as an online diagnostic for when a retrieval index needs re-tuning or additional documents.

Load-bearing premise

The 74.5 percent top-10 disagreement rate between vector and full-text runs supplies high-quality, generalizable training triples rather than dataset-specific artifacts or search-method biases.

What would settle it

Repeating the disagreement-triple extraction and fine-tuning procedure on an entirely new retrieval benchmark and finding that NDCG@10 does not improve over the untuned base model would falsify the claim that the method produces generalizable gains.

Figures

Figures reproduced from arXiv: 2604.15484 by Jayson Steffens.

Figure 1: vstash architecture.
Original abstract

We present **vstash**, a local-first document memory system that combines vector similarity search with full-text keyword matching via Reciprocal Rank Fusion (RRF) and adaptive per-query IDF weighting. All data resides in a single SQLite file using sqlite-vec for approximate nearest neighbor search and FTS5 for keyword matching. We make four primary contributions. **(1)** Self-supervised embedding refinement via hybrid retrieval disagreement: across 753 BEIR queries on SciFact, NFCorpus, and FiQA, 74.5% produce top-10 disagreement between vector-heavy (vec=0.95, fts=0.05) and FTS-heavy (vec=0.05, fts=0.95) search (per-dataset rates 63.4% / 73.4% / 86.7%, Section 5.2), providing a free training signal without human labels. Fine-tuning BGE-small (33M params) with MultipleNegativesRankingLoss on 76K disagreement triples improves NDCG@10 on all 5 BEIR datasets (up to +19.5% on NFCorpus vs. BGE-small base RRF, Table 6). On 3 of 5 datasets, under different preprocessing, the tuned 33M-parameter pipeline matches or exceeds published ColBERTv2 results (110M params) and an untrained BGE-base (110M); on FiQA and ArguAna it underperforms ColBERTv2 (Section 5.5). **(2)** Adaptive RRF with per-query IDF weighting improves NDCG@10 on all 5 BEIR datasets versus fixed weights (up to +21.4% on ArguAna), achieving 0.7263 on SciFact with BGE-small. **(3)** A negative result on post-RRF scoring: frequency+decay, history-augmented recall, and cross-encoder reranking all failed to improve NDCG. **(4)** A production-grade substrate with integrity checking, schema versioning, ranking diagnostics, and a distance-based relevance signal validated on 50,425 relevance-judged queries across the 5 BEIR datasets. Search latency remains 20.9 ms median at 50K chunks with stable NDCG. The fine-tuned model is published as `Stffens/bge-small-rrf-v2` on HuggingFace. All code, data, and experiments are open-source.
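MultipleNegativesRankingLoss scores each query against every in-batch positive plus the explicit negatives and applies softmax cross-entropy with the query's own positive as the target. A minimal numeric sketch of that objective follows (plain Python; the actual fine-tuning uses the sentence-transformers implementation, not this function):

```python
import math

def mnr_loss(query_embs, pos_embs, neg_embs, scale=20.0):
    """In-batch multiple-negatives ranking loss: for each query, softmax
    cross-entropy over scaled cosine similarities to every in-batch
    positive plus every explicit (hard) negative; the i-th query's own
    positive (column i) is the target class."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)

    candidates = pos_embs + neg_embs        # columns of the similarity matrix
    total = 0.0
    for i, q in enumerate(query_embs):
        logits = [scale * cos(q, c) for c in candidates]
        log_z = math.log(sum(math.exp(l) for l in logits))
        total += log_z - logits[i]          # -log softmax(target = i)
    return total / len(query_embs)
```

The loss is near zero when each query embedding already sits closest to its own positive, and large when positives are shuffled, which is what pulls the fine-tuned encoder toward the mined relevance signal.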

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper presents vstash, a local-first hybrid retrieval system for LLM agents that fuses vector similarity (via sqlite-vec) and full-text search (FTS5) using Reciprocal Rank Fusion (RRF) with adaptive per-query IDF weighting, all stored in a single SQLite file. It reports four contributions: (1) self-supervised fine-tuning of BGE-small (33M params) on 76K disagreement triples mined from 74.5% top-10 disagreements between vector-heavy and FTS-heavy RRF runs on SciFact/NFCorpus/FiQA, yielding NDCG@10 gains on all five BEIR datasets (up to +19.5% on NFCorpus vs. base RRF) and matching/exceeding ColBERTv2 on three datasets under varied preprocessing; (2) adaptive RRF improving NDCG@10 over fixed weights (up to +21.4% on ArguAna); (3) negative results on post-RRF ideas like frequency+decay and cross-encoder reranking; (4) a production substrate with integrity checks, diagnostics, and validation on 50,425 judged queries, with 20.9 ms median latency at 50K chunks. The fine-tuned model is released as Stffens/bge-small-rrf-v2.

Significance. If the disagreement-based training signal holds, the work provides a practical, label-free route to refine small embedding models for hybrid local retrieval, potentially closing the gap with larger models like ColBERTv2 while emphasizing efficiency and reproducibility. Credit is due for the open release of code/data/model, the explicit negative results on post-fusion methods, the latency and scale validation, and the focus on a self-contained SQLite substrate suitable for agent memory.

major comments (1)
  1. [Section 5.5 / Table 6] The headline claim that fine-tuning on the 76K disagreement triples supplies a 'free training signal without human labels' and drives the NDCG@10 lifts (including +19.5% on NFCorpus) is load-bearing for contribution (1), yet no ablation is reported comparing these triples against random negatives, BM25 hard negatives, or the original BGE training negatives. Likewise, no split is shown evaluating gains only on the two held-out datasets (ArguAna, SciDocs) versus the three source datasets used for triple mining. Without these checks, it remains possible that the observed improvements reflect dataset-specific interactions between BGE embeddings and FTS5 tokenization rather than generalizable relevance signals.
minor comments (2)
  1. [Abstract / Section 5.5] The comparisons to published ColBERTv2 results note 'under different preprocessing'; this should be explicitly defined (e.g., chunking strategy, query preprocessing) to allow direct replication of the 'matches or exceeds' claim.
  2. [Section 5.2] The per-dataset disagreement rates (63.4% / 73.4% / 86.7%) are given, but the total 753 queries and the exact construction of the 76K triples (e.g., how many per query, positive/negative selection) would benefit from a small table or pseudocode for clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address the single major comment below and will strengthen the manuscript with the requested analyses.

Point-by-point responses
  1. Referee: [Section 5.5 / Table 6] The headline claim that fine-tuning on the 76K disagreement triples supplies a 'free training signal without human labels' and drives the NDCG@10 lifts (including +19.5% on NFCorpus) is load-bearing for contribution (1), yet no ablation is reported comparing these triples against random negatives, BM25 hard negatives, or the original BGE training negatives. Likewise, no split is shown evaluating gains only on the two held-out datasets (ArguAna, SciDocs) versus the three source datasets used for triple mining. Without these checks, it remains possible that the observed improvements reflect dataset-specific interactions between BGE embeddings and FTS5 tokenization rather than generalizable relevance signals.

    Authors: We agree that the current manuscript lacks these ablations and splits, which would provide stronger validation of the disagreement triples as a generalizable training signal. In the revised version, we will add comparisons of the 76K triples against random negatives, BM25 hard negatives, and the original BGE training negatives. We will also report NDCG@10 results broken down by the source datasets (SciFact, NFCorpus, FiQA) versus the held-out datasets (ArguAna, SciDocs) to demonstrate whether gains transfer beyond the mining distribution. These additions will clarify that improvements such as the +19.5% on NFCorpus stem from hybrid disagreement signals rather than dataset-specific interactions with FTS5 tokenization. revision: yes

Circularity Check

0 steps flagged

No significant circularity; all gains measured against external ground truth

Full rationale

The paper's derivation chain generates 76K disagreement triples from base BGE-small RRF runs (vec/fts weightings) on three BEIR datasets and fine-tunes with MultipleNegativesRankingLoss, then reports NDCG@10 lifts on all five BEIR datasets against their independent human relevance judgments. No central quantity (NDCG improvement, adaptive RRF weights, or disagreement rate) is defined in terms of a fitted parameter that is then re-used as evidence; the evaluation metrics and ColBERTv2 comparisons remain external to the training heuristic. No self-citation, ansatz smuggling, or self-definitional reduction appears in the claims or equations. The result is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work rests on standard IR assumptions (RRF effectiveness, BEIR as representative benchmarks) and introduces no new postulated entities. The only free parameters are the two extreme weight pairs (0.95/0.05 and 0.05/0.95) chosen to surface disagreements; these are not fitted to the final metric.

free parameters (1)
  • vector vs FTS weight extremes
    Chosen to detect disagreement rather than optimized against the final NDCG.
axioms (1)
  • domain assumption: Reciprocal Rank Fusion is a reliable way to combine vector and keyword rankings
    Invoked without re-derivation in the hybrid fusion step.

pith-pipeline@v0.9.0 · 5763 in / 1382 out tokens · 57700 ms · 2026-05-10T09:44:24.852203+00:00

discussion (0)


Reference graph

Works this paper leans on

19 extracted references · 9 canonical work pages · 5 internal anchors

  1. Carbonell, J., & Goldstein, J. (1998). The use of MMR, diversity-based reranking for reordering documents and producing summaries. SIGIR.
  2. Chhikara, P., Khant, D., Aryan, S., & Singh, T. (2025). Mem0: Building production-ready AI agents with scalable long-term memory. arXiv:2504.19413.
  3. Cormack, G. V., Clarke, C. L. A., & Büttcher, S. (2009). Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. SIGIR.
  4. Ebbinghaus, H. (1885). Über das Gedächtnis. Duncker & Humblot.
  5. Ma, X., Wang, Y., & Lin, J. (2024). Is reciprocal rank fusion all you need for hybrid retrieval? arXiv preprint.
  6. Alqithami, S. (2025). Forgetful but Faithful: A cognitive memory architecture and benchmark for privacy-aware generative agents. arXiv:2512.12856.
  7. Sarin, S., Singh, L., Sarmah, B., & Mehta, D. (2025). Memoria: A scalable agentic memory framework for personalized conversational AI. arXiv:2512.12686.
  8. Muennighoff, N., Tazi, N., Magne, L., & Reimers, N. (2023). MTEB: Massive Text Embedding Benchmark. EACL.
  9. Packer, C., et al. (2023). MemGPT: Towards LLMs as operating systems. arXiv:2310.08560.
  10. Dury, J. (2026). Predictive Associative Memory: Retrieval beyond similarity through temporal co-occurrence. arXiv:2602.11322.
  11. Santhanam, K., et al. (2022). ColBERTv2: Effective and efficient retrieval via lightweight late interaction. NAACL.
  12. Thakur, N., et al. (2021). BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. NeurIPS Datasets and Benchmarks.
  13. Rasmussen, P., Paliychuk, P., Beauvais, T., & Ryan, J. (2025). Zep: A temporal knowledge graph architecture for agent memory. arXiv:2501.13956.
  14. Xu, W., Liang, Z., Mei, K., & Gao, H. (2025). A-MEM: Agentic memory for LLM agents. arXiv:2502.12110.
  15. Karpukhin, V., et al. (2020). Dense passage retrieval for open-domain question answering. EMNLP.
  16. Xiong, L., et al. (2020). Approximate nearest neighbor negative contrastive learning for dense text retrieval. ICLR 2021.
  17. Zhan, J., et al. (2021). Optimizing dense retrieval model training with hard negatives. SIGIR.
  18. Lee, M., et al. (2024). NV-Retriever: Improving text embedding models with effective hard-negative mining. arXiv:2407.15831.
  19. Chen, J., et al. (2024). BGE-M3: Multi-functionality, multi-linguality, and multi-granularity text embeddings through self-knowledge distillation. arXiv:2402.03216.