pith. machine review for the scientific record.

arxiv: 2604.09982 · v2 · submitted 2026-04-11 · 💻 cs.IR · cs.CL · cs.LG

Recognition: unknown

Reproduction Beyond Benchmarks: ConstBERT and ColBERT-v2 Across Backends and Query Distributions

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:31 UTC · model grok-4.3

classification 💻 cs.IR · cs.CL · cs.LG
keywords multi-vector retrieval · ColBERT-v2 · ConstBERT · MaxSim operator · reproducibility · query length · MS-MARCO · TREC ToT

The pith

ConstBERT reproduces on MS-MARCO but both it and ColBERT-v2 plateau on long queries because the MaxSim operator weights every token equally.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether ConstBERT and ColBERT-v2 remain effective outside the short-query benchmarks they were tuned on. It reports that ConstBERT matches prior numbers within 0.05 percent MRR@10 on MS-MARCO yet both models lose 86 to 97 percent of that score on longer narrative queries. Ablations tie the plateau at roughly 20 words directly to the MaxSim operator's uniform treatment of all tokens, which mixes signal with filler. The study also finds that extra fine-tuning data and hidden backend settings can widen rather than close the gap. A sympathetic reader would therefore conclude that architectural choices in these multi-vector models set hard limits that adaptation alone cannot remove.

Core claim

The central claim is that performance plateaus at approximately 20 words because the MaxSim operator applies the same weight to every token embedding, so additional tokens in long queries add noise rather than information. This architectural limit persists across backends and query distributions, and even fine-tuning with three times the original data fails to remove it, in some cases lowering scores by up to 29 percent.

What carries the argument

The MaxSim operator, which selects the maximum cosine similarity between each query token embedding and any document token embedding and then sums those maxima without further weighting or selection.
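For concreteness, here is a minimal NumPy sketch of that scoring step as described above (our own rendering, not the authors' released code; array names are illustrative):

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late-interaction MaxSim score.

    query_emb: (num_query_tokens, dim) L2-normalized token embeddings
    doc_emb:   (num_doc_tokens, dim)   L2-normalized token embeddings
    """
    # Cosine similarity of every query token against every document token.
    sim = query_emb @ doc_emb.T              # (num_query_tokens, num_doc_tokens)
    # Each query token keeps only its best-matching document token...
    per_token_max = sim.max(axis=1)          # (num_query_tokens,)
    # ...and the maxima are summed with no further weighting, so a filler
    # token contributes on the same footing as a discriminative one.
    return float(per_token_max.sum())
```

The unweighted sum is the point of contention: every extra query token adds a non-negative term, whether or not it carries topical signal.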

Load-bearing premise

The ablations correctly isolate the MaxSim operator's uniform token weighting as the root cause of the performance plateau at 20 words, without confounding effects from query distribution shifts or backend variations.

What would settle it

Measure whether a modified MaxSim that down-weights or drops low-relevance tokens continues to improve beyond 20 words on the same TREC ToT 2025 narrative queries while keeping all other model components fixed.
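One hedged sketch of what such a variant could look like, assuming per-token weights from some external source such as IDF statistics or a learned gate (neither is prescribed by the paper; the names and threshold here are hypothetical):

```python
import numpy as np

def weighted_maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray,
                          token_weights: np.ndarray,
                          keep_threshold: float = 0.0) -> float:
    """Hypothetical MaxSim variant for the proposed experiment.

    token_weights: (num_query_tokens,) per-token relevance weights,
    e.g. IDF scores or learned gates -- how they are obtained is an
    open design choice, not something the paper specifies.
    """
    sim = query_emb @ doc_emb.T
    per_token_max = sim.max(axis=1)
    # Drop tokens judged to be filler; down-weight the rest.
    mask = token_weights > keep_threshold
    return float((per_token_max * token_weights)[mask].sum())
```

If scores under this operator keep improving past 20 words while the uniform version plateaus, the architectural attribution stands; if both plateau, the cause lies elsewhere.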

Original abstract

Reproducibility must validate architectural robustness, not just numerical accuracy. We evaluate ColBERT-v2 and ConstBERT across five dimensions, finding that while ConstBERT reproduces within 0.05% MRR@10 on MS-MARCO, both models show a drop of 86-97% on long, narrative queries (TREC ToT 2025). Ablations prove this failure is architectural: performance plateaus at 20 words because the MaxSim operator's uniform token weighting cannot distinguish signal from filler noise. Furthermore, undocumented backend parameters create an 8-point gap due to ConstBERT's sparse centroid coverage, and fine-tuning with 3x more data actually degrades performance by up to 29%. We conclude that architectural constraints in multi-vector retrieval cannot be overcome by adaptation alone. Code: https://github.com/utshabkg/multi-vector-reproducibility.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript reports a reproducibility study of ConstBERT and ColBERT-v2 across backends and query distributions. ConstBERT reproduces within 0.05% MRR@10 on MS-MARCO, but both models exhibit 86-97% drops on long narrative queries from TREC ToT 2025. Ablations attribute the performance plateau at ~20 words to the MaxSim operator's uniform token weighting, which cannot separate signal from filler noise. Additional results show an 8-point gap from undocumented backend parameters due to sparse centroid coverage and up to 29% degradation from fine-tuning with 3x more data. The central conclusion is that architectural constraints in multi-vector retrieval cannot be overcome by adaptation alone, supported by linked code.

Significance. If the central claims hold after controlling for potential confounds, the work would demonstrate important limits of adaptation in multi-vector retrieval models and the value of evaluating beyond standard benchmarks. The direct empirical measurements, controlled ablations, and provision of reproducible code are strengths that enable verification and extension.

major comments (1)
  1. [Ablation experiments] The ablation experiments (results section) claim that the performance plateau at 20 words is caused by MaxSim's uniform token weighting and cannot be overcome by adaptation. However, the long queries are drawn from TREC ToT 2025 (narrative style) while the MS-MARCO baselines use short queries; the ablations do not hold query distribution fixed while varying only the operator or weighting scheme. This leaves open the possibility that the observed plateau arises from distribution mismatch rather than the architectural feature itself, weakening the load-bearing claim that architectural constraints are the root cause.
minor comments (1)
  1. [Abstract] The abstract states performance drops of 86-97% and mentions ablations but provides no error bars, exact data splits, or pointers to full ablation tables, which would improve verifiability of the reported numbers.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful and detailed review. The feedback highlights an important methodological point about controlling for query distribution in our ablations, which we address directly below.

Point-by-point responses
  1. Referee: The ablation experiments (results section) claim that the performance plateau at 20 words is caused by MaxSim's uniform token weighting and cannot be overcome by adaptation. However, the long queries are drawn from TREC ToT 2025 (narrative style) while the MS-MARCO baselines use short queries; the ablations do not hold query distribution fixed while varying only the operator or weighting scheme. This leaves open the possibility that the observed plateau arises from distribution mismatch rather than the architectural feature itself, weakening the load-bearing claim that architectural constraints are the root cause.

    Authors: We agree that the primary length-based ablation compares across query distributions and does not isolate the MaxSim operator by varying only the weighting scheme while holding distribution fixed. However, within the TREC ToT 2025 long queries we do observe the plateau as token count increases within the same narrative distribution, and the additional results (backend parameter sweeps and 3x-data fine-tuning) show no recovery beyond ~20 words. These findings still support the view that MaxSim's uniform treatment of tokens limits separation of signal from filler. To directly address the concern, we will revise the manuscript with a new controlled ablation that holds the base query distribution fixed (using prefixes of the long queries or synthetic filler appended to short queries; see the sketch below) while measuring the effect of token count. This will be added to the results section with updated figures and discussion, strengthening the architectural claim without altering the core conclusions or code release.

    revision: partial
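A minimal sketch of the two query constructions the rebuttal proposes (filler text, truncation points, and function names are illustrative assumptions, not details from the paper):

```python
def prefix_queries(long_query: str, lengths=(5, 10, 20, 40)) -> list[str]:
    """Truncate one long narrative query to increasing word counts,
    holding the underlying query distribution fixed."""
    words = long_query.split()
    return [" ".join(words[:n]) for n in lengths if n <= len(words)]

def padded_queries(short_query: str,
                   filler: str = "as far as i can remember",
                   repeats=(0, 1, 2, 4)) -> list[str]:
    """Append low-signal filler to a short query so that only token
    count, not topic, varies across the ablation conditions."""
    return [(short_query + " " + " ".join([filler] * r)).strip()
            for r in repeats]
```

Either construction lets token count vary while the base distribution stays fixed, which is exactly the control the referee asked for.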

Circularity Check

0 steps flagged

No circularity: claims rest on direct empirical ablations and measurements

full rationale

The paper's central claims derive from controlled experiments measuring MRR@10 drops on long narrative queries, performance plateaus at 20 words, backend parameter gaps, and fine-tuning degradation. These are direct observations from reproduction runs across MS-MARCO and TREC ToT 2025, with ablations varying operators, data volume, and backends. No equations or predictions reduce by construction to fitted inputs; no self-citations serve as load-bearing uniqueness theorems; no ansatz or renaming of known results is presented as derivation. The chain is self-contained against external benchmarks and falsifiable via the released code.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical reproduction study that relies on standard IR evaluation practices without introducing new theoretical parameters, axioms, or entities.

axioms (1)
  • domain assumption MRR@10 is a sufficient and unbiased metric for comparing retrieval performance across short and long query distributions.
    The paper uses MRR@10 to quantify both reproduction success and the 86-97% drops without discussing alternative metrics or their sensitivity to query length.
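For reference, a standard rendering of the metric this assumption concerns (written from the common definition of MRR@10, not taken from the paper's code):

```python
def mrr_at_10(ranked_relevance: list[list[bool]]) -> float:
    """MRR@10: for each query, the reciprocal rank of the first relevant
    document within the top 10 results, else 0; averaged over queries.
    The metric only sees the top of the ranking, which is one reason it
    can be insensitive to how verbose queries degrade deeper results."""
    total = 0.0
    for rels in ranked_relevance:
        for rank, is_rel in enumerate(rels[:10], start=1):
            if is_rel:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)
```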

pith-pipeline@v0.9.0 · 5459 in / 1463 out tokens · 74217 ms · 2026-05-10T16:31:41.216777+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

17 extracted references · 9 canonical work pages · 2 internal anchors

  1. [1]

    Jaime Arguello, Fernando Diaz, Maik Fröebe, To Eun Kim, and Bhaskar Mitra

  2. [2]

    Overview of the TREC 2025 Tip-of-the-Tongue track. arXiv preprint arXiv:2601.20671 (2026)

  3. [3]

    Maurizio Ferrari Dacrema, Paolo Cremonesi, and Dietmar Jannach. 2019. Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches. In Proceedings of the 13th ACM Conference on Recommender Systems (RecSys ’19). ACM, 101–109. doi:10.1145/3298689.3347058

  4. [4]

    Gregor Donabauer and Udo Kruschwitz. 2025. A Reproducibility Study of Graph-Based Legal Case Retrieval. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. 3135–3144

  5. [5]

    Manish Gupta and Michael Bendersky. 2015. Information Retrieval with Verbose Queries. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (Santiago, Chile) (SIGIR ’15). Association for Computing Machinery, New York, NY, USA, 1121–1124. doi:10.1145/2766462.2767877

  6. [6]

    Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. 2021. Towards Unsupervised Dense Information Retrieval with Contrastive Learning. arXiv preprint arXiv:2112.09118 (2021)

  7. [7]

    Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data 7, 3 (2019), 535–547

  8. [8]

    Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (Virtual Event, China) (SIGIR ’20). Association for Computing Machinery, New York, NY, USA, 39–48. doi:10.1145/3...

  9. [9]

    Sean MacAvaney, Antonio Mallia, and Nicola Tonellotto. 2025. Efficient Constant-Space Multi-vector Retrieval. In European Conference on Information Retrieval. Springer, 237–245

  10. [10]

    Dipannita Podder, Jiaul Paik, and Pabitra Mitra. 2025. Learning Query Token Importance for Effective Document Retrieval with Verbose Queries. In Proceedings of the 16th Annual Meeting of the Forum for Information Retrieval Evaluation (FIRE ’24). Association for Computing Machinery, New York, NY, USA, 67–75. doi:10.1145/3734947.3734954

  11. [11]

    Hongjin Qian and Zhicheng Dou. 2022. Explicit Query Rewriting for Conversational Dense Retrieval. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 4725–4737. doi:10.18653/v1/2022.emn...

  12. [12]

    Ning Sa and Xiaojun (Jenny) Yuan. 2021. Examining User Perception and Usage of Voice Search. Data and Information Management 5, 1 (2021), 40–47. doi:10.2478/dim-2020-0046

  13. [13]

    Keshav Santhanam, Omar Khattab, Christopher Potts, and Matei Zaharia. 2022. PLAID: an efficient engine for late interaction retrieval. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 1747–1756

  14. [14]

    Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. 2022. ColBERTv2: Effective and efficient retrieval via lightweight late interaction. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 3715–3734

  15. [15]

    Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. BEIR: A heterogenous benchmark for zero-shot evaluation of information retrieval models. arXiv preprint arXiv:2104.08663 (2021)

  16. [16]

    Zheng Yao, Shuai Wang, and Guido Zuccon. 2025. Pre-training vs. Fine-tuning: A Reproducibility Study on Dense Retrieval Knowledge Acquisition. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. 3276–3285

  17. [17]

    HongChien Yu, Chenyan Xiong, and Jamie Callan. 2021. Improving Query Representations for Dense Retrieval with Pseudo Relevance Feedback. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management (Virtual Event, Queensland, Australia) (CIKM ’21). Association for Computing Machinery, New York, NY, USA, 3592–3596. doi:10.1...