pith. machine review for the scientific record.

arxiv: 2604.09982 · v2 · submitted 2026-04-11 · 💻 cs.IR · cs.CL · cs.LG

Recognition: unknown

Reproduction Beyond Benchmarks: ConstBERT and ColBERT-v2 Across Backends and Query Distributions

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:31 UTC · model grok-4.3

classification 💻 cs.IR · cs.CL · cs.LG
keywords multi-vector retrieval · ColBERT-v2 · ConstBERT · MaxSim operator · reproducibility · query length · MS-MARCO · TREC ToT

The pith

ConstBERT reproduces on MS-MARCO but both it and ColBERT-v2 plateau on long queries because the MaxSim operator weights every token equally.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether ConstBERT and ColBERT-v2 remain effective outside the short-query benchmarks they were tuned on. It reports that ConstBERT matches prior numbers within 0.05 percent MRR@10 on MS-MARCO yet both models lose 86 to 97 percent of that score on longer narrative queries. Ablations tie the plateau at roughly 20 words directly to the MaxSim operator's uniform treatment of all tokens, which mixes signal with filler. The study also finds that extra fine-tuning data and hidden backend settings can widen rather than close the gap. A sympathetic reader would therefore conclude that architectural choices in these multi-vector models set hard limits that adaptation alone cannot remove.

Core claim

The central claim is that performance plateaus at approximately 20 words because the MaxSim operator applies the same weight to every token embedding, so additional tokens in long queries add noise rather than information. This architectural limit persists across backends and query distributions, and even fine-tuning with three times the original data fails to remove it, in some cases lowering scores by up to 29 percent.

What carries the argument

The MaxSim operator, which selects the maximum cosine similarity between each query token embedding and any document token embedding and then sums those maxima without further weighting or selection.
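For concreteness, here is a minimal NumPy sketch of that scoring step as described above (our own rendering, not the authors' released code; array names are illustrative):

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late-interaction MaxSim score.

    query_emb: (num_query_tokens, dim) L2-normalized token embeddings
    doc_emb:   (num_doc_tokens, dim)   L2-normalized token embeddings
    """
    # Cosine similarity of every query token against every document token.
    sim = query_emb @ doc_emb.T              # (num_query_tokens, num_doc_tokens)
    # Each query token keeps only its best-matching document token...
    per_token_max = sim.max(axis=1)          # (num_query_tokens,)
    # ...and the maxima are summed with no further weighting, so a filler
    # token contributes on the same footing as a discriminative one.
    return float(per_token_max.sum())
```

The unweighted sum is the point of contention: every extra query token adds a non-negative term, whether or not it carries topical signal.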

Load-bearing premise

The ablations correctly isolate the MaxSim operator's uniform token weighting as the root cause of the performance plateau at 20 words, without confounding effects from query distribution shifts or backend variations.

What would settle it

Measure whether a modified MaxSim that down-weights or drops low-relevance tokens continues to improve beyond 20 words on the same TREC ToT 2025 narrative queries while keeping all other model components fixed.
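One hedged sketch of what such a variant could look like, assuming per-token weights from some external source such as IDF statistics or a learned gate (neither is prescribed by the paper; the names and threshold here are hypothetical):

```python
import numpy as np

def weighted_maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray,
                          token_weights: np.ndarray,
                          keep_threshold: float = 0.0) -> float:
    """Hypothetical MaxSim variant for the proposed experiment.

    token_weights: (num_query_tokens,) per-token relevance weights,
    e.g. IDF scores or learned gates -- how they are obtained is an
    open design choice, not something the paper specifies.
    """
    sim = query_emb @ doc_emb.T
    per_token_max = sim.max(axis=1)
    # Drop tokens judged to be filler; down-weight the rest.
    mask = token_weights > keep_threshold
    return float((per_token_max * token_weights)[mask].sum())
```

If scores under this operator keep improving past 20 words while the uniform version plateaus, the architectural attribution stands; if both plateau, the cause lies elsewhere.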

Original abstract

Reproducibility must validate architectural robustness, not just numerical accuracy. We evaluate ColBERT-v2 and ConstBERT across five dimensions, finding that while ConstBERT reproduces within 0.05% MRR@10 on MS-MARCO, both models show a drop of 86-97% on long, narrative queries (TREC ToT 2025). Ablations prove this failure is architectural: performance plateaus at 20 words because the MaxSim operator's uniform token weighting cannot distinguish signal from filler noise. Furthermore, undocumented backend parameters create an 8-point gap due to ConstBERT's sparse centroid coverage, and fine-tuning with 3x more data actually degrades performance by up to 29%. We conclude that architectural constraints in multi-vector retrieval cannot be overcome by adaptation alone. Code: https://github.com/utshabkg/multi-vector-reproducibility.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript reports a reproducibility study of ConstBERT and ColBERT-v2 across backends and query distributions. ConstBERT reproduces within 0.05% MRR@10 on MS-MARCO, but both models exhibit 86-97% drops on long narrative queries from TREC ToT 2025. Ablations attribute the performance plateau at ~20 words to the MaxSim operator's uniform token weighting, which cannot separate signal from filler noise. Additional results show an 8-point gap from undocumented backend parameters due to sparse centroid coverage and up to 29% degradation from fine-tuning with 3x more data. The central conclusion is that architectural constraints in multi-vector retrieval cannot be overcome by adaptation alone, supported by linked code.

Significance. If the central claims hold after controlling for potential confounds, the work would demonstrate important limits of adaptation in multi-vector retrieval models and the value of evaluating beyond standard benchmarks. The direct empirical measurements, controlled ablations, and provision of reproducible code are strengths that enable verification and extension.

major comments (1)
  1. [Ablation experiments] The ablation experiments (results section) claim that the performance plateau at 20 words is caused by MaxSim's uniform token weighting and cannot be overcome by adaptation. However, the long queries are drawn from TREC ToT 2025 (narrative style) while the MS-MARCO baselines use short queries; the ablations do not hold query distribution fixed while varying only the operator or weighting scheme. This leaves open the possibility that the observed plateau arises from distribution mismatch rather than the architectural feature itself, weakening the load-bearing claim that architectural constraints are the root cause.
minor comments (1)
  1. [Abstract] The abstract states performance drops of 86-97% and mentions ablations but provides no error bars, exact data splits, or pointers to full ablation tables, which would improve verifiability of the reported numbers.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful and detailed review. The feedback highlights an important methodological point about controlling for query distribution in our ablations, which we address directly below.

Point-by-point responses
  1. Referee: The ablation experiments (results section) claim that the performance plateau at 20 words is caused by MaxSim's uniform token weighting and cannot be overcome by adaptation. However, the long queries are drawn from TREC ToT 2025 (narrative style) while the MS-MARCO baselines use short queries; the ablations do not hold query distribution fixed while varying only the operator or weighting scheme. This leaves open the possibility that the observed plateau arises from distribution mismatch rather than the architectural feature itself, weakening the load-bearing claim that architectural constraints are the root cause.

    Authors: We agree that the primary length-based ablation compares across query distributions and does not isolate the MaxSim operator by varying only the weighting scheme while holding distribution fixed. However, within the TREC ToT 2025 long queries we do observe the plateau as token count increases within the same narrative distribution, and the additional results (backend parameter sweeps and 3x-data fine-tuning) show no recovery beyond ~20 words. These findings still support the view that MaxSim's uniform treatment of tokens limits separation of signal from filler. To directly address the concern, we will revise the manuscript with a new controlled ablation that holds the base query distribution fixed (using prefixes of the long queries or synthetic filler appended to short queries; see the sketch below) while measuring the effect of token count. This will be added to the results section with updated figures and discussion, strengthening the architectural claim without altering the core conclusions or code release.

    revision: partial
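A minimal sketch of the two query constructions the rebuttal proposes (filler text, truncation points, and function names are illustrative assumptions, not details from the paper):

```python
def prefix_queries(long_query: str, lengths=(5, 10, 20, 40)) -> list[str]:
    """Truncate one long narrative query to increasing word counts,
    holding the underlying query distribution fixed."""
    words = long_query.split()
    return [" ".join(words[:n]) for n in lengths if n <= len(words)]

def padded_queries(short_query: str,
                   filler: str = "as far as i can remember",
                   repeats=(0, 1, 2, 4)) -> list[str]:
    """Append low-signal filler to a short query so that only token
    count, not topic, varies across the ablation conditions."""
    return [(short_query + " " + " ".join([filler] * r)).strip()
            for r in repeats]
```

Either construction lets token count vary while the base distribution stays fixed, which is exactly the control the referee asked for.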

Circularity Check

0 steps flagged

No circularity: claims rest on direct empirical ablations and measurements

full rationale

The paper's central claims derive from controlled experiments measuring MRR@10 drops on long narrative queries, performance plateaus at 20 words, backend parameter gaps, and fine-tuning degradation. These are direct observations from reproduction runs across MS-MARCO and TREC ToT 2025, with ablations varying operators, data volume, and backends. No equations or predictions reduce by construction to fitted inputs; no self-citations serve as load-bearing uniqueness theorems; no ansatz or renaming of known results is presented as derivation. The chain is self-contained against external benchmarks and falsifiable via the released code.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical reproduction study that relies on standard IR evaluation practices without introducing new theoretical parameters, axioms, or entities.

axioms (1)
  • domain assumption MRR@10 is a sufficient and unbiased metric for comparing retrieval performance across short and long query distributions.
    The paper uses MRR@10 to quantify both reproduction success and the 86-97% drops without discussing alternative metrics or their sensitivity to query length.
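For reference, a standard rendering of the metric this assumption concerns (written from the common definition of MRR@10, not taken from the paper's code):

```python
def mrr_at_10(ranked_relevance: list[list[bool]]) -> float:
    """MRR@10: for each query, the reciprocal rank of the first relevant
    document within the top 10 results, else 0; averaged over queries.
    The metric only sees the top of the ranking, which is one reason it
    can be insensitive to how verbose queries degrade deeper results."""
    total = 0.0
    for rels in ranked_relevance:
        for rank, is_rel in enumerate(rels[:10], start=1):
            if is_rel:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)
```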

pith-pipeline@v0.9.0 · 5459 in / 1463 out tokens · 74217 ms · 2026-05-10T16:31:41.216777+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

17 extracted references · 9 canonical work pages · 2 internal anchors

  1. [1]

    Jaime Arguello, Fernando Diaz, Maik Fröebe, To Eun Kim, and Bhaskar Mitra

  2. [2]

    Overview of the TREC 2025 Tip-of-the-Tongue track. arXiv preprint arXiv:2601.20671 (2026)

  3. [3]

    Maurizio Ferrari Dacrema, Paolo Cremonesi, and Dietmar Jannach. 2019. Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches. In Proceedings of the 13th ACM Conference on Recommender Systems (RecSys ’19). ACM, 101–109. doi:10.1145/3298689.3347058

  4. [4]

    Gregor Donabauer and Udo Kruschwitz. 2025. A Reproducibility Study of Graph-Based Legal Case Retrieval. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. 3135–3144

  5. [5]

    Manish Gupta and Michael Bendersky. 2015. Information Retrieval with Verbose Queries. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (Santiago, Chile) (SIGIR ’15). Association for Computing Machinery, New York, NY, USA, 1121–1124. doi:10.1145/2766462.2767877

  6. [6]

    Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. 2021. Towards Unsupervised Dense Information Retrieval with Contrastive Learning. arXiv preprint arXiv:2112.09118 (2021)

  7. [7]

    Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data 7, 3 (2019), 535–547

  8. [8]

    Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (Virtual Event, China) (SIGIR ’20). Association for Computing Machinery, New York, NY, USA, 39–48. doi:10.1145/3...

  9. [9]

    Sean MacAvaney, Antonio Mallia, and Nicola Tonellotto. 2025. Efficient Constant-Space Multi-vector Retrieval. In European Conference on Information Retrieval. Springer, 237–245

  10. [10]

    Dipannita Podder, Jiaul Paik, and Pabitra Mitra. 2025. Learning Query Token Importance for Effective Document Retrieval with Verbose Queries. In Proceedings of the 16th Annual Meeting of the Forum for Information Retrieval Evaluation (FIRE ’24). Association for Computing Machinery, New York, NY, USA, 67–75. doi:10.1145/3734947.3734954

  11. [11]

    Hongjin Qian and Zhicheng Dou. 2022. Explicit Query Rewriting for Conversational Dense Retrieval. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 4725–4737. doi:10.18653/v1/2022.emn...

  12. [12]

    Ning Sa and Xiaojun (Jenny) Yuan. 2021. Examining User Perception and Usage of Voice Search. Data and Information Management 5, 1 (2021), 40–47. doi:10.2478/dim-2020-0046

  13. [13]

    Keshav Santhanam, Omar Khattab, Christopher Potts, and Matei Zaharia. 2022. PLAID: an efficient engine for late interaction retrieval. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 1747–1756

  14. [14]

    Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. 2022. ColBERTv2: Effective and efficient retrieval via lightweight late interaction. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 3715–3734

  15. [15]

    Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. BEIR: A heterogenous benchmark for zero-shot evaluation of information retrieval models. arXiv preprint arXiv:2104.08663 (2021)

  16. [16]

    Zheng Yao, Shuai Wang, and Guido Zuccon. 2025. Pre-training vs. Fine-tuning: A Reproducibility Study on Dense Retrieval Knowledge Acquisition. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. 3276–3285

  17. [17]

    HongChien Yu, Chenyan Xiong, and Jamie Callan. 2021. Improving Query Representations for Dense Retrieval with Pseudo Relevance Feedback. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management (Virtual Event, Queensland, Australia) (CIKM ’21). Association for Computing Machinery, New York, NY, USA, 3592–3596. doi:10.1...