Reliable Evaluation Protocol for Low-Precision Retrieval
Pith reviewed 2026-05-19 00:51 UTC · model grok-4.3
The pith
Upcasting the final scoring step to higher precision resolves spurious ties in low-precision retrieval evaluations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Spurious ties from low-precision scoring introduce high variability in retrieval evaluations. The paper addresses this with a two-part protocol: High-Precision Scoring (HPS), which upcasts the final scoring step to higher precision to resolve tied candidates at minimal computational cost, and Tie-aware Retrieval Metrics (TRM), which report expected scores, range, and bias to quantify order uncertainty.
What carries the argument
High-Precision Scoring (HPS), which performs the final relevance score computation in higher precision while keeping earlier steps low-precision, combined with Tie-aware Retrieval Metrics (TRM) that account for ties by computing expected metric values over possible orderings.
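To make the tie mechanism concrete, here is a minimal sketch of how two distinct documents can collapse to the same low-precision score and separate again when the final score is kept in higher precision. It is an illustration under stated assumptions, not the paper's implementation: the helper names (`to_fp16`, `score_low_precision`, `score_hps`) are hypothetical, low-precision scoring is modeled simply by rounding the final dot product to FP16, and Python's double-precision floats stand in for the higher-precision path (the paper upcasts to FP32).

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision (FP16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def dot(query, doc):
    # Python floats are double precision; they stand in for the higher-precision path.
    return sum(q * d for q, d in zip(query, doc))

def score_low_precision(query, doc):
    # Assumption: low-precision scoring is modeled by rounding the final score to FP16.
    return to_fp16(dot(query, doc))

def score_hps(query, doc):
    # HPS-style scoring: same FP16-representable embeddings, final score kept in high precision.
    return dot(query, doc)

# Every component below is exactly representable in FP16.
query = [1.0, 1.0]
doc_a = [1.0, 0.25]
doc_b = [1.0, 0.25 + 2 ** -12]  # one FP16 step above 0.25

low = [score_low_precision(query, d) for d in (doc_a, doc_b)]
high = [score_hps(query, d) for d in (doc_a, doc_b)]
print(low)   # [1.25, 1.25]            -> spurious tie
print(high)  # [1.25, 1.250244140625]  -> tie resolved
```

The embeddings are untouched in both paths; only the precision of the final score differs, which is why the two documents are distinguishable again without recomputing anything upstream.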
If this is right
- HPS dramatically reduces tie-induced instability in evaluation results.
- TRM accurately recovers the expected values of standard retrieval metrics.
- The combination provides a consistent and reliable evaluation system for lower-precision retrievals.
- Minimal computational cost is incurred since only the final scoring step is upcast.
Where Pith is reading between the lines
- Similar tie issues might arise in other efficiency techniques like quantization, suggesting the protocol could generalize.
- Adopting this could change how benchmarks for efficient retrieval are conducted, prioritizing stability over raw speed in evaluation.
- Future work might explore integrating TRM directly into optimization loops for low-precision models.
Load-bearing premise
The primary source of evaluation instability in low-precision retrieval is spurious ties from reduced granularity, and upcasting only the final scoring step resolves them without introducing new biases or losing efficiency gains.
What would settle it
Running multiple tie-resolution variants on the same low-precision model and finding that metric variance remains high even after applying HPS, or that TRM expected values deviate significantly from full high-precision baselines.
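One way such a check could be set up is sketched below: rank the same scored candidates under several random tie-breaking seeds and measure the spread of a metric, once with tied low-precision scores and once with scores where the tie has been resolved. This is a hypothetical harness for illustration only; the function names, the choice of reciprocal rank as the metric, and the toy scores are all assumptions rather than the paper's experimental code.

```python
import random
from statistics import pvariance

def reciprocal_rank(ranking, relevant):
    # 1 / position of the first relevant document, or 0.0 if none is retrieved.
    for position, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / position
    return 0.0

def rank_with_random_ties(scored, rng):
    # Sort by descending score; a random secondary key breaks ties arbitrarily.
    keyed = [(-score, rng.random(), doc) for doc, score in scored]
    return [doc for _, _, doc in sorted(keyed)]

def metric_variance(scored, relevant, seeds):
    # Population variance of the metric across tie-resolution variants.
    values = [
        reciprocal_rank(rank_with_random_ties(scored, random.Random(seed)), relevant)
        for seed in seeds
    ]
    return pvariance(values)

relevant = {"b"}
tied = [("a", 1.25), ("b", 1.25), ("c", 1.0)]                # low-precision scores
resolved = [("a", 1.25), ("b", 1.250244140625), ("c", 1.0)]  # after upcasting

var_tied = metric_variance(tied, relevant, range(20))
var_resolved = metric_variance(resolved, relevant, range(20))
print(var_tied, var_resolved)
```

If the variance stayed high on the resolved scores, ties would not be the dominant source of instability, which is the outcome that would count against the premise.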
Original abstract
Lowering the numerical precision of model parameters and computations is widely adopted to improve the efficiency of retrieval systems. However, when computing relevance scores between the query and documents in low-precision, we observe spurious ties due to the reduced granularity. This introduces high variability in the results based on tie resolution, making the evaluation less reliable. To address this, we propose a more robust retrieval evaluation protocol designed to reduce score variation. It consists of: (1) High-Precision Scoring (HPS), which upcasts the final scoring step to higher precision to resolve tied candidates with minimal computational cost; and (2) Tie-aware Retrieval Metrics (TRM), which report expected scores, range, and bias to quantify order uncertainty of tied candidates. Our experiments test multiple models with three scoring functions on two retrieval datasets to demonstrate that HPS dramatically reduces tie-induced instability, and TRM accurately recovers expected metric values. This combination enables a more consistent and reliable evaluation system for lower-precision retrievals.
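The tie-aware quantities named in the abstract can be illustrated with a brute-force sketch that enumerates every ordering inside each tie group. This is only an illustration for small inputs: the function names are hypothetical, reciprocal rank is an arbitrary choice of metric, and "bias" is assumed here to mean the tie-oblivious score (index order inside ties) minus the expected score, which this page does not define explicitly. The paper may compute these values differently and more efficiently.

```python
from itertools import groupby, permutations, product

def reciprocal_rank(ranking, relevant):
    # 1 / position of the first relevant document, or 0.0 if none is retrieved.
    for position, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / position
    return 0.0

def tie_aware(scored, relevant, metric=reciprocal_rank):
    """scored: list of (doc_id, score) pairs in the framework's index order."""
    # A stable sort by descending score keeps index order inside ties.
    ordered = sorted(scored, key=lambda pair: -pair[1])
    groups = [
        [doc for doc, _ in group]
        for _, group in groupby(ordered, key=lambda pair: pair[1])
    ]
    # Enumerate every ranking consistent with the scores (exponential; toy sizes only).
    values = []
    for combo in product(*(permutations(group) for group in groups)):
        ranking = [doc for group in combo for doc in group]
        values.append(metric(ranking, relevant))
    expected = sum(values) / len(values)
    observed = metric([doc for doc, _ in ordered], relevant)  # tie-oblivious score
    return {
        "expected": expected,
        "min": min(values),
        "max": max(values),
        "bias": observed - expected,  # assumed definition, see lead-in
    }

report = tie_aware([("a", 1.25), ("b", 1.25), ("c", 1.0)], relevant={"b"})
print(report)  # {'expected': 0.75, 'min': 0.5, 'max': 1.0, 'bias': -0.25}
```

Here the relevant document "b" is tied with "a", so it lands at rank 1 or rank 2 depending on tie resolution; the report makes that uncertainty explicit instead of silently reporting 0.5.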
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript identifies spurious ties arising from reduced numerical granularity in low-precision scoring as a source of high variability in retrieval evaluation metrics. It proposes High-Precision Scoring (HPS), which upcasts only the final dot-product or similarity computation to full precision to resolve ties at low cost, together with Tie-aware Retrieval Metrics (TRM) that report expected metric values, ranges, and bias to quantify uncertainty from ties. Experiments on multiple models and scoring functions across two standard retrieval datasets are presented to show that HPS reduces tie-induced instability while TRM recovers expected metric values.
Significance. If the central claims hold, the protocol supplies a lightweight, practical remedy for an observable problem in quantized retrieval systems. The minimal-overhead nature of HPS combined with explicit uncertainty reporting in TRM could improve the reproducibility and comparability of low-precision retrieval results, which is increasingly relevant as efficiency-driven quantization becomes standard in deployed IR systems.
major comments (2)
- [Experiments] Experiments section: the description of the low-precision simulation is insufficiently detailed. It is not stated whether low-precision is applied only post-hoc to final scores computed in full precision or whether the full pipeline (indexing, ANN search, and candidate generation) is executed under reduced precision. This distinction is load-bearing for the central claim, because quantization during candidate selection can alter the initial pool or produce non-tie ordering errors that final upcasting cannot retroactively correct.
- [§3] §3 (Method), HPS definition: the claim that upcasting the final scoring step resolves ties without introducing new biases or negating efficiency gains rests on the assumption that high-precision tie-breaking aligns with the intent of the low-precision model. No analysis or ablation is provided to test whether this alignment holds when the underlying embeddings or similarity function have already been quantized.
minor comments (2)
- [Abstract] Abstract: the statement that HPS 'dramatically reduces tie-induced instability' would be strengthened by including a quantitative summary (e.g., variance reduction factor) rather than a qualitative descriptor.
- [§3] Notation: TRM is described as reporting 'expected scores, range, and bias,' but the precise formulas for these quantities (especially how ties are sampled or averaged) are not introduced until later; a brief inline definition or reference to the relevant equation would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, clarifying the experimental setup and strengthening the justification for HPS. Revisions will be incorporated to improve precision and scope.
Point-by-point responses
- Referee: [Experiments] Experiments section: the description of the low-precision simulation is insufficiently detailed. It is not stated whether low-precision is applied only post-hoc to final scores computed in full precision or whether the full pipeline (indexing, ANN search, and candidate generation) is executed under reduced precision. This distinction is load-bearing for the central claim, because quantization during candidate selection can alter the initial pool or produce non-tie ordering errors that final upcasting cannot retroactively correct.
  Authors: We thank the referee for this observation. Our low-precision simulation applies reduced precision exclusively to the final similarity computation (dot-product or scoring function) after full-precision embeddings, indexing, and ANN search have produced the candidate pool. This isolates the tie-induced instability that arises specifically from scoring granularity, which is the core focus of the proposed evaluation protocol. We agree that end-to-end quantization of the retrieval pipeline could introduce additional ordering errors upstream of scoring. We will revise the Experiments section to state this scope explicitly and add a brief discussion of the distinction, including a note that HPS targets the scoring stage without retroactive correction of candidate selection.
  Revision: yes
- Referee: [§3] §3 (Method), HPS definition: the claim that upcasting the final scoring step resolves ties without introducing new biases or negating efficiency gains rests on the assumption that high-precision tie-breaking aligns with the intent of the low-precision model. No analysis or ablation is provided to test whether this alignment holds when the underlying embeddings or similarity function have already been quantized.
  Authors: We acknowledge the value of explicit validation for this assumption. HPS performs the final similarity computation in higher precision on the identical embeddings and model parameters used in the low-precision run; it therefore corrects only the numerical rounding that produces spurious ties rather than altering the underlying representation. This preserves the efficiency advantage because only the final scoring operation is upcast. While the manuscript does not contain a dedicated ablation isolating alignment under fully quantized embeddings, the reported experiments across multiple models and scoring functions show a consistent reduction in metric variance without performance degradation. We will expand the discussion in §3 to articulate this rationale more clearly and note the assumption's grounding in the design of HPS.
  Revision: partial
Circularity Check
No significant circularity; protocol motivated by observation and validated externally
The paper identifies spurious ties from reduced granularity in low-precision scoring as an observed practical issue and proposes HPS (upcasting only the final scoring step) plus TRM (expected scores and ranges for ties) as direct remedies. No derivation chain, equations, or claims reduce by construction to self-definition, fitted parameters renamed as predictions, or load-bearing self-citations. Central claims rest on experiments across models, scoring functions, and standard external datasets, rendering the evaluation protocol self-contained without circular reductions to its own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Low-precision computations produce spurious ties in relevance scores that dominate evaluation variability.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Paper passage: High-Precision Scoring (HPS), which upcasts the final scoring step to higher precision to resolve tied candidates with minimal computational cost
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, absolute_floor_iff_bare_distinguishability
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Paper passage: Tie-aware Retrieval Metrics (TRM), which report expected scores, range, and bias to quantify order uncertainty of tied candidates
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
[13]
HPS is im- plemented by upcasting the final scoring operation to FP32
All models are run under three data types:BF16, FP16, andFP32. HPS is im- plemented by upcasting the final scoring operation to FP32. Baseline tie-oblivious scores rely on the framework’s predefined index order inside ties. In contrast, tie-aware expectations and extrema are computed with TRM (Section 2.3). E.2 Datasets MIRACLReranking We adopt the Englis...
work page 2023