
arxiv: 2508.03306 · v4 · submitted 2025-08-05 · 💻 cs.IR · cs.AI · cs.CL

Reliable Evaluation Protocol for Low-Precision Retrieval

Pith reviewed 2026-05-19 00:51 UTC · model grok-4.3

keywords: low-precision retrieval, evaluation protocol, spurious ties, high-precision scoring, tie-aware metrics, information retrieval, relevance scoring

The pith

Upcasting the final scoring step to higher precision resolves spurious ties in low-precision retrieval evaluations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Lowering numerical precision in retrieval models improves efficiency but creates spurious ties in relevance scores because of reduced granularity. These ties cause high variability in evaluation results depending on how they are resolved, undermining reliability. The paper introduces High-Precision Scoring that upcasts only the final scoring computation to fix ties at low extra cost, paired with Tie-aware Retrieval Metrics that report expected values, ranges, and biases for uncertainty. Experiments across models, scoring functions, and datasets show this reduces instability and recovers accurate metrics. If correct, this allows trustworthy benchmarking of efficient low-precision systems without sacrificing their speed advantages.

Core claim

The central claim is that spurious ties from low-precision scoring introduce high variability in retrieval evaluations, and this can be addressed by a protocol consisting of High-Precision Scoring (HPS), which upcasts the final scoring step to higher precision to resolve tied candidates with minimal computational cost, and Tie-aware Retrieval Metrics (TRM), which report expected scores, range, and bias to quantify order uncertainty.

What carries the argument

High-Precision Scoring (HPS), which performs the final relevance score computation in higher precision while keeping earlier steps low-precision, combined with Tie-aware Retrieval Metrics (TRM) that account for ties by computing expected metric values over possible orderings.
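As a rough illustration of the HPS idea, the sketch below scores the same FP16-representable embeddings twice: once with every multiply and add rounded to FP16, and once with the final dot product carried out in Python's native FP64. This is an assumption-laden toy: the vectors are hypothetical, FP64 stands in for the FP32 upcast that the paper reports using, and `fp16`, `dot_low`, and `dot_high` are illustrative names rather than anything from the paper's code.

```python
import struct

def fp16(x: float) -> float:
    """Round to the nearest FP16 value (illustrative helper)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def dot_low(q, d):
    """Dot product with every product and partial sum rounded to FP16."""
    acc = 0.0
    for a, b in zip(q, d):
        acc = fp16(acc + fp16(a * b))
    return acc

def dot_high(q, d):
    """Same inputs, but the final scoring step runs in higher precision (FP64 here)."""
    return sum(a * b for a, b in zip(q, d))

# Hypothetical embeddings; every component is exactly representable in FP16.
query = [1.0, 2.0 ** -10]
docs = [[0.75, 0.1875], [0.75, 0.125], [0.75, 0.0625]]

low_scores = [dot_low(query, d) for d in docs]
high_scores = [dot_high(query, d) for d in docs]
print(low_scores)   # all three documents receive the same score
print(high_scores)  # the three documents are separated
```

The inputs are identical in both runs, so the only thing that changes is the rounding in the last step, which is the effect HPS is meant to remove.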

If this is right

  • HPS dramatically reduces tie-induced instability in evaluation results.
  • TRM accurately recovers the expected values of standard retrieval metrics.
  • The combination provides a consistent and reliable evaluation system for lower-precision retrievals.
  • Minimal computational cost is incurred since only the final scoring step is upcast.
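To make the TRM quantities concrete, here is a brute-force sketch that enumerates every ranking consistent with the scores (permuting items only inside tie groups) and reports the expected value, range, and bias of nDCG@k. Several things here are assumptions rather than the paper's definitions: bias is taken as the tie-oblivious score minus the expectation, all tie orderings are treated as equally likely, and `tie_aware_ndcg` is a hypothetical name. Enumeration is factorial in the tie-group size, so this is only suitable for small examples.

```python
import itertools
import math
from statistics import mean

def dcg_at_k(rels, k):
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg_at_k(rels, k):
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

def tie_aware_ndcg(scores, rels, k):
    # Tie-oblivious ranking: a stable sort keeps the input order inside ties.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    oblivious = ndcg_at_k([rels[i] for i in order], k)

    # Split the ranked list into groups of equal score.
    groups = [list(g) for _, g in itertools.groupby(order, key=lambda i: scores[i])]

    # Evaluate the metric on every ordering that permutes items within groups.
    values = []
    for combo in itertools.product(*(itertools.permutations(g) for g in groups)):
        ranking = [rels[i] for group in combo for i in group]
        values.append(ndcg_at_k(ranking, k))

    expected = mean(values)
    return {
        "oblivious": oblivious,
        "expected": expected,
        "min": min(values),
        "max": max(values),
        "bias": oblivious - expected,  # sign convention is an assumption
    }

# Hypothetical query: three candidates tie, and the relevant one happens to be listed first.
report = tie_aware_ndcg(scores=[0.75, 0.75, 0.75, 0.5], rels=[1, 0, 0, 0], k=1)
print(report)
```

In this toy case the tie-oblivious nDCG@1 is 1.0, while the expectation over tie orderings is 1/3 with a range of 0 to 1, which is the kind of gap the expected value, range, and bias are meant to expose.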

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar tie issues might arise in other efficiency techniques like quantization, suggesting the protocol could generalize.
  • Adopting this could change how benchmarks for efficient retrieval are conducted, prioritizing stability over raw speed in evaluation.
  • Future work might explore integrating TRM directly into optimization loops for low-precision models.

Load-bearing premise

The primary source of evaluation instability in low-precision retrieval is spurious ties from reduced granularity, and upcasting only the final scoring step resolves them without introducing new biases or losing efficiency gains.

What would settle it

Running multiple tie-resolution variants on the same low-precision model and finding that metric variance remains high even after applying HPS, or that TRM expected values deviate significantly from full high-precision baselines.
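A check of this kind could be sketched as follows. The example resolves ties in random order many times and measures the spread of reciprocal rank, first on tied low-precision scores and then on separated high-precision scores. The scores, labels, and the helper names `reciprocal_rank` and `spread` are all hypothetical; a real test would use actual model outputs and the paper's metrics.

```python
import random
from statistics import pstdev

def reciprocal_rank(scores, rels, rng):
    """Rank by score with random tie resolution; return 1/rank of the first relevant item."""
    order = list(range(len(scores)))
    rng.shuffle(order)                    # random order inside ties
    order.sort(key=lambda i: -scores[i])  # stable sort preserves the shuffled tie order
    for rank, i in enumerate(order, start=1):
        if rels[i]:
            return 1.0 / rank
    return 0.0

def spread(scores, rels, trials=200, seed=0):
    """Standard deviation of the metric across repeated tie resolutions."""
    rng = random.Random(seed)
    return pstdev([reciprocal_rank(scores, rels, rng) for _ in range(trials)])

rels = [0, 1, 0, 0]
low_scores = [0.75, 0.75, 0.75, 0.5]            # three-way tie
high_scores = [0.75018, 0.75012, 0.75006, 0.5]  # ties resolved

print(spread(low_scores, rels))   # positive: the metric depends on tie resolution
print(spread(high_scores, rels))  # zero: the ranking is fully determined
```

If the spread on the high-precision scores stayed comparable to the low-precision spread on real data, that would count against the claim.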

Figures

Figures reproduced from arXiv: 2508.03306 by Heuiseok Lim, Hwanseok Jang, Isabelle Augenstein, Kenneth Choi, Kisu Yang, Yoonna Jang.

Figure 1: Example of tie-induced instability in evalua…
Figure 2: Tie-oblivious and expectation scores of nDCG…
Figure 3: Bit layouts of FP16, BF16, and FP32 formats (Wikipedia contributors, 2025). Floating-Point Value: A floating-point value is a way to represent numbers in computer systems and is typically encoded as three fields (sign, exponent, and mantissa, also called the fraction), as illustrated in…
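The link between these bit layouts and score granularity can be worked out directly. The sketch below uses the standard field widths of the three formats (one sign bit each, plus the exponent and mantissa widths listed) and computes the gap between adjacent representable values in [0.5, 1), a range where similarity scores often fall. The function name is illustrative.

```python
# (exponent bits, mantissa bits); every format also has one sign bit.
FORMATS = {"FP16": (5, 10), "BF16": (8, 7), "FP32": (8, 23)}

def spacing_in_half_to_one(mantissa_bits: int) -> float:
    """Gap between adjacent representable values in [0.5, 1)."""
    return 2.0 ** (-1 - mantissa_bits)

for name, (exp_bits, man_bits) in FORMATS.items():
    total_bits = 1 + exp_bits + man_bits
    distinct = 2 ** man_bits  # representable values inside [0.5, 1)
    print(name, total_bits, spacing_in_half_to_one(man_bits), distinct)
```

With only 7 mantissa bits, BF16 can represent just 128 distinct values in [0.5, 1), compared with 1024 for FP16, which is why many candidates can end up sharing a score.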
Figure 4: Metric scores for cutoff 𝑘 of Qwen3-Reranker-0.6B on MIRACLReranking dataset.
Figure 5: Metric scores for cutoff 𝑘 of Qwen3-Reranker-0.6B on AskUbuntuDupQuestions dataset. In this dataset, all tie-oblivious metrics attain their maximum possible value (being overestimated) because, during candidate construction, every relevant item is concatenated ahead of all non-relevant ones. (Footnote 6: https://github.com/embeddings-benchmark/mteb/blob/1.38.38/mteb/evaluation/evaluators/RerankingEvaluator.py#L175)
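The overestimation described in the Figure 5 caption follows from how an order-preserving sort treats ties. The sketch below is a simplified stand-in, not the MTEB evaluator: it assumes tied candidates keep their input order (Python's `sorted` is stable) and uses a hypothetical `precision_at_k` helper to compare two candidate orderings with identical, fully tied scores.

```python
def precision_at_k(scores, rels, k):
    """Precision@k under a stable sort, so tied items keep their input order."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return sum(rels[i] for i in order[:k]) / k

tied_scores = [0.5] * 5          # every candidate receives the same score
relevant_first = [1, 1, 0, 0, 0]  # relevant items concatenated ahead of the rest
relevant_last = [0, 0, 0, 1, 1]   # same items, opposite input order

print(precision_at_k(tied_scores, relevant_first, k=2))
print(precision_at_k(tied_scores, relevant_last, k=2))
```

The model output is the same in both calls, yet the metric swings between its maximum and its minimum purely because of the candidate order.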
Original abstract

Lowering the numerical precision of model parameters and computations is widely adopted to improve the efficiency of retrieval systems. However, when computing relevance scores between the query and documents in low-precision, we observe spurious ties due to the reduced granularity. This introduces high variability in the results based on tie resolution, making the evaluation less reliable. To address this, we propose a more robust retrieval evaluation protocol designed to reduce score variation. It consists of: (1) High-Precision Scoring (HPS), which upcasts the final scoring step to higher precision to resolve tied candidates with minimal computational cost; and (2) Tie-aware Retrieval Metrics (TRM), which report expected scores, range, and bias to quantify order uncertainty of tied candidates. Our experiments test multiple models with three scoring functions on two retrieval datasets to demonstrate that HPS dramatically reduces tie-induced instability, and TRM accurately recovers expected metric values. This combination enables a more consistent and reliable evaluation system for lower-precision retrievals.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript identifies spurious ties arising from reduced numerical granularity in low-precision scoring as a source of high variability in retrieval evaluation metrics. It proposes High-Precision Scoring (HPS), which upcasts only the final dot-product or similarity computation to full precision to resolve ties at low cost, together with Tie-aware Retrieval Metrics (TRM) that report expected metric values, ranges, and bias to quantify uncertainty from ties. Experiments on multiple models and scoring functions across two standard retrieval datasets are presented to show that HPS reduces tie-induced instability while TRM recovers expected metric values.

Significance. If the central claims hold, the protocol supplies a lightweight, practical remedy for an observable problem in quantized retrieval systems. The minimal-overhead nature of HPS combined with explicit uncertainty reporting in TRM could improve the reproducibility and comparability of low-precision retrieval results, which is increasingly relevant as efficiency-driven quantization becomes standard in deployed IR systems.

major comments (2)
  1. [Experiments] Experiments section: the description of the low-precision simulation is insufficiently detailed. It is not stated whether low-precision is applied only post-hoc to final scores computed in full precision or whether the full pipeline (indexing, ANN search, and candidate generation) is executed under reduced precision. This distinction is load-bearing for the central claim, because quantization during candidate selection can alter the initial pool or produce non-tie ordering errors that final upcasting cannot retroactively correct.
  2. [§3] §3 (Method), HPS definition: the claim that upcasting the final scoring step resolves ties without introducing new biases or negating efficiency gains rests on the assumption that high-precision tie-breaking aligns with the intent of the low-precision model. No analysis or ablation is provided to test whether this alignment holds when the underlying embeddings or similarity function have already been quantized.
minor comments (2)
  1. [Abstract] Abstract: the statement that HPS 'dramatically reduces tie-induced instability' would be strengthened by including a quantitative summary (e.g., variance reduction factor) rather than a qualitative descriptor.
  2. [§3] Notation: TRM is described as reporting 'expected scores, range, and bias,' but the precise formulas for these quantities (especially how ties are sampled or averaged) are not introduced until later; a brief inline definition or reference to the relevant equation would improve readability.
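
For readers who want the inline definition that the second minor comment asks for, one plausible formalization is given below. It is an assumption, not the paper's notation: $\Pi$ denotes the set of rankings consistent with the scores (differing only in the order of tied candidates), $M$ is a retrieval metric, $\pi_0$ is the ordering the evaluation framework happens to produce, and all orderings in $\Pi$ are taken as equally likely.

```latex
\begin{align}
\mathbb{E}[M] &= \frac{1}{|\Pi|} \sum_{\pi \in \Pi} M(\pi), \\
\operatorname{Range}(M) &= \Bigl[\, \min_{\pi \in \Pi} M(\pi),\ \max_{\pi \in \Pi} M(\pi) \,\Bigr], \\
\operatorname{Bias}(M) &= M(\pi_0) - \mathbb{E}[M].
\end{align}
```

Under this reading, a positive bias means the tie-oblivious score overestimates the expectation; the sign convention in the paper may differ.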

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, clarifying the experimental setup and strengthening the justification for HPS. Revisions will be incorporated to improve precision and scope.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: the description of the low-precision simulation is insufficiently detailed. It is not stated whether low-precision is applied only post-hoc to final scores computed in full precision or whether the full pipeline (indexing, ANN search, and candidate generation) is executed under reduced precision. This distinction is load-bearing for the central claim, because quantization during candidate selection can alter the initial pool or produce non-tie ordering errors that final upcasting cannot retroactively correct.

    Authors: We thank the referee for this observation. Our low-precision simulation applies reduced precision exclusively to the final similarity computation (dot-product or scoring function) after full-precision embeddings, indexing, and ANN search have produced the candidate pool. This isolates the tie-induced instability that arises specifically from scoring granularity, which is the core focus of the proposed evaluation protocol. We agree that end-to-end quantization of the retrieval pipeline could introduce additional ordering errors upstream of scoring. We will revise the Experiments section to state this scope explicitly and add a brief discussion of the distinction, including a note that HPS targets the scoring stage without retroactive correction of candidate selection. revision: yes

  2. Referee: [§3] §3 (Method), HPS definition: the claim that upcasting the final scoring step resolves ties without introducing new biases or negating efficiency gains rests on the assumption that high-precision tie-breaking aligns with the intent of the low-precision model. No analysis or ablation is provided to test whether this alignment holds when the underlying embeddings or similarity function have already been quantized.

    Authors: We acknowledge the value of explicit validation for this assumption. HPS performs the final similarity computation in higher precision on the identical embeddings and model parameters used in the low-precision run; it therefore corrects only the numerical rounding that produces spurious ties rather than altering the underlying representation. This preserves the efficiency advantage because upcasting occurs only for the small subset of tied candidates. While the manuscript does not contain a dedicated ablation isolating alignment under fully quantized embeddings, the reported experiments across multiple models and scoring functions show consistent reduction in metric variance without performance degradation. We will expand the discussion in §3 to articulate this rationale more clearly and note the assumption's grounding in the design of HPS. revision: partial
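
The rebuttal's remark that upcasting touches only tied candidates could be realized along the following lines. This is one interpretation, sketched under assumptions: low-precision scores decide the order between tie groups, and a caller-supplied `high_score_fn` (hypothetical) recomputes a higher-precision score only for items that share a low-precision score with at least one other item.

```python
from collections import defaultdict

def rescore_ties(low_scores, high_score_fn):
    """Build sort keys (low, high); the high-precision score is computed only inside ties."""
    groups = defaultdict(list)
    for i, s in enumerate(low_scores):
        groups[s].append(i)

    keys = [(s, 0.0) for s in low_scores]
    n_upcast = 0
    for s, members in groups.items():
        if len(members) > 1:  # only tied candidates are rescored
            for i in members:
                keys[i] = (s, high_score_fn(i))
                n_upcast += 1
    return keys, n_upcast

# Hypothetical scores: three candidates tie at 0.75 in low precision.
low = [0.75, 0.75, 0.5, 0.75, 0.25]
high = [0.75006, 0.75018, 0.5, 0.75012, 0.25]

keys, n_upcast = rescore_ties(low, high.__getitem__)
ranking = sorted(range(len(low)), key=lambda i: keys[i], reverse=True)
print(ranking, n_upcast)
```

Only three of the five candidates are rescored here, and the order between tie groups is still set by the low-precision scores.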

Circularity Check

0 steps flagged

No significant circularity; protocol motivated by observation and validated externally


The paper identifies spurious ties from reduced granularity in low-precision scoring as an observed practical issue and proposes HPS (upcasting only the final scoring step) plus TRM (expected scores and ranges for ties) as direct remedies. No derivation chain, equations, or claims reduce by construction to self-definition, fitted parameters renamed as predictions, or load-bearing self-citations. Central claims rest on experiments across models, scoring functions, and standard external datasets, rendering the evaluation protocol self-contained without circular reductions to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; no free parameters, invented entities, or non-standard axioms are visible. The central assumption is treated as a domain observation rather than a derived result.

axioms (1)
  • Domain assumption: low-precision computations produce spurious ties in relevance scores that dominate evaluation variability.
    Stated directly in the abstract as the motivating observation.




Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 5 internal anchors

  1. [1]

    M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

    Bfloat16 processing for neural networks. In 2019 IEEE 26th Symposium on Computer Arithmetic (ARITH), pages 88–91. IEEE. Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024a. Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. Preprint, arXiv:2402.03216. Weijie Che...

  2. [2]

    arXiv preprint arXiv:2412.03223

    Linq-embed-mistral technical report. arXiv preprint arXiv:2412.03223. Zhiwei Hao, Jianyuan Guo, Li Shen, Yong Luo, Han Hu, Guoxia Wang, Dianhai Yu, Yonggang Wen, and Dacheng Tao

  3. [3]

    arXiv preprint arXiv:2505.01043

    Low-precision training of large language models: Methods, challenges, and opportunities. arXiv preprint arXiv:2505.01043. Weiming Hu, Haoyan Zhang, Cong Guo, Yu Feng, Renyang Guan, Zhendong Hua, Zihan Liu, Yue Guan, Minyi Guo, and Jingwen Leng. 2025. M-ant: Efficient low-bit group quantization for llms via mathematically adaptive numerical type. In 2025 IEEE Intern...

  4. [4]

    give me bf16 or give me death

    Cumulated gain-based evaluation of ir techniques. ACM Transactions on Information Systems (TOIS), 20(4):422–446. Eldar Kurtic, Alexandre Marques, Shubhra Pandit, Mark Kurtz, and Dan Alistarh. 2024. "give me bf16 or give me death"? accuracy-performance trade-offs in llm quantization. arXiv preprint arXiv:2411.02355. Jinhyuk Lee, Feiyang Chen, Sahil Dua, Daniel Cer, Madhuri...

  5. [5]

    Gemini Embedding: Generalizable Embeddings from Gemini

    Gemini embedding: Generalizable embeddings from gemini. arXiv preprint arXiv:2503.07891. Tao Lei, Hrishikesh Joshi, Regina Barzilay, Tommi Jaakkola, Kateryna Tymoshenko, Alessandro Moschitti, and Lluís Màrquez

  6. [6]

    In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1279–1289

    Semi-supervised question retrieval with gated convolutions. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1279–1289. Frank McSherry and Marc Najork. 2008. Computing information retrieval performance measures efficiently in the presence of tied scores. In E...

  7. [7]

    A White Paper on Neural Network Quantization

    Mteb: Massive text embedding benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2014–2037. Markus Nagel, Marios Fournarakis, Rana Ali Amjad, Yelysei Bondarenko, Mart van Baalen, and Tijmen Blankevoort. A white paper on neural network quantization. arxiv 2021. arXiv preprint arXiv:21...

  8. [8]

    Modern information retrieval

    Gerard Salton. 1983. Modern information retrieval. (No Title). Chaitanya Sharma

  9. [9]

    arXiv preprint arXiv:2506.00054

    Retrieval-augmented generation: A comprehensive survey of architectures, enhancements, and robustness frontiers. arXiv preprint arXiv:2506.00054. Haihao Shen, Naveen Mellempudi, Xin He, Qun Gao, Chang Wang, and Mengni Wang. 2024. Efficient post-training quantization with fp8 formats. Proceedings of Machine Learning and Systems, 6:483–498. EM VOORHEES

  10. [10]

    Multilingual E5 Text Embeddings: A Technical Report

    Multilingual e5 text embeddings: A technical report. arXiv preprint arXiv:2402.05672. Wikipedia contributors

  11. [11]

    In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 1393–1412

    mgte: Generalized long-context text representation and reranking models for multilingual text retrieval. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 1393–1412. Xinyu Zhang, Nandan Thakur, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Mehdi Rezagholizad...

  12. [12]

    Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

    Qwen3 embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176. Xunyu Zhu, Jian Li, Yong Liu, Can Ma, and Weiping Wang

  13. [13]

    HPS is implemented by upcasting the final scoring operation to FP32

    All models are run under three data types: BF16, FP16, and FP32. HPS is implemented by upcasting the final scoring operation to FP32. Baseline tie-oblivious scores rely on the framework’s predefined index order inside ties. In contrast, tie-aware expectations and extrema are computed with TRM (Section 2.3). E.2 Datasets MIRACLReranking We adopt the Englis...