arxiv: 2603.28886 · v2 · submitted 2026-03-30 · 💻 cs.IR · cs.LG

Calibrated Fusion for Heterogeneous Graph-Vector Retrieval in Multi-Hop QA

Andre Bacellar

Pith reviewed 2026-05-14 01:35 UTC · model grok-4.3

keywords: multi-hop question answering, score calibration, heterogeneous retrieval fusion, graph-vector retrieval, percentile normalization, last-hop retrieval, Personalized PageRank, fusion operators

The pith

Percentile-rank calibration enables stable fusion of vector and graph scores for multi-hop question answering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines the problem of combining dense vector similarity scores with graph-based relevance signals such as Personalized PageRank in multi-hop question answering retrieval. These signals come from different distributions and cannot be added or compared directly without distortion. The proposed PhaseGraph method first converts both score types to a shared percentile-rank scale, then fuses them. On held-out test splits of MuSiQue and 2WikiMultiHopQA, this calibrated fusion raises last-hop retrieval accuracy at the top-5 cutoff by 1.4 and 1.9 points respectively. Ablations indicate that the percentile step is more important for robustness than the exact choice of fusion operator applied afterward.

Core claim

Mapping vector similarity and graph propagation scores to a common percentile scale before fusion improves held-out last-hop retrieval on multi-hop QA benchmarks. LastHop@5 rises from 75.1% to 76.5% on MuSiQue (p=0.039) and from 51.7% to 53.6% on 2WikiMultiHopQA (p=0.023), both measured on independent test splits.

What carries the argument

Percentile-rank normalization (PIT), which converts raw scores into their relative ranks within the candidate pool to produce a unit-free scale that preserves ordering and magnitude differences.
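A minimal sketch of the percentile-rank step may help make this concrete. It assumes ranks are computed within a single query's candidate pool and that ties share a mid-rank; the page does not say whether the paper uses a per-query pool or a separate reference set, so both choices, and the name `percentile_rank`, are illustrative assumptions rather than the author's implementation.

```python
from bisect import bisect_left, bisect_right

def percentile_rank(scores):
    """Map each raw score to its mid-rank percentile within the pool.

    ASSUMPTION: the pool is one query's candidate list and ties share a
    mid-rank. The paper's exact convention is not stated on this page.
    """
    ordered = sorted(scores)
    n = len(ordered)
    out = []
    for s in scores:
        below = bisect_left(ordered, s)           # candidates strictly below s
        ties = bisect_right(ordered, s) - below   # candidates equal to s (incl. itself)
        out.append((below + 0.5 * ties) / n)
    return out

# Toy inputs on very different raw scales (illustrative values only).
vector_scores = [0.82, 0.79, 0.40, 0.35]   # cosine-like similarities
ppr_scores = [1e-4, 3e-2, 5e-3, 5e-3]      # PPR-like probability mass

vec_pct = percentile_rank(vector_scores)
ppr_pct = percentile_rank(ppr_scores)
print(vec_pct)  # [0.875, 0.625, 0.375, 0.125]
print(ppr_pct)  # [0.125, 0.875, 0.5, 0.5]
```

After the transform both signals live on the same (0, 1) scale, so a fusion weight means the same thing for each of them.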

If this is right

Calibrated fusion raises LastHop@5 by 1.4 points on MuSiQue (p=0.039) and 1.9 points on 2WikiMultiHopQA (p=0.023) on held-out splits.
Percentile-based normalization is directionally more robust than min-max scaling on both tuning and test data.
After calibration, Boltzmann weighting yields retrieval results comparable to linear fusion.
Score commensuration itself is a reliable design choice whose benefits appear consistent across the two benchmarks examined.
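The contrast between min-max scaling and the two post-calibration operators can be sketched as follows. The page does not define the paper's Boltzmann weighting, so `boltzmann_fusion` below is one assumed reading (per-candidate softmax weights over the two calibrated signals); `alpha` and `temperature` are hypothetical parameters, and all values are toy data.

```python
import math

def min_max(scores):
    """Rescale to [0, 1] using the pool's own minimum and maximum."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.5] * len(scores)  # degenerate pool: no spread to rescale
    return [(s - lo) / (hi - lo) for s in scores]

def linear_fusion(vec, graph, alpha=0.5):
    """Convex combination of two calibrated scores (alpha is hypothetical)."""
    return [alpha * v + (1 - alpha) * g for v, g in zip(vec, graph)]

def boltzmann_fusion(vec, graph, temperature=0.5):
    """ASSUMED form: per-candidate softmax weights over the two signals."""
    fused = []
    for v, g in zip(vec, graph):
        wv, wg = math.exp(v / temperature), math.exp(g / temperature)
        fused.append((wv * v + wg * g) / (wv + wg))
    return fused

# One dominant PPR-like score squeezes every other min-max value toward 0.
ppr_raw = [0.9, 0.011, 0.010, 0.009]
ppr_minmax = min_max(ppr_raw)

# Mid-rank percentiles of the same four scores stay evenly spaced.
ppr_pct = [0.875, 0.625, 0.375, 0.125]
vec_pct = [0.375, 0.875, 0.625, 0.125]  # toy calibrated vector scores

fused_linear = linear_fusion(vec_pct, ppr_pct)
fused_boltzmann = boltzmann_fusion(vec_pct, ppr_pct)
print(ppr_minmax)
print(fused_linear)
print(fused_boltzmann)
```

Under this assumed form the Boltzmann score always lies between the two inputs and leans toward the larger one, so on calibrated inputs it tends to order candidates much like the linear average does.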

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same percentile alignment step could be tested as a default preprocessing layer in any retrieval pipeline that merges signals from embedding models and graph algorithms.
If the gains remain stable on larger corpora, percentile calibration might reduce the need for per-dataset hyperparameter tuning of fusion weights.
The modest size of the observed improvements suggests that calibration alone may not be sufficient for large leaps in multi-hop performance without complementary advances in the underlying graph or embedding components.

Load-bearing premise

Percentile-rank normalization produces a stable common scale that preserves useful magnitude information across varying score distributions and does not introduce new biases on unseen data.

What would settle it

A new multi-hop QA test collection in which applying percentile calibration before fusion fails to improve or actively lowers last-hop retrieval metrics compared with direct fusion of the raw scores.

Original abstract

Graph-augmented retrieval combines dense similarity with graph-based relevance signals such as Personalized PageRank (PPR), but these scores have different distributions and are not directly comparable. We study this as a score calibration problem for heterogeneous retrieval fusion in multi-hop question answering. Our method, PhaseGraph, maps vector and graph scores to a common unit-free scale using percentile-rank normalization (PIT) before fusion, enabling stable combination without discarding magnitude information. Across MuSiQue and 2WikiMultiHopQA, calibrated fusion improves held-out last-hop retrieval on HippoRAG2-style benchmarks: LastHop@5 increases from 75.1% to 76.5% on MuSiQue (8W/1L, p=0.039) and from 51.7% to 53.6% on 2WikiMultiHopQA (11W/2L, p=0.023), both on independent held-out test splits. A theory-driven ablation shows that percentile-based calibration is directionally more robust than min-max normalization on both tune and test splits (1W/6L, p=0.125), while Boltzmann weighting performs comparably to linear fusion after calibration (0W/3L, p=0.25). These results suggest that score commensuration is a robust design choice, and the exact post-calibration operator appears to matter less on these benchmarks.
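The abstract reports win/loss counts next to each p-value but this page does not name the test. As a sanity check, a two-sided exact sign test (an assumption on our part, not something the source states) reproduces three of the four p-values to the reported precision and gives roughly 0.0225 against the reported 0.023 for the fourth. What the wins and losses count (seeds, question subsets, or configurations) is also not stated here.

```python
from math import comb

def sign_test_two_sided(wins, losses):
    """Exact two-sided sign test p-value under H0: P(win) = 0.5, ties dropped."""
    n = wins + losses
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# (wins, losses, p-value as reported in the abstract)
reported = {
    "MuSiQue LastHop@5": (8, 1, 0.039),
    "2WikiMultiHopQA LastHop@5": (11, 2, 0.023),
    "percentile vs min-max ablation": (1, 6, 0.125),
    "Boltzmann vs linear ablation": (0, 3, 0.25),
}

computed = {name: sign_test_two_sided(w, l) for name, (w, l, _) in reported.items()}
for name, (w, l, p_reported) in reported.items():
    print(f"{name}: {w}W/{l}L computed={computed[name]:.4f} reported={p_reported}")
```

If the sign test is indeed what was used, the p-values depend only on the win/loss counts and carry no information about the size of each win.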

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Small gains from percentile calibration in graph-vector fusion for multi-hop QA retrieval, with the main open question being stability under distribution shift.

The paper's main result is that applying percentile-rank normalization to both vector similarity scores and Personalized PageRank scores before fusing them improves last-hop retrieval accuracy by a small margin on two multi-hop QA benchmarks. Specifically, LastHop@5 goes from 75.1% to 76.5% on MuSiQue and from 51.7% to 53.6% on 2WikiMultiHopQA, with p-values below 0.05 on held-out splits.

What the work does well is treat the fusion problem explicitly as one of score calibration. The ablations indicate that percentile normalization is directionally more robust than min-max scaling on both tuning and test data (p=0.125), and that after calibration the choice between linear and Boltzmann fusion makes little difference. This gives a clean picture in which the calibration step is the useful part. The use of independent held-out test sets and reported p-values adds some credibility to the directional claims.

The soft spots are mostly around scale and robustness. The absolute gains are only 1.4 to 1.9 points, which is narrow even for this subtask. Without error bars or variance estimates on the main metric, it is difficult to gauge how reliable the improvement is in practice. The abstract also skips implementation details such as how the reference sets for the percentiles are constructed and whether they are query-specific. The stress-test concern about distribution shift is relevant here: if the score histograms change between train and test because of different hop structures or graph properties, the percentile mapping could shift and weaken or flip the observed gains. The paper does not appear to include a direct check for that.

Overall this is a targeted methods paper aimed at people already working on graph-augmented dense retrieval for question answering. It provides a practical calibration technique that might be worth testing in similar pipelines, but it does not claim or demonstrate broad applicability beyond these benchmarks.

I would bring this to a reading group to talk through the ablation design and whether the calibration approach generalizes. It deserves serious peer review because the experiments are grounded in standard datasets with statistical reporting, even if the effect sizes are modest and more validation of the normalization's stability would strengthen it.

Referee Report

2 major / 1 minor

Summary. The paper proposes PhaseGraph, which applies percentile-rank normalization (PIT) to calibrate dense vector similarity scores and graph-based Personalized PageRank (PPR) scores onto a common unit-free scale for fusion in multi-hop QA retrieval. It reports modest but statistically significant gains in last-hop retrieval on held-out test splits of MuSiQue (LastHop@5: 75.1% to 76.5%, p=0.039) and 2WikiMultiHopQA (51.7% to 53.6%, p=0.023), along with ablations showing PIT is directionally more robust than min-max normalization and that Boltzmann weighting performs comparably to linear fusion post-calibration.

Significance. If the PIT calibration proves stable, the work addresses a practical challenge in fusing heterogeneous signals in graph-augmented retrieval systems. The use of independent held-out splits and directional ablations provides some empirical grounding, and the finding that the post-calibration operator matters less than the calibration step itself could inform design choices in similar pipelines. However, the small effect sizes (1.4–1.9 points) limit the immediate practical significance without stronger evidence of robustness.

major comments (2)

[Abstract] Abstract: The claim that PIT produces a stable, magnitude-preserving common scale for fusion rests on an untested assumption about reference distribution representativeness. No analysis of score histogram shifts between training and held-out test queries is described, yet such shifts (from differing hop counts, graph densities, or embedding variances) could remap raw scores and alter effective fusion weights. Given the modest gains, this omission is load-bearing for the central robustness claim.
[Results] Results and Experimental Protocol: The reported p-values (e.g., p=0.039 on MuSiQue, p=0.023 on 2WikiMultiHopQA) are presented without error bars on the primary LastHop@5 metric, details on the number of runs, or the exact statistical test procedure. This reduces confidence in the reliability of the significance claims despite the use of held-out splits.
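The histogram-shift concern in the first major comment can be illustrated with a small check. The sketch below assumes percentiles are read off a reference sample (for example, tuning-split scores) and uses a two-sample Kolmogorov-Smirnov distance as the drift measure; both are illustrative choices, the function names are hypothetical, and the numbers are toy values rather than scores from the paper.

```python
from bisect import bisect_right

def ecdf(sample):
    """Return the right-continuous empirical CDF of a sample."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n

def ks_statistic(a, b):
    """Largest vertical gap between the two empirical CDFs."""
    fa, fb = ecdf(a), ecdf(b)
    # Both CDFs are step functions that jump only at sample points,
    # so the maximum gap is attained at one of those points.
    return max(abs(fa(x) - fb(x)) for x in set(a) | set(b))

tune_scores = [0.1, 0.2, 0.3, 0.4]   # toy reference (tuning) scores
test_scores = [0.3, 0.4, 0.5, 0.6]   # toy shifted (test) scores

drift = ks_statistic(tune_scores, test_scores)
pct_under_tune = ecdf(tune_scores)(0.35)
pct_under_test = ecdf(test_scores)(0.35)
print(drift)            # 0.5
print(pct_under_tune)   # 0.75
print(pct_under_test)   # 0.25
```

In this toy case the same raw score of 0.35 lands at the 75th percentile under one reference and the 25th under the other, which is the kind of remapping the comment warns could alter effective fusion weights.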

minor comments (1)

[Abstract] The abstract refers to a 'theory-driven ablation' but provides no explicit theoretical motivation or derivation for why percentile ranks should outperform min-max; this should be expanded in the methods or discussion for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful comments on our manuscript. We address each of the major concerns below and outline the revisions we will make to improve the clarity and robustness of our claims.

Referee: [Abstract] Abstract: The claim that PIT produces a stable, magnitude-preserving common scale for fusion rests on an untested assumption about reference distribution representativeness. No analysis of score histogram shifts between training and held-out test queries is described, yet such shifts (from differing hop counts, graph densities, or embedding variances) could remap raw scores and alter effective fusion weights. Given the modest gains, this omission is load-bearing for the central robustness claim.

Authors: We agree that an explicit analysis of score distribution stability would strengthen the central claim. Although the current manuscript relies on held-out test splits for evaluation, it does not include a direct comparison of score histograms or percentile mappings between training and test queries. In the revised version, we will add this analysis, including plots of cumulative distribution functions for vector similarity and PPR scores on both splits, along with a discussion of any observed shifts and their implications for fusion weights. This will provide empirical support for the representativeness of the reference distribution used in PIT.

revision: yes

Referee: [Results] Results and Experimental Protocol: The reported p-values (e.g., p=0.039 on MuSiQue, p=0.023 on 2WikiMultiHopQA) are presented without error bars on the primary LastHop@5 metric, details on the number of runs, or the exact statistical test procedure. This reduces confidence in the reliability of the significance claims despite the use of held-out splits.

Authors: We agree that the statistical details are insufficiently reported. In the revised manuscript, we will include error bars on the primary LastHop@5 metric, specify the number of runs, and describe the exact statistical test procedure in the experimental protocol section. These changes will address the concern and increase confidence in the significance claims.

revision: yes
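One common way to produce the error bars promised here is a paired bootstrap over questions. The sketch below assumes per-question 0/1 hit indicators for the baseline and the calibrated system on the same questions; the function name, the resample count, and the toy data are all hypothetical, and nothing here is taken from the paper's protocol.

```python
import random

def paired_bootstrap_ci(base_hits, new_hits, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean paired difference in hits."""
    rng = random.Random(seed)
    n = len(base_hits)
    diffs = []
    for _ in range(n_resamples):
        # Resample question indices with replacement, keeping pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(new_hits[i] - base_hits[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((1 - level) / 2 * n_resamples)]
    hi = diffs[int((1 + level) / 2 * n_resamples) - 1]
    return lo, hi

# Toy indicators for 10 questions (1 = gold last-hop passage in the top 5).
base_hits = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
new_hits = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]

observed = sum(n - b for n, b in zip(new_hits, base_hits)) / len(base_hits)
lo, hi = paired_bootstrap_ci(base_hits, new_hits)
print(observed, lo, hi)
```

Resampling question indices rather than the two systems independently keeps the pairing intact, which matters when both systems tend to succeed or fail on the same questions.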

Circularity Check

0 steps flagged

No circularity: purely empirical calibration evaluated on held-out benchmarks

The paper proposes percentile-rank normalization (PIT) as a calibration step for fusing vector and PPR scores, then reports direct empirical gains on independent held-out test splits (LastHop@5 improvements with p-values). No derivation chain, equations, or predictions are present that reduce to fitted parameters, self-definitions, or self-citation load-bearing premises. All claims rest on measured performance rather than tautological mappings or renamings of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is empirical; the abstract states no explicit free parameters, axioms, or invented entities. The calibration relies on the standard statistical property that percentile ranks are distribution-free.
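For reference, the distribution-free property invoked here can be written out as follows. The first line is the standard probability integral transform; the finite-pool form with mid-rank ties is one common convention and is an assumption about the paper's choice, which this page does not state.

```latex
% PIT for a continuous score distribution
X \sim F,\quad F \text{ continuous} \;\Longrightarrow\; U = F(X) \sim \mathrm{Uniform}(0,1)

% Empirical version over a pool of n candidates x_1, \dots, x_n (mid-rank ties)
\hat{F}_n(x_i) = \frac{1}{n}\Bigl(\#\{j : x_j < x_i\} + \tfrac{1}{2}\,\#\{j : x_j = x_i\}\Bigr)

% Invariance: for any strictly increasing g, ranks computed on g(x) match ranks on x
\hat{F}_n^{\,g(x)}\bigl(g(x_i)\bigr) = \hat{F}_n^{\,x}(x_i)
```

The uniformity holds only relative to the distribution the scores are actually drawn from, so a percentile read off a mismatched reference sample carries no such guarantee.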

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

BridgeRAG: Training-Free Bridge-Conditioned Retrieval for Multi-Hop Question Answering
cs.IR · 2026-04 · conditional · novelty 6.0

BridgeRAG improves training-free multi-hop retrieval by using a bridge-conditioned LLM scorer to rank evidence chains, achieving new best R@5 scores on MuSiQue, 2WikiMultiHopQA, and HotpotQA.