Evaluating Factual Density in Multi-Source RAG: A Study in Medical AI Accuracy
Pith reviewed 2026-06-28 20:37 UTC · model grok-4.3
The pith
In a preliminary HealthFC evaluation, factual density reranking reached 100% systematic review saturation in the top-5 results, surfacing Cochrane evidence that cosine similarity ranked outside the top ten.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Factual Density (FD*) measures the proportion of verified atomic claims relative to total token count after probabilistic factuality analysis and Z-score normalization within length bins. On the HealthFC benchmark, FD*-optimized retrieval was the only condition to achieve 100 percent systematic review saturation in top-5 results, surfacing Cochrane evidence ranked outside the top ten by cosine similarity, with ground truth verification confirming 25 mappings across seven supported claims.
What carries the argument
Factual Density (FD*), the proportion of verified atomic claims to total token count, computed via probabilistic factuality analysis before corpus ingestion and made length-independent by Z-score normalization within length bins.
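The definition above can be made concrete with a minimal sketch. It assumes each chunk already carries a count of verified atomic claims from an upstream factuality step (not reproduced here), and it z-scores the raw density within each length bin. The names `Chunk`, `raw_density`, and `fd_star`, and the bin edges passed in, are illustrative assumptions rather than the paper's implementation.

```python
from bisect import bisect_right
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Chunk:
    chunk_id: str
    verified_claims: int  # assumed output of an upstream factuality step
    tokens: int


def raw_density(chunk: Chunk) -> float:
    """Verified atomic claims per token (the un-normalized signal)."""
    return chunk.verified_claims / chunk.tokens if chunk.tokens else 0.0


def fd_star(chunks: list[Chunk], bin_edges: list[int]) -> dict[str, float]:
    """Z-score raw density within length bins (bin_edges are hypothetical)."""
    bins: dict[int, list[Chunk]] = defaultdict(list)
    for c in chunks:
        bins[bisect_right(bin_edges, c.tokens)].append(c)

    scores: dict[str, float] = {}
    for members in bins.values():
        densities = [raw_density(c) for c in members]
        mu, sigma = mean(densities), pstdev(densities)
        for c, d in zip(members, densities):
            # A bin with no spread carries no ranking information: score 0.
            scores[c.chunk_id] = (d - mu) / sigma if sigma > 0 else 0.0
    return scores
```

Because each chunk is compared only with chunks of similar length, a short and a long chunk can receive the same score even when their raw claims-per-token values differ.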
If this is right
- FD* reranking surfaces Cochrane evidence that standard cosine similarity ranks beyond the top ten.
- Z-score normalization within length bins reduces the severe document-length confound (Pearson R = -0.8636, p = 2.27e-07) to a non-significant association (p = 0.0749).
- Ground truth verification confirms 25 mappings across seven HealthFC-supported claims under the FD* condition.
- Factual density reranking offers a low-cost intervention for factual precision in health RAG architectures.
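One way to check the length-confound claim in the list above is to correlate document length with the density score before and after normalization. The sketch below implements Pearson's r from its definition on made-up numbers; the values are illustrative only and do not reproduce the paper's R = -0.8636.

```python
from math import sqrt


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient computed from its definition."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: token counts and raw claims-per-token densities.
lengths = [100.0, 200.0, 300.0, 400.0]
raw_fd = [0.040, 0.030, 0.020, 0.010]
print(pearson_r(lengths, raw_fd))  # strongly negative: longer text, lower density
```

A non-significant correlation after normalization fails to reject the hypothesis of no linear association; on its own it does not prove that the score is independent of length.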
Where Pith is reading between the lines
- The same pre-scoring pipeline could be applied to other domains that maintain expert-verified claim sets, such as legal precedents or scientific abstracts.
- Hybrid ranking that combines FD* with existing similarity scores may improve overall recall at negligible extra cost.
- Extending the evaluation to the full n=50 query set would test whether the observed saturation advantage persists beyond the reported cases.
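The hybrid-ranking idea in the list above could look like the following sketch. The linear blend, the weight `alpha`, and the min-max rescaling are assumptions chosen for illustration; the paper does not specify a combination rule.

```python
def min_max(values: list[float]) -> list[float]:
    """Rescale to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]


def hybrid_rank(
    doc_ids: list[str],
    cosine: list[float],
    fd_star: list[float],
    alpha: float = 0.5,  # hypothetical weight on similarity
    k: int = 5,
) -> list[str]:
    """Return the top-k doc ids by alpha * similarity + (1 - alpha) * FD*."""
    sims, dens = min_max(cosine), min_max(fd_star)
    blended = [alpha * s + (1 - alpha) * d for s, d in zip(sims, dens)]
    order = sorted(range(len(doc_ids)), key=lambda i: blended[i], reverse=True)
    return [doc_ids[i] for i in order[:k]]
```

Setting `alpha=1.0` recovers pure similarity ranking and `alpha=0.0` recovers pure density ranking, which makes the two reported conditions the endpoints of a single sweep.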
Load-bearing premise
The probabilistic factuality analysis produces accurate, unbiased labels for atomic claims that remain independent of the retrieval ranking task.
What would settle it
Running the full evaluation on the complete set of 50 aligned queries and checking whether FD* still achieves 100 percent saturation while cosine similarity continues to miss the same Cochrane items.
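The settling experiment above amounts to computing a per-query saturation figure for each retrieval condition and comparing the conditions. The sketch below assumes that saturation means the share of top-k slots filled by systematic-review chunks; the source does not spell out the formula, so this reading, the tier labels, and the function names are all assumptions.

```python
def saturation_at_k(ranked_tiers: list[str], k: int = 5,
                    target: str = "systematic_review") -> float:
    """Fraction of the top-k results whose source tier equals `target`."""
    top = ranked_tiers[:k]
    return sum(t == target for t in top) / len(top) if top else 0.0


def mean_saturation(per_query: list[list[str]], k: int = 5) -> float:
    """Average saturation@k over queries for one retrieval condition."""
    return sum(saturation_at_k(q, k) for q in per_query) / len(per_query)
```

Running `mean_saturation` once per condition over the same 50 queries would give directly comparable figures for FD* reranking and cosine similarity.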
Original abstract
Retrieval-Augmented Generation (RAG) is the current industry standard for grounding AI in real-world facts. Traditional retrieval methods rely on keyword matching and topic proximity, ranking content based on how closely it sounds like the user's query. What they do not measure is how many verified facts the content actually contains. This structural gap, termed the Expert Blindness Effect, causes standard RAG pipelines to consistently bury high-density factual evidence in favor of lexically dominant text on the same topic. To address this gap, this paper introduces Factual Density (FD*), a novel retrieval optimization signal that measures the proportion of verified atomic claims relative to total token count. Using the NexusAgentics Ghost Audit preprocessing pipeline, raw text is scored for factual specificity using probabilistic factuality analysis to filter content before corpus ingestion. An initial formulation introduced a severe document-length confound (Pearson R = -0.8636, p = 2.27e-07). Implementing Z-score normalization within length bins resolved this bias, validating FD* as a length-independent density signal (p = 0.0749). Evaluated against the HealthFC benchmark (750 health claims labeled Supported, Refuted, or No Evidence by medical experts), FD*-optimized retrieval was the only condition to achieve 100% systematic review saturation in top-5 results, surfacing Cochrane evidence that standard cosine similarity ranked outside the top ten. Ground truth verification confirmed 25 mappings across seven HealthFC-supported claims. While full statistical validation across n=50 queries remains future work due to constraints on corpus-benchmark alignment, these findings establish factual density reranking as a low-cost, high-impact intervention for improving factual precision in health RAG architectures.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that standard RAG retrieval suffers from an 'Expert Blindness Effect' by favoring lexically similar but low-fact-density text, and introduces Factual Density (FD*)—defined as the proportion of verified atomic claims (from probabilistic factuality analysis in the Ghost Audit pipeline) to total tokens—as a retrieval signal. After observing a strong negative length correlation (Pearson R = -0.8636) in an initial formulation, the authors apply Z-score normalization within length bins to produce a length-independent signal (post-fix p = 0.0749). On the HealthFC benchmark, FD*-optimized retrieval is reported as the only method achieving 100% systematic review saturation in top-5 results, surfacing Cochrane evidence missed by cosine similarity, with ground-truth verification of 25 mappings across seven claims; full statistical validation across n=50 queries is noted as future work.
Significance. If the central claims hold after proper validation, FD* could provide a lightweight, domain-agnostic reranking signal that improves factual precision in medical RAG without requiring changes to the underlying retriever. The reported 100% saturation outcome and the explicit contrast with cosine similarity on a concrete benchmark constitute a falsifiable prediction that, if replicated, would be of practical interest to health-AI systems. However, the current evidence base is preliminary and the significance is constrained by the absence of independent validation for the factuality labels on which FD* depends.
major comments (3)
- [Abstract] Abstract and Ghost Audit pipeline description: FD* is defined using probabilistic factuality labels that are applied both to filter the corpus before ingestion and to compute the density scores yielding the 100% saturation result. No inter-annotator agreement, expert validation set, calibration details, or comparison against human medical labels is reported for this analysis. Because the performance gain and the length-normalization fix rest on these labels, the absence of external validation is load-bearing for the central claim.
- [Abstract] Abstract: The length confound (Pearson R = -0.8636, p = 2.27e-07) was identified on the same data used to motivate and evaluate the Z-score normalization fix (post-fix p = 0.0749). This raises the possibility that the normalization boundaries and the reported independence are post-hoc adjustments rather than an a-priori, held-out test of the FD* signal.
- [Abstract] Abstract: The 100% top-5 saturation claim and the statement that 'FD*-optimized retrieval was the only condition' to achieve it are presented without error bars, multiple-run statistics, or the full n=50 query results (explicitly deferred to future work). The ground-truth verification of 25 mappings is mentioned but not broken down by query or retrieval condition, making it impossible to assess robustness of the superiority claim.
minor comments (2)
- [Abstract] The abstract states that full statistical validation 'remains future work due to constraints on corpus-benchmark alignment'; a brief description of those alignment constraints would help readers understand the scope of the current results.
- No table or supplementary material is referenced that lists the 25 verified mappings or the seven HealthFC claims, which would allow independent inspection of the ground-truth verification step.
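The request for error bars in the comments above could be met with a percentile bootstrap over per-query scores. This is a generic sketch, not the authors' analysis; the default seed and resample count are arbitrary choices.

```python
import random


def bootstrap_ci(values: list[float], n_resamples: int = 2000,
                 level: float = 0.95, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample queries with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]
```

If every query scores 1.0, the percentile interval collapses to a single point, so a small set of uniformly perfect results cannot by itself convey uncertainty. That is one reason the larger n=50 evaluation matters.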
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the preliminary nature of our study. We address each major comment point-by-point below with honest clarifications on what the current manuscript can and cannot support.
Referee: [Abstract] Abstract and Ghost Audit pipeline description: FD* is defined using probabilistic factuality labels that are applied both to filter the corpus before ingestion and to compute the density scores yielding the 100% saturation result. No inter-annotator agreement, expert validation set, calibration details, or comparison against human medical labels is reported for this analysis. Because the performance gain and the length-normalization fix rest on these labels, the absence of external validation is load-bearing for the central claim.
Authors: We agree that the manuscript does not report inter-annotator agreement, expert validation sets, or direct comparisons of the Ghost Audit probabilistic labels against human medical annotations. The pipeline is presented as an automated, lightweight preprocessing tool rather than a human-validated factuality oracle. This reliance is a genuine limitation for the central claims. We will revise the manuscript to add an explicit Limitations section discussing the probabilistic nature of the labels and the absence of external validation.
Revision: yes
Referee: [Abstract] Abstract: The length confound (Pearson R = -0.8636, p = 2.27e-07) was identified on the same data used to motivate and evaluate the Z-score normalization fix (post-fix p = 0.0749). This raises the possibility that the normalization boundaries and the reported independence are post-hoc adjustments rather than an a-priori, held-out test of the FD* signal.
Authors: The length correlation was observed during exploratory analysis on the HealthFC corpus, prompting the development of the Z-score normalization within length bins as a methodological correction. The post-fix p-value reflects the outcome of that correction applied uniformly to the same benchmark. While we acknowledge the data overlap, the normalization procedure is deterministic and was not tuned to achieve a specific result on held-out data. We will add a sentence clarifying the exploratory origin of the fix but maintain that it produces a length-independent signal as reported.
Revision: partial
Referee: [Abstract] Abstract: The 100% top-5 saturation claim and the statement that 'FD*-optimized retrieval was the only condition' to achieve it are presented without error bars, multiple-run statistics, or the full n=50 query results (explicitly deferred to future work). The ground-truth verification of 25 mappings is mentioned but not broken down by query or retrieval condition, making it impossible to assess robustness of the superiority claim.
Authors: We agree the 100% saturation result is presented without error bars, multiple runs, or the full n=50 statistics, and the ground-truth verification of 25 mappings lacks per-query breakdown. The manuscript already states that full statistical validation across n=50 queries is future work due to corpus-benchmark alignment constraints. We will revise the abstract and results to emphasize the preliminary character of the 100% figure, remove any implication of definitive superiority, and note the limited scope of the 25-mapping verification.
Revision: yes
Points the rebuttal leaves unresolved:
- Independent expert validation or inter-annotator agreement for the Ghost Audit probabilistic factuality labels
- Full n=50 query statistical results with error bars and per-condition breakdowns, which are deferred to future work
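The post-hoc concern discussed in the rebuttal can be tested by estimating the per-bin mean and standard deviation on one split and applying them unchanged to a held-out split. The sketch below assumes (token count, raw density) pairs and hypothetical bin edges; it shows only the separation between fitting and applying, not the authors' procedure.

```python
from bisect import bisect_right
from collections import defaultdict
from statistics import mean, pstdev
from typing import Optional

Pair = tuple[int, float]  # (token count, raw density)
BinStats = dict[int, tuple[float, float]]  # bin index -> (mean, std)


def fit_bin_stats(train: list[Pair], edges: list[int]) -> BinStats:
    """Estimate (mean, std) of raw density per length bin on the training split."""
    grouped: dict[int, list[float]] = defaultdict(list)
    for tokens, density in train:
        grouped[bisect_right(edges, tokens)].append(density)
    return {b: (mean(v), pstdev(v)) for b, v in grouped.items()}


def apply_bin_stats(held_out: list[Pair], edges: list[int],
                    stats: BinStats) -> list[Optional[float]]:
    """Z-score held-out densities with frozen training statistics.

    Returns None for a chunk whose bin was unseen or had zero spread in training.
    """
    out: list[Optional[float]] = []
    for tokens, density in held_out:
        mu, sigma = stats.get(bisect_right(edges, tokens), (0.0, 0.0))
        out.append((density - mu) / sigma if sigma > 0 else None)
    return out
```

Measuring the length correlation on the held-out scores, rather than on the data used to choose the bins, would separate the a-priori claim from the exploratory fix.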
Circularity Check
No significant circularity in derivation chain
The paper defines FD* using probabilistic factuality analysis from the Ghost Audit pipeline as a preprocessing step, observes a length confound on initial formulation, applies Z-score normalization within bins, and evaluates the resulting retrieval on the external HealthFC benchmark with expert labels and ground-truth mappings. No equations or steps are shown that reduce the reported 100% saturation result or performance claims to the inputs by construction. The benchmark evaluation provides independent verification separate from the internal scoring pipeline, making the central claim self-contained against external data.
Axiom & Free-Parameter Ledger
free parameters (1)
- length bin boundaries for Z-score normalization
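Because the bin boundaries are the one free parameter listed above, one option for reducing hand-tuning is to derive them from length quantiles. This is an assumption for illustration; the paper does not say how its boundaries were chosen, and `quantile_edges` is a hypothetical helper.

```python
from statistics import quantiles


def quantile_edges(token_counts: list[int], n_bins: int = 4) -> list[float]:
    """Cut points that split lengths into n_bins roughly equal-sized bins."""
    if n_bins < 2:
        raise ValueError("need at least two bins")
    # The "inclusive" method treats the data as the full population of lengths.
    return quantiles(token_counts, n=n_bins, method="inclusive")
```

Reporting results for a few values of `n_bins` would show how sensitive the length-independence finding is to this choice.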
axioms (2)
- domain assumption: Probabilistic factuality analysis produces accurate counts of verified atomic claims
- domain assumption: HealthFC labels constitute reliable ground truth for medical claim support
invented entities (2)
- Factual Density (FD*): no independent evidence
- Expert Blindness Effect: no independent evidence