TTFT-Aware Graph Chain-of-Thought:Distance-Indexed Neural A* for Low-Hallucination Multi-Hop Medical Reasoning
Pith reviewed 2026-06-26 08:42 UTC · model grok-4.3
The pith
Hybrid PLL and AStarNet navigation on a 700K-node medical graph constrains LLM outputs to scored paths, yielding lower TTFT and fewer hallucinations than text-only RAG on fertility queries.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By indexing the medical graph with directed PLL for sub-millisecond distance oracles and restricting a lightweight neural A* heuristic to the resulting corridors, the method enumerates a small diverse set of paths that are scored and used to condition generation; the resulting system improves the latency-recall Pareto frontier, reduces TTFT, and lowers audited hallucinations relative to text-only RAG while preserving explanation clarity.
What carries the argument
The directed Pruned Landmark Labeling (PLL) oracle for exact distances and simple-path enumeration, paired with an AStarNet heuristic that stays inside the PLL corridor to prioritize expansions.
If this is right
- The hybrid establishes a better latency/recall Pareto frontier than text-only RAG and single-component baselines on fertility queries.
- It lowers TTFT through compact path-conditioned prompts.
- It reduces clinician-audited hallucinations while preserving explanation clarity.
- Paths are selected using CUI/semantic-type overlap, length prior, and provenance priors before generation.
Where Pith is reading between the lines
- The same corridor-constrained search could be tested on non-fertility medical queries to check domain transfer.
- If graph completeness varies across specialties, path coverage gaps would directly limit reasoning quality on those topics.
- Adding provenance weighting as an explicit tunable parameter might allow clinicians to trade explanation length against source reliability.
- The approach could be combined with existing retrieval methods to further tighten the observed Pareto frontier.
Load-bearing premise
Paths enumerated and scored from the medical knowledge graph correspond to clinically valid and sufficient reasoning steps for the target queries.
What would settle it
A head-to-head clinician audit on the same fertility queries that finds equal or higher hallucination rates for the PLL+AStarNet system than for text-only RAG.
Figures
read the original abstract
Hallucinations and opaque reasoning remain unacceptable failure modes for clinical LLMs. We present a production-grade GraphRAG stack that constrains answers to verifiable graph chain-of-thought paths in a heterogeneous, ~700K-node medical knowledge graph powering a fertility assistant. The core idea is targeted navigation: a directed Pruned Landmark Labeling (PLL) oracle provides exact distances for sub-millisecond feasibility checks and simple-path enumeration, while a lightweight AStarNet heuristic operates strictly within the PLL corridor to prioritize clinically plausible expansions. We score and pack a small, diverse set of paths (CUI/semantic-type overlap, length prior, provenance priors) to condition generation, yielding compact prompts and improved Time to First Token (TTFT). On fertility-focused queries, the hybrid (PLL+AStarNet) establishes a better latency/recall Pareto frontier than text-only RAG and single-component baselines, lowers TTFT, and reduces clinician-audited hallucinations while preserving explanation clarity. The result is a practical recipe for explainable, low-hallucination multi-hop medical reasoning ready for real-world deployment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces TTFT-Aware Graph Chain-of-Thought, a method that uses a Pruned Landmark Labeling (PLL) oracle for exact distance queries and a neural A* heuristic (AStarNet) to enumerate and score paths in a ~700K-node medical knowledge graph. These paths condition LLM generation to reduce hallucinations in multi-hop medical reasoning for a fertility assistant. The hybrid approach is claimed to achieve a superior latency/recall Pareto frontier, lower Time to First Token (TTFT), and fewer clinician-audited hallucinations compared to text-only RAG and single-component baselines while maintaining explanation clarity.
Significance. If the performance claims and clinical validity of the selected paths hold, the work could provide a practical, deployable framework for explainable medical reasoning that balances efficiency and accuracy in production settings. The combination of exact graph oracles with learned heuristics for path prioritization is a notable technical contribution for constrained generation.
major comments (2)
- [Abstract] Abstract: The claims of establishing a better latency/recall Pareto frontier, lowering TTFT, and reducing hallucinations are presented without any quantitative results, baselines, error bars, or details of the experimental protocol, which prevents evaluation of the central empirical claims.
- [Method] Path scoring and enumeration description: The assumption that PLL-enumerated paths scored by CUI/semantic-type overlap, length prior, and provenance priors are clinically valid and sufficient reasoning steps for fertility queries is load-bearing for the low-hallucination claim but unsupported by coverage metrics, external validation against clinical guidelines, or inter-clinician agreement on path adequacy.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments point by point below and indicate planned revisions.
read point-by-point responses
-
Referee: [Abstract] Abstract: The claims of establishing a better latency/recall Pareto frontier, lowering TTFT, and reducing hallucinations are presented without any quantitative results, baselines, error bars, or details of the experimental protocol, which prevents evaluation of the central empirical claims.
Authors: We agree that the abstract would benefit from including key quantitative results. In the revised version we will insert concise metrics (e.g., latency/recall deltas versus text-only RAG and single-component baselines, TTFT reduction, and hallucination-rate improvement with clinician audit) while preserving brevity. revision: yes
-
Referee: [Method] Path scoring and enumeration description: The assumption that PLL-enumerated paths scored by CUI/semantic-type overlap, length prior, and provenance priors are clinically valid and sufficient reasoning steps for fertility queries is load-bearing for the low-hallucination claim but unsupported by coverage metrics, external validation against clinical guidelines, or inter-clinician agreement on path adequacy.
Authors: The scoring priors are derived directly from the curated medical graph (CUI overlap, semantic-type constraints, provenance, and length). The clinician-audited hallucination reduction provides indirect empirical support for path utility. We will add coverage statistics and an expanded justification of the priors in the method section; however, a full external clinical-guideline validation study lies outside the scope of the present work. revision: partial
Circularity Check
No circularity; method relies on external oracles and independent scoring
full rationale
The paper describes a GraphRAG pipeline that combines an external Pruned Landmark Labeling (PLL) oracle for exact distances, a separate lightweight AStarNet heuristic, and path scoring via CUI/semantic-type overlap plus length/provenance priors. These components are presented as inputs to constrained generation rather than derived from the target metrics (TTFT, recall, hallucination rate). No equations, fitted parameters, or self-citations are shown that would reduce any claimed result to a tautology or renamed input. The performance claims on fertility queries are empirical comparisons against baselines, not predictions forced by construction from the same data used to tune the system. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The medical knowledge graph accurately and completely represents the clinical relationships needed for fertility queries
Reference graph
Works this paper leans on
-
[1]
In: Proceedings of the 20th Annual European Symposium on Algorithms (ESA)
Abraham, I., Delling, D., Goldberg, A.V., Werneck, R.F.: Hierarchical hub labelings for shortest paths. In: Proceedings of the 20th Annual European Symposium on Algorithms (ESA). pp. 24–35 (2012). https://doi.org/10.1007/ 978-3-642-33090-2_3
2012
-
[2]
In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Akiba, T., Iwata, Y., Yoshida, Y.: Fast exact shortest-path distance queries on large networks by pruned landmark labeling. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. pp. 349–360 (2013). https://doi.org/10.1145/2463676.2465312, https://arxiv.org/abs/1304.4661
-
[3]
In: Companion Proceedings of the 23rd International Conference on World Wide Web (WWW Companion)
Akiba, T., Iwata, Y., Yoshida, Y.: Dynamic and historical shortest-path distance queries on large evolving networks by pruned landmark labeling. In: Companion Proceedings of the 23rd International Conference on World Wide Web (WWW Companion). pp. 237–238 (2014). https://doi.org/10.1145/2567948.2579222
-
[4]
doi:10.1038/s41746-025-01670-7 , issn =
Asgari, E., Montaña-Brown, N., Dubois, M., Khalil, S., Balloch, J., Au Ye- ung, J., Pimenta, D.: A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation. npj Digital Medicine8(274), 1– 15 (2025). https://doi.org/10.1038/s41746-025-01670-7, https://www.nature.com/ articles/s41746-025-01670-7
-
[5]
Nucleic Acids Research32(suppl_1), D267–D270 (2004)
Bodenreider, O.: The unified medical language system (UMLS): Integrating biomedical terminology. Nucleic Acids Research32(suppl_1), D267–D270 (2004). https://doi.org/10.1093/nar/gkh061
-
[6]
Scientific Data10(67), 1–16 (2023)
Chandak, P., Huang, K., Zitnik, M.: Building a knowledge graph to enable pre- cision medicine. Scientific Data10(67), 1–16 (2023). https://doi.org/10.1038/ s41597-023-01960-3, https://www.nature.com/articles/s41597-023-01960-3 14 B. Dardouri et al
2023
-
[7]
GitHub repository (2024), https://github.com/DeepGraphLearning/AStarNet
DeepGraphLearning: AStarNet: Official implementation of a* networks. GitHub repository (2024), https://github.com/DeepGraphLearning/AStarNet
2024
-
[8]
arXiv preprint (2024), https://arxiv.org/abs/2404.16130
Edge, D., Trinh, H., Cheng, N., Bradley, J., Chao, A., Mody, A., Truitt, S., Larson, J.: From local to global: A graph RAG approach to query-focused summarization. arXiv preprint (2024), https://arxiv.org/abs/2404.16130
Pith/arXiv arXiv 2024
-
[9]
SIAM Journal on Computing28(2), 652–673 (1999)
Eppstein, D.: Finding the k shortest paths. SIAM Journal on Computing28(2), 652–673 (1999). https://doi.org/10.1137/S0097539795290477
-
[10]
ExplosionAI:ScispaCyEntityLinkerdocumentation.Documentation(2024),https: //scispacy.readthedocs.io/en/latest/linking.html
2024
-
[11]
In: Proceedings of the 16th Annual ACM–SIAM Symposium on DiscreteAlgorithms (SODA)
Goldberg, A.V., Harrelson, C.: Computing the shortest path: A* search meets graph theory. In: Proceedings of the 16th Annual ACM–SIAM Symposium on DiscreteAlgorithms (SODA). pp.156–165(2005), https://dl.acm.org/doi/10.5555/ 1070432.1070455
arXiv 2005
-
[12]
Hart, P.E., Nilsson, N.J., Raphael, B.: A formal basis for the heuristic determina- tion of minimum cost paths. IEEE Transactions on Systems Science and Cybernet- ics4(2), 100–107 (1968). https://doi.org/10.1109/TSSC.1968.300136
-
[13]
Hershberger, J., Suri, S.: Vickrey prices and shortest paths: What is an edge worth? In: Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science (FOCS). pp. 252–259 (2003). https://doi.org/10.1109/SFCS.2003.1238196
-
[14]
GitHub repository (2013), https://github.com/iwiwi/pruned-landmark-labeling
Iwata, Y.: Pruned landmark labeling (PLL): Reference implementation. GitHub repository (2013), https://github.com/iwiwi/pruned-landmark-labeling
2013
-
[15]
Artificial Intelligence in Medicine117, 102083 (2021)
Kraljevic, Z., Searle, T., Shek, A., Roguski, L., Noor, K., Bean, D., Mascio, A., Zhu, L., Folarin, A.A., Roberts, A., Bendayan, R., Richardson, M.P., Stewart, R., Shah, A.D., Wong, W.K., Ibrahim, Z., Teo, J.T., Dobson, R.J.B.: Multi- domain clinical natural language processing with MedCAT: The medical con- cept annotation toolkit. Artificial Intelligence...
-
[16]
Product documentation (2025), https://neo4j.com/docs/
Neo4j, Inc.: Neo4j graph database documentation. Product documentation (2025), https://neo4j.com/docs/
2025
-
[17]
Qi, X., Zeng, Y ., Xie, T., Chen, P.-Y ., Jia, R., Mittal, P., and Henderson, P
Neumann, M., King, D., Beltagy, I., Ammar, W.: ScispaCy: Fast and robust mod- els for biomedical natural language processing. In: Proceedings of the 18th BioNLP Workshop and Shared Task. pp. 319–327 (2019). https://doi.org/10.18653/v1/ W19-5034, https://aclanthology.org/W19-5034/
-
[18]
Product documentation (2020), https://docs.nvidia.com/dgx/pdf/dgx-a100-user-guide.pdf
NVIDIA: NVIDIA DGX A100 system: User guide. Product documentation (2020), https://docs.nvidia.com/dgx/pdf/dgx-a100-user-guide.pdf
2020
-
[19]
NVIDIA NIM Benchmarking Documentation (2024), https: //docs.nvidia.com/nim/benchmarking/llm/latest/metrics.html
NVIDIA: LLM inference benchmarking: Fundamental concepts (time to first token and related metrics). NVIDIA NIM Benchmarking Documentation (2024), https: //docs.nvidia.com/nim/benchmarking/llm/latest/metrics.html
2024
-
[20]
NVIDIA NIM Benchmarking Documentation (2024), https://docs.nvidia.com/ nim/benchmarking/llm/latest/parameters.html
NVIDIA: LLM inference benchmarking: Parameters and best practices. NVIDIA NIM Benchmarking Documentation (2024), https://docs.nvidia.com/ nim/benchmarking/llm/latest/parameters.html
2024
-
[21]
Management Science 17(11), 712–716 (1971)
Yen, J.Y.: Finding the k shortest loopless paths in a network. Management Science 17(11), 712–716 (1971). https://doi.org/10.1287/mnsc.17.11.712
-
[22]
Databricks Blog (2023), https://www.databricks
Zaharia,M.,Lee,J.,Wendell,P.,Chien,C.,Graves,P.:Timetofirsttoken(TTFT): What it is and how to improve it. Databricks Blog (2023), https://www.databricks. com/blog/time-first-token-ttft-what-it-and-how-improve-it
2023
-
[23]
In: Advances in Neural Information Processing Systems (NeurIPS)
Zhu, Z., Yuan, X., Galkin, M., Xhonneux, S., Zhang, M., Gazeau, M., Tang, J.: A*net: A scalable path-based reasoning approach for knowledge graphs. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 36 (2023), https://arxiv.org/abs/2206.04798
arXiv 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.