pith. sign in

arxiv: 2605.23559 · v1 · pith:E4BA7S56new · submitted 2026-05-22 · 💻 cs.CV · cs.AI

PathNavigate: A Training-Free Pathology Agent with Surprise-Guided Scan and Shared Slide Memory for Whole-Slide Image VQA

Pith reviewed 2026-05-25 05:03 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords whole-slide imagevisual question answeringtraining-free agentpathology navigationsurprise-guided scanshared slide memoryWSI-VQA
0
0 comments X

The pith

PathNavigate first builds a slide-specific map of abnormal regions at low magnification using shared memory, then searches only inside that map for question-relevant high-resolution evidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a training-free agent for visual question answering on gigapixel whole-slide pathology images. It replaces the common question-first relevance search with a scan-search-readout sequence: a shared online memory over frozen features first generates a surprise field marking abnormal regions across the entire slide at low magnification; question-conditioned matching then occurs only inside that pool to pick high-magnification targets; finally a frozen perceptor-adjudicator stack reads the answer while reusing the same memory for slide-level context. This design targets the problem that decisive morphology is often unnamed in the query and therefore missed by purely query-driven candidate selection. Experiments on WSI-VQA and SlideBench-BCNB report higher answer accuracy, more interpretable trajectories, and better efficiency under fixed inspection budgets. A sympathetic reader cares because the method keeps all core models frozen, removing the need to retrain for each new pathology task or slide collection.

Core claim

PathNavigate establishes that a shared online memory over frozen pathology features can produce a low-magnification surprise field marking an abnormal-region pool before any question matching occurs; question-conditioned PLIP relevance is then restricted to this pool to select high-magnification search targets; local evidence is extracted and answered by a frozen perceptor-adjudicator stack that re-uses the same memory as slide-level context, yielding improved accuracy and efficiency on WSI-VQA and SlideBench-BCNB without task-specific training.

What carries the argument

The surprise field generated by the shared online memory module over frozen features, which identifies abnormal regions independently of the query before relevance filtering begins.

If this is right

  • Evidence selection becomes independent of whether the query explicitly names the morphology, allowing the agent to surface relevant regions that query-first methods overlook.
  • The same online memory serves both the initial scan and the final readout, providing consistent slide-level context without additional modules.
  • Navigation stays training-free because only frozen feature extractors and relevance scorers are used throughout.
  • Evidence trajectories are produced by the two-stage process and are therefore directly inspectable by a pathologist.
  • The design operates under a fixed inspection budget by limiting high-magnification visits to the surprise pool.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same surprise-first scan could be tested on other gigapixel domains such as remote-sensing imagery where initial anomaly detection narrows later query-driven search.
  • Replacing the frozen feature extractor with a newer one would directly update the surprise field without any retraining of the agent itself.
  • The approach implicitly assumes that low-magnification features correlate strongly with the presence of high-magnification diagnostic morphology; measuring that correlation on held-out slides would quantify the assumption's strength.
  • If the shared memory is made persistent across multiple slides from the same case, the agent could carry context between slides without extra engineering.

Load-bearing premise

The low-magnification surprise field reliably contains the regions holding the decisive high-resolution morphology needed to answer the query, even when that morphology is never mentioned in the question.

What would settle it

Run the agent on a curated set of slides where pathologists have pre-marked the exact high-resolution patches containing the diagnostic features; if the surprise field excludes more than a small fraction of those patches on average, the claimed accuracy and efficiency gains would not appear.

Figures

Figures reproduced from arXiv: 2605.23559 by Chen Li, Chunze Yang, Di Zhang, Jiashuai Liu, Jiusong Ge, Junbo Lu, Lei Wu, Ni Zhang, Qidong Liu, Wenjie Zhao, Xian Wu, Yue Tang, Zeyu Gao.

Figure 1
Figure 1. Figure 1: The comparison between existing patterns and the proposed [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of PathNavigate as a scan-search-readout pathology agent. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Relative WSI-VQA cost for training-free agents. [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Diagnostic pathway comparison. PathAgent follows weak tissue and predicts squamous [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Agent-level WSI-VQA comparison between the PathAgent row and PathNavigate under [PITH_FULL_IMAGE:figures/full_fig_p020_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Task-wise SlideBench-BCNB comparison between Patho-R1 alone and PathNavigate un [PITH_FULL_IMAGE:figures/full_fig_p021_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Representative workflow case for PathNavigate on WSI-VQA. From left to right, the [PITH_FULL_IMAGE:figures/full_fig_p022_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Compact full-set ablation impact summary. Each bar shows the total-accuracy delta against [PITH_FULL_IMAGE:figures/full_fig_p022_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: α sweep on WSI-VQA. The default α=0.5 lies on the empirical peak, while the endpoint drops support the interpretation that PLIP is most useful as a within-pool refinement signal rather than as the primary proposal rule [PITH_FULL_IMAGE:figures/full_fig_p023_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Candidate-pool constructor effect on BCNB. Bars show task-wise deltas relative to the [PITH_FULL_IMAGE:figures/full_fig_p024_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Detailed per-question cost profile for training-free agents, complementing the main-paper [PITH_FULL_IMAGE:figures/full_fig_p024_11.png] view at source ↗
read the original abstract

Whole-slide image visual question answering (WSI-VQA) frames pathology as an extreme-context search problem: to answer a free-form clinical query, a system must first navigate a gigapixel slide under a strict inspection budget to locate sparse, high-resolution evidence. Existing approaches largely fall into two paradigms: i) supervised pathology multimodal large language models (MLLMs) and agents can absorb localization and reasoning into learned modules, but they often couple navigation to task-specific supervision and retraining, limiting their practicality; ii) training-free pathology agents avoid this cost by keeping core models frozen, but often follow a question-first design, constructing the initial candidate set mainly from query-conditioned relevance. This can miss decisive morphology that is not named in the question, and force heavier inference-time scaffolding. To address this challenge, we introduce PathNavigate, a training-free pathology agent built around a scan-search-readout routine. Before question matching, PathNavigate scans the current slide at low magnification with a shared online memory module over frozen pathology features, producing a slide-specific surprise field that marks an abnormal-region pool. It then applies question-conditioned PLIP relevance only within this pool to select high-magnification search targets. Finally, it extracts local high-magnification evidence and answers with a frozen perceptor-adjudicator stack, using the same online memory as slide-level context. Experiments on WSI-VQA and SlideBench-BCNB show that the proposed scan-search-readout design improves answer accuracy and yields more interpretable evidence-selection trajectories with higher efficiency.The code is available online.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces PathNavigate, a training-free pathology agent for whole-slide image visual question answering (WSI-VQA). It proposes a scan-search-readout routine: a low-magnification scan with a shared online memory module over frozen pathology features produces a slide-specific surprise field marking an abnormal-region pool; question-conditioned PLIP relevance is then restricted to this pool to select high-magnification targets; finally, local high-magnification evidence is extracted and the query is answered by a frozen perceptor-adjudicator stack that reuses the online memory as slide-level context. Experiments on WSI-VQA and SlideBench-BCNB are stated to demonstrate improved answer accuracy, more interpretable evidence-selection trajectories, and higher efficiency. The code is released publicly.

Significance. If the results hold, the work provides a practical training-free alternative to supervised multimodal LLMs for pathology VQA by addressing the limitation of question-first designs that can miss decisive morphology unnamed in the query. The surprise-guided navigation and shared memory mechanism could improve both accuracy and interpretability in gigapixel extreme-context search settings. The public code release supports reproducibility.

major comments (1)
  1. [Experiments] Experiments section: the central premise that the low-magnification surprise field (produced by shared online memory over frozen features) reliably contains the pool of regions holding query-decisive high-resolution morphology is not supported by any reported recall metric of ground-truth evidence patches inside the surprise pool, nor by an ablation isolating the surprise step or quantification of the surprise field (e.g., reconstruction error or density threshold). This assumption is load-bearing for the claim that restricting PLIP search to the pool improves accuracy over question-first baselines; if subtle high-res features are excluded at low mag, the pipeline cannot recover them.
minor comments (1)
  1. [Abstract] Abstract: asserts performance gains on WSI-VQA and SlideBench-BCNB but supplies no metrics, baselines, statistical details, or ablation results, which hinders immediate assessment of the claimed improvements.

Simulated Author's Rebuttal

1 responses · 1 unresolved

We thank the referee for the constructive comment on the experimental validation of the surprise-guided mechanism. We address the concern point by point below.

read point-by-point responses
  1. Referee: Experiments section: the central premise that the low-magnification surprise field (produced by shared online memory over frozen features) reliably contains the pool of regions holding query-decisive high-resolution morphology is not supported by any reported recall metric of ground-truth evidence patches inside the surprise pool, nor by an ablation isolating the surprise step or quantification of the surprise field (e.g., reconstruction error or density threshold). This assumption is load-bearing for the claim that restricting PLIP search to the pool improves accuracy over question-first baselines; if subtle high-res features are excluded at low mag, the pipeline cannot recover them.

    Authors: We agree that a direct recall metric on ground-truth evidence patches would provide stronger support for the assumption. However, neither WSI-VQA nor SlideBench-BCNB provides region-level annotations of ground-truth evidence patches, so such a recall cannot be computed without new labeling. The reported end-to-end accuracy gains, efficiency improvements, and qualitative trajectory analysis offer indirect support for the scan-search-readout design. We will add an ablation isolating the surprise step together with quantification of the surprise field (e.g., density thresholds) in the revision. revision: partial

standing simulated objections not resolved
  • Reporting recall of ground-truth evidence patches inside the surprise pool, because the evaluation datasets lack the required region-level annotations.

Circularity Check

0 steps flagged

No significant circularity; empirical design with no derivations or fitted predictions

full rationale

The paper describes an empirical training-free agent architecture (scan with low-mag surprise field from shared online memory over frozen features, followed by question-conditioned search within that pool, then readout). No equations, parameter fits, uniqueness theorems, or self-citation chains are present in the provided text. The central premise is an engineering choice whose validity is tested via experiments on WSI-VQA and SlideBench-BCNB rather than derived from prior inputs. This matches the reader's assessment of negligible circularity burden. The design is self-contained against external benchmarks and does not reduce any claimed result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated in the provided text.

pith-pipeline@v0.9.0 · 5863 in / 1099 out tokens · 21793 ms · 2026-05-25T05:03:00.889425+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 8 internal anchors

  1. [1]

    Titans: Learning to Memorize at Test Time

    Ali Behrouz, Peilin Zhong, and V ahab Mirrokni. Titans: Learning to memorize at test time. arXiv preprint arXiv:2501.00663,

  2. [2]

    Pathagent: Toward interpretable analysis of whole-slide pathol- ogy images via large language model-based agentic reasoning

    Jingyun Chen, Linghan Cai, Zhikang Wang, Yi Huang, Songhan Jiang, Shenjin Huang, Hongpeng Wang, and Y ongbing Zhang. Pathagent: Toward interpretable analysis of whole-slide pathol- ogy images via large language model-based agentic reasoning. arXiv preprint arXiv:2511.17052, 2025a. Pingyi Chen, Honglin Li, Chenglu Zhu, Sunyi Zheng, Zhongyi Shui, and Lin Y ...

  3. [3]

    Gsco: Towards generalizable ai in medicine via generalist-specialist collaboration

    Sunan He, Y uxiang Nie, Hongmei Wang, Shu Y ang, Yihui Wang, Zhiyuan Cai, Zhixuan Chen, Yingxue Xu, Luyang Luo, Huiling Xiang, et al. Gsco: Towards generalizable ai in medicine via generalist-specialist collaboration. arXiv preprint arXiv:2404.15127,

  4. [4]

    PathVQA: 30000+ Questions for Medical Visual Question Answering

    Xuehai He, Yichen Zhang, Luntian Mou, Eric Xing, and Pengtao Xie. Pathvqa: 30000+ questions for medical visual question answering. arXiv preprint arXiv:2003.10286,

  5. [5]

    Survagent: Hierarchical cot-enhanced case banking and dichotomy-based multi- agent system for multimodal survival prediction

    Guolin Huang, Wenting Chen, Jiaqi Y ang, Xinheng Lyu, Xiaoling Luo, Sen Y ang, Xiaohan Xing, and Linlin Shen. Survagent: Hierarchical cot-enhanced case banking and dichotomy-based multi- agent system for multimodal survival prediction. arXiv preprint arXiv:2511.16635,

  6. [6]

    Act like a pathologist: Tissue-aware whole slide image reasoning

    Wentao Huang, Weimin Lyu, Peiliang Lou, Qingqiao Hu, Xiaoling Hu, Shahira Abousamra, Wen- chao Han, Ruifeng Guo, Jiawei Zhou, Chao Chen, et al. Act like a pathologist: Tissue-aware whole slide image reasoning. arXiv preprint arXiv:2603.00667,

  7. [7]

    GPT-4o System Card

    11 Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Os- trow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276,

  8. [8]

    Multimodal retrieval-augmented generation with large language models for medical vqa

    AHM Karim and Ozlem Uzuner. Multimodal retrieval-augmented generation with large language models for medical vqa. arXiv preprint arXiv:2510.13856,

  9. [9]

    A co-evolving agentic ai system for medical imaging analysis

    Songhao Li, Jonathan Xu, Tiancheng Bao, Y uxuan Liu, Y uchen Liu, Yihang Liu, Lilin Wang, Wen- hui Lei, Sheng Wang, Yinuo Xu, et al. A co-evolving agentic ai system for medical imaging analysis. arXiv preprint arXiv:2509.20279,

  10. [10]

    AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments

    Samuel Schmidgall, Rojin Ziaei, Carl Harris, Eduardo Reis, Jeffrey Jopling, and Michael Moor. Agentclinic: a multimodal agent benchmark to evaluate ai in simulated clinical environments. arXiv preprint arXiv:2405.07960,

  11. [11]

    MedGemma Technical Report

    Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, et al. MedGemma technical report. arXiv preprint arXiv:2507.05201,

  12. [12]

    OpenAI GPT-5 System Card

    Mehmet Saygin Seyfioglu, Wisdom O Ikezogwo, Fatemeh Ghezloo, Ranjay Krishna, and Linda Shapiro. Quilt-llava: Visual instruction tuning by extracting localized narratives from open-source histopathology videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13183–13192, 2024a. Mehmet Saygin Seyfioglu, Wisdom O Ikezo...

  13. [13]

    Cpath-omni: A unified multimodal foundation model for patch and whole slide image analysis in computational pathology

    12 Y uxuan Sun, Yixuan Si, Chenglu Zhu, Xuan Gong, Kai Zhang, Pingyi Chen, Y e Zhang, Zhongyi Shui, Tao Lin, and Lin Y ang. Cpath-omni: A unified multimodal foundation model for patch and whole slide image analysis in computational pathology. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10360–10371, 2025a. Y uxu...

  14. [14]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Y u, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805,

  15. [15]

    Chatcad: Interac- tive computer-aided diagnosis on medical image using large language models

    Sheng Wang, Zihao Zhao, Xi Ouyang, Qian Wang, and Dinggang Shen. Chatcad: Interac- tive computer-aided diagnosis on medical image using large language models. arXiv preprint arXiv:2302.07257,

  16. [16]

    Qwen3 Technical Report

    An Y ang, Anfeng Li, Baosong Y ang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Y u, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388,

  17. [17]

    Ypathrag: A retrieval-augmented generation framework and bench- mark for pathology

    Deshui Y u, Yizhi Wang, Saihui Jin, Taojie Zhu, Fanyi Zeng, Wen Qian, Zirui Huang, Jingli Ouyang, Jiameng Li, Zhen Song, et al. Ypathrag: A retrieval-augmented generation framework and bench- mark for pathology. arXiv preprint arXiv:2510.08603,

  18. [18]

    join PathNavigate reproduced scan-search- readout Patho-R1-7B ( Zhang et al. , 2026b) / Qwen3.5- 4B (Y ang et al., 2025), Appendix B.1 Pretrained MLLM rows in the main tables include closed-source API endpoints and local Qwen thumbnail runs, all using the same single-thumbnail (Slide-T) visual input. The closed-source rows are identified by the exact endpo...