arxiv: 2605.30716 · v1 · pith:3S4TSQ7Rnew · submitted 2026-05-29 · 💻 cs.CV · cs.AI

Simple Token-Efficient Vision-Language Model for Case-level Pathology Synoptic Report Generation

Pith reviewed 2026-06-28 23:13 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords pathology report generation · vision-language model · whole-slide images · token efficiency · synoptic reports · case-level reasoning · multi-WSI · patch encoding

The pith

A simple vision-language model generates case-level pathology reports from multiple whole-slide images using low-magnification patches and half a GPU.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a minimal vision-language model that produces structured synoptic reports for entire pathology cases from multiple gigapixel whole-slide images (WSIs). It freezes a pathology patch encoder, adds a lightweight two-layer MLP aligner, and uses an LLM decoder with an explicit marker token that separates the slides within one input sequence. Training runs in two supervised stages: first aligner-only WSI captioning on heterogeneous WSI-text pairs, then case-level fine-tuning on case-report pairs. Each slide is represented by 512×512 patches at 5× magnification, which shortens the average sequence by up to 64× relative to the commonly used 20× patches. This design makes training practical under tight memory limits, yields high ROUGE-L, METEOR, and BLEU-4 scores, and produces reports that are preferred over strong baselines in AI-based evaluations. A reader would care because the approach lowers the hardware barrier for case-level reasoning over heterogeneous tissues without requiring full high-magnification token streams.
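To make the aligner concrete, here is a minimal pure-Python sketch of a two-layer MLP that maps one patch embedding into the LLM embedding space. The dimensions, the initialization, and the GELU activation are assumptions for illustration (the paper cites GELU, but this page does not state where it is used); `TwoLayerAligner` and its helpers are hypothetical names, not the authors' code.

```python
import math
import random


def gelu(x: float) -> float:
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(vec, weight, bias):
    # weight has shape [out_dim][in_dim]; returns a list of length out_dim.
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def init_linear(in_dim, out_dim, rng):
    # Small uniform initialization (an assumption, chosen only for the demo).
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    return weight, [0.0] * out_dim


class TwoLayerAligner:
    """Hypothetical aligner: Linear -> GELU -> Linear."""

    def __init__(self, vision_dim, hidden_dim, llm_dim, seed=0):
        rng = random.Random(seed)
        self.w1, self.b1 = init_linear(vision_dim, hidden_dim, rng)
        self.w2, self.b2 = init_linear(hidden_dim, llm_dim, rng)

    def __call__(self, patch_embedding):
        hidden = [gelu(h) for h in linear(patch_embedding, self.w1, self.b1)]
        return linear(hidden, self.w2, self.b2)


# Toy sizes; real encoders and LLMs use far larger dimensions.
aligner = TwoLayerAligner(vision_dim=8, hidden_dim=16, llm_dim=12)
visual_token = aligner([0.1] * 8)
print(len(visual_token))  # 12
```

Each patch embedding becomes exactly one visual token here, which is why the number of patches per slide directly sets the sequence length seen by the decoder.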

Core claim

The model uses a frozen pathology patch encoder, a lightweight two-layer MLP vision-language aligner, and a large language model decoder together with an explicit WSI marker token to separate slides. Representing each slide with 512×512 patches at 5× magnification reduces average sequence length by up to 64× relative to the commonly used 20× patches. Two-stage training first aligns the vision-language components on WSI captioning pairs and then performs case-level supervised fine-tuning on case-report pairs, yielding high ROUGE-L, METEOR, and BLEU-4 scores while training on only half an NVIDIA H100 GPU and producing outputs preferred over strong baselines in AI-based evaluations.
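The "up to 64×" figure can be checked with simple arithmetic. The sketch below assumes the 20× baseline uses 256×256 patches, which this page does not state; under that assumption the reduction is exactly 64×, and with a 512×512 baseline it would be 16×. The function name is hypothetical.

```python
def patch_count_ratio(base_mag, base_patch, new_mag, new_patch):
    """How many baseline patches cover the tissue area of one new patch.

    The physical side length of a patch is proportional to
    patch_pixels / magnification, so the count ratio is the squared
    ratio of side lengths.
    """
    # Written with integer products so the toy cases below are exact.
    side_ratio = (new_patch * base_mag) / (base_patch * new_mag)
    return side_ratio ** 2


# Assumed baseline: 256x256 patches at 20x (not stated in the source).
print(patch_count_ratio(20, 256, 5, 512))  # 64.0
# Alternative baseline: 512x512 patches at 20x.
print(patch_count_ratio(20, 512, 5, 512))  # 16.0
```

The factor splits into a 16× gain from the 4× lower magnification and, under the assumed baseline, a further 4× from the larger patch size.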

What carries the argument

The explicit WSI marker token that separates multiple slides within a single input sequence, allowing the decoder to perform case-level reasoning over heterogeneous multi-WSI inputs.
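A minimal sketch of how a marker token could delimit slides in one packed sequence follows. The marker string `<wsi>`, its placement before each slide, and the ordering of visual tokens ahead of the text prompt are all assumptions for illustration; the strings stand in for embeddings.

```python
WSI_MARKER = "<wsi>"  # hypothetical marker string


def pack_case(slides, prompt_tokens):
    """Pack several slides into one sequence with a marker before each slide.

    slides: list of lists, one inner list of visual tokens per WSI.
    Returns the packed sequence and a parallel list of slide indices
    (-1 for text prompt positions).
    """
    sequence, slide_ids = [], []
    for slide_index, visual_tokens in enumerate(slides):
        sequence.append(WSI_MARKER)
        slide_ids.append(slide_index)
        sequence.extend(visual_tokens)
        slide_ids.extend([slide_index] * len(visual_tokens))
    sequence.extend(prompt_tokens)
    slide_ids.extend([-1] * len(prompt_tokens))
    return sequence, slide_ids


case = [["a0", "a1", "a2"], ["b0", "b1"]]
sequence, slide_ids = pack_case(case, ["Generate", "report"])
print(sequence)
# ['<wsi>', 'a0', 'a1', 'a2', '<wsi>', 'b0', 'b1', 'Generate', 'report']
print(slide_ids)
# [0, 0, 0, 0, 1, 1, 1, -1, -1]
```

Without the marker, the decoder would see one undifferentiated run of patch tokens and could not tell where one slide ends and the next begins.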

If this is right

  • The two-stage training produces high ROUGE-L, METEOR, and BLEU-4 scores on case-level report generation.
  • The model is consistently preferred over strong baselines in AI-based evaluations of report quality.
  • Sequence-length reduction and efficient techniques make full training practical on only half a GPU.
  • Ablation studies identify simple design choices that improve robustness when handling multiple WSIs per case.
  • Performance-efficiency trade-offs are mapped across different patch resolutions and training configurations.
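For readers unfamiliar with ROUGE-L, the sketch below computes a simplified version: an F1 score over the longest common subsequence (LCS) of whitespace tokens. The original metric uses a weighted F-measure and standard packages apply their own tokenization, so this is an illustration of the idea rather than the scoring used in the paper. The example sentences are invented.

```python
def lcs_length(a, b):
    # Classic dynamic program over two token lists, kept to two rows.
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """Simplified ROUGE-L: F1 over the LCS of lowercase whitespace tokens."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("invasive ductal carcinoma grade 2",
                   "invasive carcinoma grade 2")
print(round(score, 4))  # 0.8889
```

Because the score rewards shared word order rather than clinical correctness, a report can score well while misstating a field, which is one reason the referee asks for pathologist review.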

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same low-magnification patch strategy and marker token could be tested on other high-resolution medical imaging tasks such as radiology report generation to achieve similar token savings.
  • The two-stage alignment-then-fine-tuning recipe may transfer to other vision-language settings that must process long sequences from multiple sources.
  • If the generated reports match expert quality on human review, the approach could support draft report creation in pathology labs with modest hardware.
  • Releasing the model and training code as a reproducible baseline would allow direct measurement of gains from future encoder or decoder improvements.

Load-bearing premise

That 512 by 512 patches extracted at 5 times magnification retain enough diagnostic detail for accurate case-level synoptic report generation across heterogeneous tissues and ambiguous findings.

What would settle it

A controlled comparison in which pathologists review the same cases at higher magnification and identify diagnostic details systematically absent from reports generated solely from the 5 times patches.

Figures

Figures reproduced from arXiv: 2605.30716 by Jiahao Cheng, Mahdi S. Hosseini, Vincent Quoc-Huy Trinh, Zhiyuan Yang.

Figure 1: Standard patch-based WSI processing workflow in computational pathology. Gigapixel WSIs are tiled into image patches, encoded by a patch encoder, and either used for patch-level prediction or aggregated into slide-level representations. The large number of patches extracted from each WSI creates a major memory and computation bottleneck for slide- and case-level modeling.
Figure 2: Our model follows the LLaVA-like [15] design with three major components: 1) a frozen patch-level encoder that embeds individual WSI patches into visual feature embeddings, 2) a two-layer MLP that projects visual feature embeddings to the language token embedding space, and 3) a LLM that takes in a sequence of concatenated visual token embeddings with text tokens for report generation. When constructing t…
Figure 3: We have two training stages. In stage 1, everything is frozen, except for the two-layer MLP for vision-language alignment. We train via the WSI captioning task on individual WSIs, which requires the model to understand the visual tokens from each WSI and generate a summary/caption for the given WSI. In stage 2, we only freeze the patch encoder and finetune both the two-layer MLP and the LLM backbone. An ad…
Figure 4: Dataset composition used in this work. The left panel shows retained WSI-text pairs in Stage 1. The right panel shows retained HISTAI cases per domain in Stage 2. Stage 2 is imbalanced, with skin and mixed-domain cases dominating the dataset. Data preprocessing on Stage 1. Stage 1 trains the aligner with paired WSI-to-text supervision from HistGen and REG2025 [5, 4]. HistGen provides TCGA-derived WSIs pair…
Figure 5: Prompts used in the two-stage training pipeline.
Figure 6: Generated examples for stage 1 single-WSI captioning task on the …
Figure 7: Generated examples for stage 1 single-WSI captioning task on the …
Figure 8: Generated examples for stage 2 case-level multi-WSI reports on the …
Figure 9: Example report from our Stage 2 baseline, HistoGPT, WSI-LLaVA, …
Figure 10: Stage 1 ablation summary across text metrics and observed runtime
Figure 11: Stage 2 ablation summary for case-level structured reporting with multi-WSI packing. The figure compares text metrics, field-level correctness, and average runtime across the baseline and B1–B4 settings. For each setting, results correspond to the checkpoint with the lowest validation loss
Original abstract

Generating clinically useful pathology reports for pathology cases from whole-slide images (WSIs) is challenging due to gigapixel resolution, long visual-token sequences, and the complexity of case-level reasoning, where a single case may contain multiple WSIs with heterogeneous tissues and ambiguous findings. We present a simple token-efficient vision-language model for case-level synoptic report generation that remains practical under constrained GPU memory. Our architecture follows a minimal three-component design: a frozen pathology patch encoder, a lightweight two-layer MLP vision-language aligner, and a large language model decoder, with an explicit WSI marker token to separate slides within a case. Training proceeds in two supervised stages: (1) aligner-only WSI captioning using heterogeneous WSI-text pairs, and (2) case-level supervised fine-tuning on case-report pairs for structured report generation. To reduce sequence length, we represent each slide using $512 \times 512$ patches at $5\times$ magnification, which reduces the average sequence length by up to $64\times$ compared to the commonly used $20\times$ patches. Combined with efficient training techniques, we enable practical training with only half an NVIDIA H100 GPU. Across both training stages, our approach achieves high ROUGE-L/METEOR/BLEU-4 scores while being substantially more efficient in memory and runtime. In AI-based evaluations, our model is consistently preferred over strong baselines. Extensive ablations characterize performance-efficiency trade-offs and identify simple choices that improve robustness in multi-WSI settings. Overall, this work provides a strong, reproducible baseline for efficient pathology report generation, lowering the barrier to multi-WSI VLM research under limited compute.
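The two-stage schedule in the abstract amounts to a freeze plan over the three components. The sketch below encodes it as plain data; the component names and helper functions are hypothetical, and the page does not say whether stage 2 updates the full LLM or only adapter weights, so "trainable" here just means "receives updates in some form".

```python
COMPONENTS = ("patch_encoder", "aligner", "llm")


def trainable_components(stage):
    # Stage 1: aligner-only WSI captioning.
    if stage == 1:
        return {"aligner"}
    # Stage 2: case-level fine-tuning of the aligner and the LLM.
    if stage == 2:
        return {"aligner", "llm"}
    raise ValueError(f"unknown stage: {stage}")


def freeze_plan(stage):
    """Map each component name to True if it is trained in this stage."""
    trainable = trainable_components(stage)
    return {name: name in trainable for name in COMPONENTS}


print(freeze_plan(1))
# {'patch_encoder': False, 'aligner': True, 'llm': False}
print(freeze_plan(2))
# {'patch_encoder': False, 'aligner': True, 'llm': True}
```

In a real training loop this plan would typically drive which parameters have gradients enabled; the patch encoder stays frozen in both stages.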

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents a simple three-component VLM (frozen pathology patch encoder, two-layer MLP aligner, LLM decoder) with an explicit WSI marker token for case-level synoptic report generation from multi-WSI inputs. It uses two-stage supervised training (aligner-only WSI captioning followed by case-level SFT) and reduces token length by extracting 512×512 patches at 5× magnification (claimed 64× reduction vs. 20×), enabling training on half an H100 GPU. The central claims are high ROUGE-L/METEOR/BLEU-4 scores, consistent AI preference over baselines, and extensive ablations on performance-efficiency trade-offs in multi-WSI settings.

Significance. If the performance claims hold with proper validation, the work supplies a practical, reproducible baseline for token-efficient multi-WSI VLMs under constrained compute, directly addressing gigapixel resolution and heterogeneous tissue challenges in pathology report generation.

major comments (2)
  1. [Abstract] Abstract (efficiency paragraph): The central claim of accurate case-level reasoning over heterogeneous tissues and ambiguous findings rests on the unvalidated assumption that 512×512 patches at 5× magnification retain sufficient diagnostic detail (e.g., nuclear pleomorphism, mitoses). No ablation versus 20× patches, no pathologist review, and no quantitative comparison of report quality at different magnifications are supplied, making this load-bearing for the reported metric scores.
  2. [Abstract] Abstract: The assertions of 'high ROUGE-L/METEOR/BLEU-4 scores' and 'consistently preferred over strong baselines' are presented without dataset sizes, case counts, train/test splits, baseline model specifications, statistical significance tests, or ablation tables. This prevents independent verification of the performance and efficiency claims.
minor comments (1)
  1. The abstract mentions 'extensive ablations' but provides no section or table references for them in the summary text; ensure all ablation results are explicitly linked to tables or figures in the main body.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments on the abstract below, agreeing where the presentation can be strengthened and outlining targeted revisions.

Point-by-point responses
  1. Referee: [Abstract] Abstract (efficiency paragraph): The central claim of accurate case-level reasoning over heterogeneous tissues and ambiguous findings rests on the unvalidated assumption that 512×512 patches at 5× magnification retain sufficient diagnostic detail (e.g., nuclear pleomorphism, mitoses). No ablation versus 20× patches, no pathologist review, and no quantitative comparison of report quality at different magnifications are supplied, making this load-bearing for the reported metric scores.

    Authors: We agree the abstract overstates the unvalidated assumption. The manuscript selects 5× for token reduction and validates via downstream report metrics and efficiency ablations, but contains no direct magnification ablation, no 20× comparison, and no pathologist review of patch diagnostic fidelity. We will revise the abstract to state the efficiency rationale explicitly, note that diagnostic sufficiency is inferred from report quality rather than direct visual validation, and add the absence of pathologist review as a limitation. revision: yes

  2. Referee: [Abstract] Abstract: The assertions of 'high ROUGE-L/METEOR/BLEU-4 scores' and 'consistently preferred over strong baselines' are presented without dataset sizes, case counts, train/test splits, baseline model specifications, statistical significance tests, or ablation tables. This prevents independent verification of the performance and efficiency claims.

    Authors: The abstract is a high-level summary; the full manuscript reports dataset sizes, case counts, splits, baseline details, and ablation tables in the Methods and Experiments sections. We will revise the abstract to include brief quantitative context (e.g., number of cases and WSIs) and add a pointer to the results tables. We will also incorporate statistical significance measures for the preference and metric comparisons in the revised results. revision: partial
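One simple way to attach significance to pairwise preference results, as the second response proposes, is an exact two-sided sign test. The sketch below is an illustration of that idea only: the choice of test is an assumption, the win and loss counts are invented, and nothing here reflects the paper's actual evaluation protocol.

```python
from math import comb


def sign_test_p_value(wins, losses):
    """Exact two-sided binomial test against p = 0.5, with ties dropped."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = max(wins, losses)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts: the model is preferred in 9 of 10 non-tied cases.
print(sign_test_p_value(9, 1))  # 0.021484375
# An even split gives no evidence of a preference.
print(sign_test_p_value(5, 5))  # 1.0
```

A test like this only addresses whether the preference is distinguishable from chance; it says nothing about whether the AI judge's preferences track pathologist judgment.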

Circularity Check

0 steps flagged

No circularity; empirical results from supervised training on held-out pairs

full rationale

The paper describes a standard supervised vision-language model trained in two stages on WSI-text and case-report pairs, with performance (ROUGE-L/METEOR/BLEU-4) measured on held-out data. No equations, derivations, fitted parameters renamed as predictions, uniqueness theorems, or ansatzes appear in the provided text. The architecture (frozen encoder + MLP aligner + LLM) and efficiency choices (512x512 patches at 5x) are presented as design decisions, not as outputs derived from prior results within the paper. Results are externally falsifiable via the reported benchmarks and ablations, making the work self-contained with no reduction of claims to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that low-magnification patches suffice for report generation and that a frozen encoder plus small aligner can be aligned to text via standard supervised losses; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption: A frozen pathology patch encoder preserves sufficient visual features for downstream case-level report generation without further adaptation.
    The architecture description states the encoder remains frozen throughout both training stages.
  • domain assumption: 512×512 patches at 5× magnification retain diagnostic information adequate for synoptic report generation despite reduced resolution.
    The efficiency paragraph explicitly chooses this magnification and patch size to achieve the 64× sequence-length reduction.

pith-pipeline@v0.9.1-grok · 5852 in / 1408 out tokens · 18310 ms · 2026-06-28T23:13:42.465657+00:00 · methodology


Reference graph

Works this paper leans on

27 extracted references · 12 canonical work pages · 3 internal anchors

  1. [1] Alagha, A., Leclerc, C., Kotp, Y., Metwally, O., Moras, C., Rentopoulos, P., Rostami, G., Nguyen, B.N., Baig, J., Khellaf, A., Trinh, V.Q.H., Mizouni, R., Otrok, H., Bentahar, J., Hosseini, M.S.: Atlaspatch: Efficient tissue detection and high-throughput patch extraction for computational pathology at scale. arXiv preprint arXiv:2602.03998 (2026)

  2. [2] Ding, J., Ma, S., Dong, L., Zhang, X., Huang, S., Wang, W., Zheng, N., Wei, F.: Longnet: Scaling transformers to 1,000,000,000 tokens (2023), https://arxiv.org/abs/2307.02486

  3. [3] Ding, T., Wagner, S.J., Song, A.H., Chen, R.J., Lu, M.Y., Zhang, A., Vaidya, A.J., Jaume, G., Shaban, M., Kim, A., et al.: A multimodal whole-slide foundation model for pathology. Nature Medicine pp. 1–13 (2025)

  4. [4] Grand Challenge: Report generation in pathology using pan-asia giga-pixel wsis (reg2025). https://reg2025.grand-challenge.org/ (2025), MICCAI Registered Challenge. DOI: 10.5281/zenodo.15081613 (accessed 2026-02-24)

  5. [5] Guo, Z., Ma, J., Xu, Y., Wang, Y., Wang, L., Chen, H.: HistGen: Histopathology Report Generation via Local-Global Feature Encoding and Cross-modal Context Interaction. In: Proceedings of Medical Image Computing and Computer Assisted Intervention – MICCAI 2024. vol. LNCS 15004. Springer Nature Switzerland (October 2024)

  6. [6] He, H., Hosseini, M.S., Wang, Y.: Pathttt: Test-time training with meta-auxiliary learning for pathology image classification. In: Oguz, I., Zhang, S., Metaxas, D.N. (eds.) Information Processing in Medical Imaging. pp. 33–46. Springer Nature Switzerland, Cham (2026)

  7. [7] Hendrycks, D., Gimpel, K.: Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415 (2016)

  8. [8] Hosseini, M.S., Bejnordi, B.E., Trinh, V.Q.H., Chan, L., Hasan, D., Li, X., Yang, S., Kim, T., Zhang, H., Wu, T., Chinniah, K., Maghsoudlou, S., Zhang, R., Zhu, J., Khaki, S., Buin, A., Chaji, F., Salehi, A., Nguyen, B.N., Samaras, D., Plataniotis, K.N.: Computational pathology: A survey review and the way forward. Journal of Pathology Informatics 15, ...

  9. [9] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-rank adaptation of large language models. In: International Conference on Learning Representations (2022), https://openreview.net/forum?id=nZeVKeeFYf9

  10. [10] Lavie, A., Agarwal, A.: METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments. In: Callison-Burch, C., Koehn, P., Fordyce, C.S., Monz, C. (eds.) Proceedings of the Second Workshop on Statistical Machine Translation. pp. 228–231. Association for Computational Linguistics, Prague, Czech Republic (Jun 2007), htt...

  11. [11] Leviathan, Y., Kalman, M., Matias, Y.: Prompt repetition improves non-reasoning llms (2025), https://arxiv.org/abs/2512.14982

  12. [12] Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In: ICML (2023)

  13. [13] Liang, Y., Lyu, X., Chen, W., Ding, M., Zhang, J., He, X., Wu, S., Xing, X., Yang, S., Wang, X., Shen, L.: Wsi-llava: A multimodal large language model for whole slide image. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 22718–22727 (October 2025)

  14. [14] Lin, C.Y.: ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out. pp. 74–81. Association for Computational Linguistics, Barcelona, Spain (Jul 2004), https://aclanthology.org/W04-1013/

  15. [15] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning (2023)

  16. [16] Lu, M.Y., Chen, B., Williamson, D.F., Chen, R.J., Liang, I., Ding, T., Jaume, G., Odintsov, I., Le, L.P., Gerber, G., et al.: A visual-language foundation model for computational pathology. Nature Medicine 30, 863–874 (2024)

  17. [17] Nechaev, D., Pchelnikov, A., Ivanova, E.: Histai: An open-source, large-scale whole slide image dataset for computational pathology (2025), https://arxiv.org/abs/2505.12120

  18. [18] OpenAI: Chatgpt. https://openai.com/chatgpt/ (2026), AI chatbot

  19. [19] Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic evaluation of machine translation. In: Isabelle, P., Charniak, E., Lin, D. (eds.) Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. pp. 311–318. Association for Computational Linguistics, Philadelphia, Pennsylvania, USA (Jul 2002). https://do...

  20. [20] Press, O., Smith, N.A., Lewis, M.: Train short, test long: Attention with linear biases enables input length extrapolation. arXiv preprint arXiv:2108.12409 (2021)

  21. [21] Qwen, Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., Lin, H., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., Lu, K., Bao, K., Yang, K., Yu, L., Li, M., Xue, M., Zhang, P., Zhu, Q., Men, R., Lin, R., Li, T., Tang, T., Xia, T., Ren, X., Ren, X., Fan, Y., Su, Y., Zhang, Y., Wa...

  22. [22] Shaikovski, G., Casson, A., Severson, K., Zimmermann, E., Wang, Y.K., Kunz, J.D., Retamero, J.A., Oakley, G., Klimstra, D., Kanan, C., et al.: Prism: A multimodal generative foundation model for slide-level histopathology. arXiv preprint arXiv:2405.10254 (2024)

  23. [23] Sharma, V., Alagha, A., Khellaf, A., Trinh, V.Q.H., Hosseini, M.S.: Investigating zero-shot diagnostic pathology in vision-language models with efficient prompt design. In: Hadjali, A., Maiorana, E., Gusikhin, O., Sansone, C. (eds.) Deep Learning Theory and Applications. pp. 263–279. Springer Nature Switzerland, Cham (2025)

  24. [24] Tomczak, K., Czerwińska, P., Wiznerowicz, M.: The cancer genome atlas (tcga): an immeasurable source of knowledge. Contemporary Oncology 19, A68–A77 (2015), https://api.semanticscholar.org/CorpusID:12829250

  25. [25] Tran, M., Schmidle, P., Guo, R.R., Wagner, S.J., Koch, V., Lupperger, V., Novotny, B., Murphree, D.H., Hardway, H.D., D’Amato, M., Lefkes, J., Geijs, D.J., Feuchtinger, A., Böhner, A., Kaczmarczyk, R., Biedermann, T., Amir, A.L., Mooyaart, A.L., Ciompi, F., Litjens, G., Wang, C., Comfere, N.I., Eyerich, K., Braun, S.A., Marr, C., Peng, T.: Generating derm...

  26. [26] Xu, H., Usuyama, N., Bagga, J., Zhang, S., Rao, R., Naumann, T., Wong, C., Gero, Z., González, J., Gu, Y., Xu, Y., Wei, M., Wang, W., Ma, S., Wei, F., Yang, J., Li, C., Gao, J., Rosemon, J., Bower, T., Lee, S., Weerasinghe, R., Wright, B.J., Robicsek, A., Piening, B., Bifulco, C., Wang, S., Poon, H.: A whole-slide foundation model for digital pathology from...

  27. [27] Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., Artzi, Y.: Bertscore: Evaluating text generation with bert. In: International Conference on Learning Representations (2020), https://openreview.net/forum?id=SkeHuCVFDr