pith. sign in

arxiv: 2606.01543 · v1 · pith:P5ICPQYLnew · submitted 2026-06-01 · 💻 cs.CV

PathAR: Structure-First Autoregressive Synthesis of Multimodal Pathology Images

Pith reviewed 2026-06-28 15:48 UTC · model grok-4.3

classification 💻 cs.CV
keywords pathology image synthesisautoregressive modelingmultimodal generationstructure factorizationvector quantizationmedical image generationdual tokenizer
0
0 comments X

The pith

PathAR explicitly factorizes structure and appearance tokens to generate anatomically coherent multimodal pathology images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes PathAR to address data scarcity in multimodal pathology by synthesizing images that preserve morphological structure while varying appearance across modalities. It claims that current methods implicitly couple structure with appearance in a single token stream, reducing controllability when modalities shift. PathAR instead uses a dual vector quantization tokenizer to separate mask-grounded structure tokens from appearance tokens, then applies an interleaved autoregressive transformer with asymmetric attention to enforce structure-first generation conditioned on modality labels. A sympathetic reader would care because reliable synthetic data with preserved anatomy could support training of downstream models like segmenters in low-data settings without introducing structural artifacts.

Core claim

PathAR employs a dual vector quantization tokenizer to decompose samples into mask-grounded structure and appearance tokens, and an interleaved autoregressive transformer with asymmetric attention visibility to enforce structure-to-appearance dependence, stabilizing morphology under heterogeneous modality-specific appearances and enabling spatially aligned image-mask pair generation.

What carries the argument

Dual-VQ tokenizer that produces separate structure and appearance tokens, combined with an IAR transformer using asymmetric attention to enforce structure-to-appearance dependence.

If this is right

  • PathAR improves structural consistency and modality fidelity over baselines.
  • The method maintains sample diversity while supporting spatially aligned outputs.
  • It enables downstream segmentation training in data-scarce regimes.
  • The framework extends to finer-grained intra-modality organ-label variation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The structure-first approach could apply to other medical imaging domains where anatomical features remain stable across different acquisition methods.
  • Conditional generation on modality labels might allow efficient adaptation without full model retraining for new imaging protocols.
  • Providing explicit structure tokens could support controllable editing of pathology images for augmentation tasks.

Load-bearing premise

Morphological structures such as cellular topology and tissue boundaries are largely preserved across acquisition protocols, allowing explicit factorization of structure and appearance without loss of anatomical coherence.

What would settle it

Generating images from a modality pair where cellular topology or tissue boundaries visibly differ would produce misaligned masks or anatomically incoherent outputs if the factorization premise fails.

Figures

Figures reproduced from arXiv: 2606.01543 by Feng Chen, Guanyu Yang, Huazhu Fu, Jiahao Xia, Junzhang Huang, Meng Wang, Yuan Zhang.

Figure 1
Figure 1. Figure 1: Motivation of PathAR for pathology image generation conditioned on the modality label. (a) Entangled stochastic generation can match modality-specific appearance but often disrupts morphology under multimodal heterogeneity. (b) PathAR uses factorized generation: it splits tokens into structure and appearance streams and enforces a structure-first autoregressive dependency to better maintain morphological c… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of PathAR. Stage 1: Dual-VQ factorizes pathological images into structure and appearance token streams, where the former is anchored by mask-guided reconstruction to yield discrete, semantically rich tokens. Stage 2: The interleaved autoregressive transformer (IAR) models the modality￾conditional joint token distribution via next-token prediction under asymmetric attention visibility, enforcing st… view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative comparison of multimodal pathology generation at 256 × 256. Top: (L+M)2I. Bottom: L2I. Generally, (L+M)2I stabilizes structure but may distort appearance across modalities, whereas L2I preserves modality appearance but often degrades structure without spatial anchors. PathAR (ours) maintains both structure and appearance in L2I and additionally generates an auxiliary mask. Representative defect… view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative comparison on PanNuke. Generated samples from representative methods are shown across the 19 organ labels of PanNuke. At inference, (L+M)2I methods are given the organ label and the ground-truth mask to generate an image, whereas L2I methods are given only the organ label to generate an image. PathAR maintains stable cellular morphology and fine-grained histological texture across diverse categ… view at source ↗
Figure 3
Figure 3. Figure 3: Visualization of original a d idd b Dual-VQ reconstruction examples. [PITH_FULL_IMAGE:figures/full_fig_p010_3.png] view at source ↗
Figure 6
Figure 6. Figure 6: Cross-sample swapping of structure and appearance tokens across modalities. Each row presents one token-swapping example constructed from two real samples of different modalities. (a) and (b) denote two real images indexed by p and q, which are encoded into structure–appearance token pairs (Sp, Ap) and (Sq, Aq), respectively, where S and A denote the structure and appearance token streams. After token swap… view at source ↗
Figure 7
Figure 7. Figure 7: t-SNE analysis of generated images and sampling hyperparameters. (a–d) Feature t-SNE. t-SNE visualizations of UNI embeddings for real and generated images. (a–c) Per-modality alignment: PathAR (orange) closely overlaps with Real (green) distributions. (d) Joint embedding of PathAR samples demonstrates distinct inter-modality separation and rich intra-modality diversity. (e–f) Hyperparameter tuning. Impact … view at source ↗
read the original abstract

Data scarcity in multimodal pathology motivates unified generative models that synthesize modality-specific appearance while preserving anatomically coherent structure. Although modalities differ in appearance statistics, morphological structures such as cellular topology and tissue boundaries are largely preserved across acquisition protocols. However, existing methods often model these factors within a homogeneous token stream, implicitly coupling structure with appearance and weakening structural controllability under modality shifts. To address this, we propose pathology Autorgressive modeling (PathAR), a structure-first autoregressive synthesis framework that explicitly factorizes structure and appearance for modality-label-conditioned pathology generation.PathAR employs a dual vector quantization (Dual-VQ) tokenizer to decompose samples into mask-grounded structure and appearance tokens, and an interleaved autoregressive (IAR) transformer with asymmetric attention visibility to enforce structure-to-appearance dependence. PathAR stabilizes morphology under heterogeneous modality-specific appearances and enables spatially aligned image--mask pair generation. Extensive experiments show that PathAR improves structural consistency and modality fidelity over baselines, maintains sample diversity, supports downstream segmentation in data-scarce regimes, and demonstrates extensibility to finer-grained intra-modality organ-label variation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes PathAR, a structure-first autoregressive framework for synthesizing modality-specific pathology images while preserving anatomical coherence. It introduces a Dual-VQ tokenizer that decomposes inputs into mask-grounded structure tokens and appearance tokens, paired with an interleaved autoregressive (IAR) transformer using asymmetric attention to enforce strict structure-to-appearance ordering. The approach is motivated by data scarcity in multimodal pathology and claims to stabilize morphology under heterogeneous appearances, enable spatially aligned image-mask generation, and improve structural consistency, modality fidelity, sample diversity, and downstream segmentation performance over baselines.

Significance. If the empirical results hold, the explicit factorization of structure and appearance via Dual-VQ and ordered autoregressive modeling represents a targeted advance for controllable generation in medical imaging. The ability to produce aligned multimodal pairs could directly aid data augmentation in data-scarce regimes and support tasks such as segmentation. The paper positions its results as direct validation of the factorization without loss of coherence.

major comments (1)
  1. [Abstract / Introduction] The central claim that Dual-VQ produces mask-grounded structure tokens that remain anatomically coherent across modality shifts rests on the precondition that morphological structures (cellular topology, tissue boundaries) are largely preserved across acquisition protocols. This assumption is stated in the abstract but requires a concrete quantitative test (e.g., cross-modality mask overlap or topology metrics on paired samples) to confirm it does not introduce coherence loss; without such evidence the factorization benefit cannot be isolated from the assumption.
minor comments (1)
  1. [Abstract] The abstract refers to 'extensive experiments' and 'improved structural consistency' without naming datasets, metrics (e.g., FID, Dice, structural similarity), or baseline methods; adding one or two quantitative highlights would improve readability.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback. Below we provide a point-by-point response to the single major comment.

read point-by-point responses
  1. Referee: [Abstract / Introduction] The central claim that Dual-VQ produces mask-grounded structure tokens that remain anatomically coherent across modality shifts rests on the precondition that morphological structures (cellular topology, tissue boundaries) are largely preserved across acquisition protocols. This assumption is stated in the abstract but requires a concrete quantitative test (e.g., cross-modality mask overlap or topology metrics on paired samples) to confirm it does not introduce coherence loss; without such evidence the factorization benefit cannot be isolated from the assumption.

    Authors: We agree that a direct quantitative validation of cross-modality structural preservation would strengthen the presentation. The assumption is standard in the pathology literature because different modalities (e.g., H&E versus IHC) target the same underlying cellular and tissue morphology. Our Dual-VQ explicitly grounds structure tokens in segmentation masks that are intended to be modality-agnostic, and the reported gains in structural consistency metrics plus downstream segmentation performance provide indirect empirical support for the factorization. However, the training datasets used in the manuscript are unpaired across modalities, precluding the exact overlap or topology metrics suggested. We will add a dedicated paragraph in the revised introduction and methods sections that (i) cites supporting pathology references for the assumption and (ii) explicitly notes the lack of paired cross-modality evaluation as a limitation. This change will better isolate the contribution of Dual-VQ while remaining faithful to the available data. revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces PathAR as a structure-first autoregressive model that uses a Dual-VQ tokenizer to produce mask-grounded structure tokens and appearance tokens, followed by an IAR transformer with asymmetric attention to enforce ordering. No equations, fitted parameters, or self-citations are shown in the abstract or description that reduce any claimed prediction or uniqueness result to the inputs by construction. The factorization is presented as an explicit design choice whose success is evaluated empirically on structural consistency and modality fidelity, rather than assumed via prior self-referential theorems or renaming of known patterns. The morphological preservation assumption is acknowledged as an empirical precondition, not a derived axiom. The derivation chain is therefore self-contained against external benchmarks with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities; all such elements are unknown.

pith-pipeline@v0.9.1-grok · 5733 in / 990 out tokens · 27819 ms · 2026-06-28T15:48:56.854742+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

54 extracted references · 12 canonical work pages · 8 internal anchors

  1. [1]

    Synthetic data and its utility in pathology and laboratory medicine,

    J. Pantanowitz, C. D. Manko, L. Pantanowitz, and H. H. Rashidi, “Synthetic data and its utility in pathology and laboratory medicine,” Laboratory Investigation, vol. 104, no. 8, p. 102095, 2024

  2. [2]

    Content generation models in computational pathology: A comprehen- sive survey on methods, applications, and challenges,

    Y . Zhang, X. Zhang, X. Qi, X. Wu, F. Chen, G. Yang, and H. Fu, “Content generation models in computational pathology: A comprehen- sive survey on methods, applications, and challenges,”IEEE Reviews in Biomedical Engineering, pp. 1–22, 2025

  3. [3]

    Clusterseg: A crowd cluster pinpointed nucleus segmentation framework with cross-modality datasets,

    J. Ke, Y . Lu, Y . Shen, J. Zhu, Y . Zhou, J. Huang, J. Yao, X. Liang, Y . Guo, Z. Weiet al., “Clusterseg: A crowd cluster pinpointed nucleus segmentation framework with cross-modality datasets,”Medical Image Analysis, vol. 85, p. 102758, 2023

  4. [4]

    Structure- preserving color normalization and sparse stain separation for histologi- cal images,

    A. Vahadane, T. Peng, A. Sethi, S. Albarqouni, L. Wang, M. Baust, K. Steiger, A. M. Schlitter, I. Esposito, and N. Navab, “Structure- preserving color normalization and sparse stain separation for histologi- cal images,”IEEE Transactions on Medical Imaging, vol. 35, no. 8, pp. 1962–1971, 2016

  5. [5]

    Fluorescence confocal microscopy for pathologists,

    M. Ragazzi, S. Piana, C. Longo, F. Castagnetti, M. Foroni, G. Ferrari, G. Gardini, and G. Pellacani, “Fluorescence confocal microscopy for pathologists,”Modern Pathology, vol. 27, no. 3, pp. 460–471, 2014

  6. [6]

    Maskfactory: Towards high-quality synthetic data generation for dichotomous image segmentation,

    H. Qian, Y . Chen, S. Lou, F. Shahbaz Khan, X. Jin, and D.-P. Fan, “Maskfactory: Towards high-quality synthetic data generation for dichotomous image segmentation,”Advances in Neural Information Processing Systems, vol. 37, pp. 66 455–66 478, 2024

  7. [7]

    Distribution matching losses can hallucinate features in medical image translation,

    J. P. Cohen, M. Luck, and S. Honari, “Distribution matching losses can hallucinate features in medical image translation,” inInternational con- ference on medical image computing and computer-assisted intervention. Springer, 2018, pp. 529–536

  8. [8]

    On hallucinations in tomographic image reconstruction,

    S. Bhadra, V . A. Kelkar, F. J. Brooks, and M. A. Anastasio, “On hallucinations in tomographic image reconstruction,”IEEE Transactions on Medical Imaging, vol. 40, no. 11, pp. 3249–3260, 2021

  9. [9]

    Imagenet-trained cnns are biased towards texture; in- creasing shape bias improves accuracy and robustness,

    R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel, “Imagenet-trained cnns are biased towards texture; in- creasing shape bias improves accuracy and robustness,” inInternational conference on learning representations, 2018

  10. [10]

    GPT-4 Technical Report

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkatet al., “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023

  11. [11]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozi `ere, N. Goyal, E. Hambro, F. Azharet al., “Llama: Open and efficient foundation language models,”arXiv preprint arXiv:2302.13971, 2023

  12. [12]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y . Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosaleet al., “Llama 2: Open foundation and fine-tuned chat models,”arXiv preprint arXiv:2307.09288, 2023

  13. [13]

    Autoregressive visual tracking,

    X. Wei, Y . Bai, Y . Zheng, D. Shi, and Y . Gong, “Autoregressive visual tracking,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 9697–9706

  14. [14]

    Autoregressive models in vision: A survey,

    J. Xiong, G. Liu, L. Huang, C. Wu, T. Wu, Y . Mu, Y . Yao, H. Shen, Z. Wan, J. Huanget al., “Autoregressive models in vision: A survey,” arXiv preprint arXiv:2411.05902, 2024

  15. [15]

    Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

    P. Sun, Y . Jiang, S. Chen, S. Zhang, B. Peng, P. Luo, and Z. Yuan, “Autoregressive model beats diffusion: Llama for scalable image gener- ation,”arXiv preprint arXiv:2406.06525, 2024

  16. [16]

    Emu3: Next-Token Prediction is All You Need

    X. Wang, X. Zhang, Z. Luo, Q. Sun, Y . Cui, J. Wang, F. Zhang, Y . Wang, Z. Li, Q. Yuet al., “Emu3: Next-token prediction is all you need,”arXiv preprint arXiv:2409.18869, 2024

  17. [17]

    Janus: Decoupling visual encoding for unified multi- modal understanding and generation,

    C. Wu, X. Chen, Z. Wu, Y . Ma, X. Liu, Z. Pan, W. Liu, Z. Xie, X. Yu, C. Ruanet al., “Janus: Decoupling visual encoding for unified multi- modal understanding and generation,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 12 966–12 977

  18. [18]

    Visual autore- gressive modeling: Scalable image generation via next-scale prediction,

    K. Tian, Y . Jiang, Z. Yuan, B. Peng, and L. Wang, “Visual autore- gressive modeling: Scalable image generation via next-scale prediction,” Advances in neural information processing systems, vol. 37, pp. 84 839– 84 865, 2024

  19. [19]

    Imagefolder: Autoregressive image generation with folded tokens,

    X. Li, K. Qiu, H. Chen, J. Kuen, J. Gu, B. Raj, and Z. Lin, “Imagefolder: Autoregressive image generation with folded tokens,” inThe Thirteenth International Conference on Learning Representations, 2025

  20. [20]

    Controlar: Controllable image generation with autoregressive models,

    Z. Li, T. Cheng, S. Chen, P. Sun, H. Shen, L. Ran, X. Chen, W. Liu, and X. Wang, “Controlar: Controllable image generation with autoregressive models,” inThe Thirteenth International Conference on Learning Rep- resentations, 2025

  21. [21]

    Nuclear morphological abnormalities in cancer: a search for unifying mechanisms,

    I. Singh and T. P. Lele, “Nuclear morphological abnormalities in cancer: a search for unifying mechanisms,” inNuclear, chromosomal, and genomic architecture in biology and medicine. Springer, 2022, pp. 443–467

  22. [22]

    Pathdiff: Histopathology image synthesis with unpaired text and mask conditions,

    M. Bhosale, A. Wasi, Y . Zhai, Y . Tian, S. Border, N. Xi, P. Sarder, J. Yuan, D. Doermann, and X. Gong, “Pathdiff: Histopathology image synthesis with unpaired text and mask conditions,” 2025. [Online]. Available: https://arxiv.org/abs/2506.23440

  23. [23]

    Nasdm: Nuclei-aware semantic histopathology image generation using diffusion models,

    A. Shrivastava and P. T. Fletcher, “Nasdm: Nuclei-aware semantic histopathology image generation using diffusion models,” ininterna- tional conference on medical image computing and computer-assisted intervention. Springer, 2023, pp. 786–796

  24. [24]

    A robust image segmentation and synthesis pipeline for histopathology,

    M. Jehanzaib, Y . Almalioglu, K. B. Ozyoruk, D. F. Williamson, T. Ab- dullah, K. Basak, D. Demir, G. E. Keles, K. Zafar, and M. Turan, “A robust image segmentation and synthesis pipeline for histopathology,” Medical Image Analysis, vol. 99, p. 103344, 2025

  25. [25]

    Accelerating histopathology workflows with generative ai- based virtually multiplexed tumour profiling,

    P. Pati, S. Karkampouna, F. Bonollo, E. Comp ´erat, M. Radi ´c, M. Spahn, A. Martinelli, M. Wartenberg, M. Kruithof-de Julio, and M. Rap- somaniki, “Accelerating histopathology workflows with generative ai- based virtually multiplexed tumour profiling,”Nature machine intelli- gence, vol. 6, no. 9, pp. 1077–1093, 2024

  26. [26]

    Virtual staining for histology by deep learning,

    L. Latonen, S. Koivukoski, U. Khan, and P. Ruusuvuori, “Virtual staining for histology by deep learning,”Trends in Biotechnology, vol. 42, no. 9, pp. 1177–1191, 2024

  27. [27]

    Deep learning- based transformation of h&e stained tissues into special stains,

    K. De Haan, Y . Zhang, J. E. Zuckerman, T. Liu, A. E. Sisk, M. F. Diaz, K.-Y . Jen, A. Nobori, S. Liou, S. Zhanget al., “Deep learning- based transformation of h&e stained tissues into special stains,”Nature communications, vol. 12, no. 1, p. 4884, 2021

  28. [28]

    Next token prediction towards multimodal in- telligence: A comprehensive survey,

    L. Chen, Z. Wang, S. Ren, L. Li, H. Zhao, Y . Li, Z. Cai, H. Guo, L. Zhang, Y . Xionget al., “Next token prediction towards multimodal in- telligence: A comprehensive survey,”arXiv preprint arXiv:2412.18619, 2024

  29. [29]

    Neural discrete representation learning,

    A. Van Den Oord, O. Vinyalset al., “Neural discrete representation learning,”Advances in neural information processing systems, vol. 30, 2017

  30. [30]

    Taming transformers for high- resolution image synthesis,

    P. Esser, R. Rombach, and B. Ommer, “Taming transformers for high- resolution image synthesis,” inProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR), June 2021, pp. 12 873–12 883

  31. [31]

    Maskgit: Masked generative image transformer,

    H. Chang, H. Zhang, L. Jiang, C. Liu, and W. T. Freeman, “Maskgit: Masked generative image transformer,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 11 315–11 325

  32. [32]

    Structure and intensity unbiased translation for 2d medical image segmentation,

    T. Zhang, S. Zheng, J. Cheng, X. Jia, J. Bartlett, X. Cheng, Z. Qiu, H. Fu, J. Liu, A. Leonardiset al., “Structure and intensity unbiased translation for 2d medical image segmentation,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 12, pp. 10 060– 10 075, 2024

  33. [33]

    Semantic image synthesis with spatially-adaptive normalization,

    T. Park, M.-Y . Liu, T.-C. Wang, and J.-Y . Zhu, “Semantic image synthesis with spatially-adaptive normalization,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 2337–2346

  34. [34]

    Efficient semantic image synthesis via class-adaptive normalization,

    Z. Tan, D. Chen, Q. Chu, M. Chai, J. Liao, M. He, L. Yuan, G. Hua, and N. Yu, “Efficient semantic image synthesis via class-adaptive normalization,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 9, pp. 4852–4866, 2021

  35. [35]

    Edge guided gans with multi-scale contrastive learning for semantic image synthesis,

    H. Tang, G. Sun, N. Sebe, L. Van Goolet al., “Edge guided gans with multi-scale contrastive learning for semantic image synthesis,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 12, pp. 14 435–14 452, 2023

  36. [36]

    Denoising diffusion probabilistic models,

    J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840– 6851, 2020

  37. [37]

    Scalable diffusion models with transformers,

    W. Peebles and S. Xie, “Scalable diffusion models with transformers,” inProceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 4195–4205

  38. [38]

    Adding conditional control to text-to-image diffusion models,

    L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” inProceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 3836–3847

  39. [39]

    T2i- adapter: Learning adapters to dig out more controllable ability for text- to-image diffusion models,

    C. Mou, X. Wang, L. Xie, Y . Wu, J. Zhang, Z. Qi, and Y . Shan, “T2i- adapter: Learning adapters to dig out more controllable ability for text- to-image diffusion models,” inProceedings of the AAAI conference on artificial intelligence, vol. 38, no. 5, 2024, pp. 4296–4304

  40. [40]

    Roformer: En- hanced transformer with rotary position embedding,

    J. Su, M. Ahmed, Y . Lu, S. Pan, W. Bo, and Y . Liu, “Roformer: En- hanced transformer with rotary position embedding,”Neurocomputing, vol. 568, p. 127063, 2024

  41. [41]

    Classifier-Free Diffusion Guidance

    J. Ho and T. Salimans, “Classifier-free diffusion guidance,” 2022. [Online]. Available: https://arxiv.org/abs/2207.12598 14

  42. [42]

    PanNuke Dataset Extension, Insights and Baselines,

    J. Gamper, N. A. Koohbanani, K. Benes, S. Graham, M. Jahanifar, S. A. Khurram, A. Azam, K. Hewitt, and N. Rajpoot, “Pannuke dataset extension, insights and baselines,”arXiv preprint arXiv:2003.10778, 2020

  43. [43]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium,

    M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,”Advances in neural information processing systems, vol. 30, 2017

  44. [44]

    Demystifying MMD GANs

    M. Bi ´nkowski, D. J. Sutherland, M. Arbel, and A. Gretton, “Demysti- fying mmd gans,”arXiv preprint arXiv:1801.01401, 2018

  45. [45]

    DINOv2: Learning Robust Visual Features without Supervision

    M. Oquab, T. Darcet, T. Moutakanni, H. V o, M. Szafraniec, V . Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Noubyet al., “Dinov2: Learning robust visual features without supervision,”arXiv preprint arXiv:2304.07193, 2023

  46. [46]

    Towards a general-purpose foundation model for computational pathology,

    R. J. Chen, T. Ding, M. Y . Lu, D. F. Williamson, G. Jaume, A. H. Song, B. Chen, A. Zhang, D. Shao, M. Shabanet al., “Towards a general-purpose foundation model for computational pathology,”Nature medicine, vol. 30, no. 3, pp. 850–862, 2024

  47. [47]

    Measures of the amount of ecologic association between species,

    L. R. Dice, “Measures of the amount of ecologic association between species,”Ecology, vol. 26, no. 3, pp. 297–302, 1945

  48. [48]

    The unreasonable effectiveness of deep features as a perceptual metric,

    R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 586–595

  49. [49]

    Im- proved precision and recall metric for assessing generative models,

    T. Kynk ¨a¨anniemi, T. Karras, S. Laine, J. Lehtinen, and T. Aila, “Im- proved precision and recall metric for assessing generative models,” Advances in neural information processing systems, vol. 32, 2019

  50. [50]

    Diffinfinite: Large mask-image synthesis via parallel random patch diffusion in histopathology,

    M. Aversa, G. Nobis, M. H ¨agele, K. Standvoss, M. Chirica, R. Murray- Smith, A. M. Alaa, L. Ruff, D. Ivanova, W. Sameket al., “Diffinfinite: Large mask-image synthesis via parallel random patch diffusion in histopathology,”Advances in Neural Information Processing Systems, vol. 36, pp. 78 126–78 141, 2023

  51. [51]

    Training generative adversarial networks with limited data,

    T. Karras, M. Aittala, J. Hellsten, S. Laine, J. Lehtinen, and T. Aila, “Training generative adversarial networks with limited data,”Advances in neural information processing systems, vol. 33, pp. 12 104–12 114, 2020

  52. [52]

    High- resolution image synthesis with latent diffusion models,

    R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High- resolution image synthesis with latent diffusion models,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10 684–10 695

  53. [53]

    Efficientnet: Rethinking model scaling for con- volutional neural networks,

    M. Tan and Q. Le, “Efficientnet: Rethinking model scaling for con- volutional neural networks,” inInternational conference on machine learning. PMLR, 2019, pp. 6105–6114

  54. [54]

    nnu-net: a self-configuring method for deep learning-based biomedical image segmentation,

    F. Isensee, P. F. Jaeger, S. A. Kohl, J. Petersen, and K. H. Maier-Hein, “nnu-net: a self-configuring method for deep learning-based biomedical image segmentation,”Nature methods, vol. 18, no. 2, pp. 203–211, 2021. VI. BIOGRAPHYSECTION Yuan ZhangYuan Zhang received the B.S. degree in 2019 from Northwestern Polytechnical University, Xi’an, China, and the M...