
arxiv: 2606.22002 · v1 · pith:RGV7AO2Nnew · submitted 2026-06-20 · 💻 cs.CV · cs.LG

One-Shot Data Selection for Medical Image Classification via Graph Coverage

Pith reviewed 2026-06-26 12:47 UTC · model grok-4.3

keywords data selection · medical image classification · one-shot selection · k-nearest neighbor graph · heat diffusion kernel · facility location · class imbalance · balanced accuracy

The pith

A two-term coverage kernel on k-NN graphs from frozen embeddings selects subsets that maximize manifold coverage and raise balanced accuracy on medical image tasks without any training during selection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that one-shot data selection for medical image classification can be performed by building a global k-nearest neighbor graph on pretrained embeddings and then applying a sparse two-term kernel derived from heat diffusion to guide greedy facility-location selection. This approach is meant to capture both direct and two-hop neighborhood relations across the entire dataset, including cross-class structure that per-class methods overlook. A reader would care because expert annotation of medical images is costly, so a method that picks high-value subsets in a single pass without repeated model training could cut labeling budgets while preserving classifier performance, especially on imbalanced data.

Core claim

Given embeddings from a pretrained encoder, the method constructs a k-nearest neighbor graph over all training samples, derives a two-term coverage kernel from the heat diffusion kernel that approximates full spectral behavior through sparse matrix operations, and uses greedy facility location on this kernel to choose class-balanced subsets that maximize coverage of the data manifold; the resulting subsets produce the highest balanced accuracy on nine of ten dataset-ratio conditions across five MedMNIST datasets, with the largest gains on class-imbalanced collections.
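The three stages named above (k-NN graph, two-term kernel, greedy facility location) can be sketched as follows. This is a minimal, dependency-free illustration, not the authors' implementation. The kernel form `K = S + beta * S^2` with `S` the symmetrically normalized adjacency, the name `beta`, and the self-coverage value `1 + beta` on the diagonal are all assumptions made for this sketch. It uses dense lists where the paper describes sparse matrix operations, and it leaves out class balancing because this summary does not say how the balancing is done.

```python
import math

def knn_graph(points, k):
    """Symmetric, unweighted k-NN adjacency as a dense list of lists."""
    n = len(points)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Rank every other point by Euclidean distance to point i.
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: math.dist(points[i], points[j]))
        for j in order[:k]:
            adj[i][j] = adj[j][i] = 1.0  # keep the edge if either endpoint picks it
    return adj

def two_term_kernel(adj, beta):
    """Assumed form: K = S + beta * S^2 (direct plus two-hop similarity)."""
    n = len(adj)
    deg = [sum(row) or 1.0 for row in adj]
    s = [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    s2 = [[sum(s[i][m] * s[m][j] for m in range(n)) for j in range(n)] for i in range(n)]
    kern = [[s[i][j] + beta * s2[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        kern[i][i] = 1.0 + beta  # assumption: a selected sample covers itself maximally
    return kern

def greedy_facility_location(kern, budget):
    """Greedily maximize sum_i max_{j in selected} kern[i][j]."""
    n = len(kern)
    best = [0.0] * n  # coverage each sample currently receives
    selected = []
    for _ in range(budget):
        gains = [
            -1.0 if j in selected else
            sum(max(kern[i][j] - best[i], 0.0) for i in range(n))
            for j in range(n)
        ]
        j_star = max(range(n), key=gains.__getitem__)
        selected.append(j_star)
        best = [max(best[i], kern[i][j_star]) for i in range(n)]
    return selected

# Toy "embeddings": two well-separated clusters of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
kernel = two_term_kernel(knn_graph(points, k=2), beta=0.5)
selected = greedy_facility_location(kernel, budget=2)
print(selected)  # one index from each cluster
```

With a budget of two, the second pick lands in the cluster the first pick does not cover, which is the coverage behavior the facility-location objective is meant to produce.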

What carries the argument

The two-term coverage kernel on the k-NN graph, which encodes direct and two-hop neighborhood relationships for facility-location selection.
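One way to see why truncating a heat kernel yields exactly a direct term and a two-hop term is the series expansion below. It assumes the normalized Laplacian $L = I - S$ with $S = D^{-1/2} A D^{-1/2}$ and a diffusion time $t$; these symbols and the choice of normalization are assumptions for illustration, since this summary does not state the paper's exact definitions.

```latex
\begin{aligned}
H_t &= e^{-tL} = e^{-t(I - S)} = e^{-t}\, e^{tS}
     = e^{-t} \sum_{m=0}^{\infty} \frac{t^{m}}{m!}\, S^{m}, \\
H_t &\approx e^{-t}\Bigl( I + t\,S + \tfrac{t^{2}}{2}\, S^{2} \Bigr).
\end{aligned}
```

Here $S$ is nonzero only on graph edges (direct neighbors) and $S^{2}$ is nonzero only between nodes joined by a walk of length two, so keeping the two non-identity terms gives a sparse kernel proportional to $S + \tfrac{t}{2} S^{2}$ governed by a single scalar.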

If this is right

  • The method requires no model training or oracle access during the selection step itself.
  • Global graph construction yields larger gains on class-imbalanced datasets than methods that operate per class.
  • The two-term kernel reproduces the selection behavior of the full heat kernel while remaining computationally cheap.
  • Performance is measured on five MedMNIST collections covering histopathology, radiology, and microscopy.
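The claim that a two-term kernel tracks the full heat kernel can be probed numerically on a toy graph. The sketch below compares a long series for `exp(-t(I - S))` against its truncation after the `S^2` term on a five-node path graph. The normalization `S = D^{-1/2} A D^{-1/2}` and the value `t = 0.1` are assumptions chosen for illustration. It checks entrywise closeness only, not the selection behavior the paper reports.

```python
import math

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][m] * b[m][j] for m in range(n)) for j in range(n)] for i in range(n)]

def normalized_adjacency(adj):
    """S = D^{-1/2} A D^{-1/2} for a graph with no isolated nodes."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    return [[adj[i][j] / math.sqrt(deg[i] * deg[j]) if adj[i][j] else 0.0
             for j in range(n)] for i in range(n)]

def heat_series(s, t, terms):
    """e^{-t} * sum_{m < terms} t^m / m! * S^m, i.e. exp(-t(I - S)) truncated."""
    n = len(s)
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # S^0
    total = [[0.0] * n for _ in range(n)]
    for m in range(terms):
        coeff = math.exp(-t) * t ** m / math.factorial(m)
        for i in range(n):
            for j in range(n):
                total[i][j] += coeff * power[i][j]
        power = matmul(power, s)
    return total

# Path graph 0-1-2-3-4.
adj = [[1.0 if abs(i - j) == 1 else 0.0 for j in range(5)] for i in range(5)]
s = normalized_adjacency(adj)
t = 0.1
full = heat_series(s, t, terms=30)       # effectively converged
truncated = heat_series(s, t, terms=3)   # identity, S, and S^2 only
max_gap = max(abs(full[i][j] - truncated[i][j]) for i in range(5) for j in range(5))
print(max_gap)
```

The truncation assigns zero similarity to node pairs three or more hops apart, whereas the full kernel gives them a small positive weight; for small `t` that dropped weight is tiny, and it grows as `t` increases.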

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the same kernel construction were applied to embeddings from a domain-specific fine-tuned model, the coverage quality might increase further.
  • The approach could serve as an initialization step for subsequent active-learning rounds rather than a standalone one-shot solution.
  • Failure on a new dataset would most likely trace back to the pretrained embeddings failing to reflect the target manifold rather than to the kernel or selection algorithm.

Load-bearing premise

Embeddings from a generic pretrained foundation model preserve enough local manifold structure of the specific medical datasets that selections based on the k-NN graph and two-term kernel will improve downstream classifier performance.

What would settle it

On a held-out medical imaging dataset, the subsets chosen by the graph-coverage method produce classifiers whose balanced accuracy falls below that of random selection or the strongest baseline under the same annotation budget.
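Because the test above is stated in terms of balanced accuracy, it helps to recall how that metric differs from plain accuracy. The sketch below uses the standard definition (mean per-class recall); the function name and the toy labels are illustrative only.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

# Imbalanced toy labels: four samples of class 0, one of class 1.
y_true = [0, 0, 0, 0, 1]
majority_only = [0, 0, 0, 0, 0]  # a classifier that always predicts class 0

plain_accuracy = sum(t == p for t, p in zip(y_true, majority_only)) / len(y_true)
print(plain_accuracy)                            # 0.8
print(balanced_accuracy(y_true, majority_only))  # 0.5
```

A majority-class predictor looks strong under plain accuracy but scores at chance under balanced accuracy, which is why the metric suits the class-imbalanced datasets discussed here.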

Figures

Figures reproduced from arXiv: 2606.22002 by Nadia Badawi, Nazar Zaki, Rafat Damseh, Zahiriddin Rustamov.

Figure 1
Figure 1: Overview of graph coverage selection (illustrated on RetinaMNIST). A frozen UNI encoder maps images to embeddings. A k-NN graph is built over all samples, and the coverage kernel K captures both direct and 2-hop neighborhood structure (center panel: ⋆ selected, • 1-hop, ◦ 2-hop). Greedy facility location selects a balanced coreset maximizing total coverage.
Figure 2
Figure 2: Effect of global graph construction on DermaMNIST. (a) Full dataset in t-SNE space; the majority class (brown) dominates. (b) Per-class selection clusters within individual class regions. (c) Global selection distributes samples into cross-class overlap zones.
Original abstract

Training medical image classifiers on entire datasets is wasteful when annotation budgets are limited: not all samples contribute equally, yet acquiring expert labels is expensive. Active learning reduces annotation cost through iterative querying, but assumes repeated access to an oracle and requires multiple rounds of model training. One-shot geometry-based methods such as facility location avoid retraining but operate on pairwise distances that ignore the local structure of the data manifold. We propose a graph-based one-shot selection method that operates entirely on frozen foundation model embeddings. Given embeddings from a pretrained encoder, we construct a k-nearest neighbor graph over all training samples and derive a two-term coverage kernel from the heat diffusion kernel, capturing both direct and two-hop neighborhood relationships. Greedy facility location on this kernel selects class-balanced subsets that maximize coverage of the data manifold. The two-term kernel matches the full spectral heat kernel in selection behavior while reducing computation to sparse matrix operations with a single hyperparameter. We evaluate on five MedMNIST datasets spanning histopathology, radiology, and microscopy, comparing against both training-dynamics and geometry-based baselines. Our method achieves the highest balanced accuracy on nine of ten dataset-ratio conditions, with the largest gains on class-imbalanced datasets where global graph construction captures cross-class structure that per-class methods miss, all without any model training during selection. Code is available at https://github.com/zahiriddin-rustamov/graph-coverage-selection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a one-shot data selection method for medical image classification that constructs a k-NN graph on frozen pretrained foundation-model embeddings, derives a two-term coverage kernel from the heat diffusion kernel to capture direct and two-hop neighborhoods, and applies greedy facility location to select class-balanced subsets maximizing manifold coverage. Evaluated on five MedMNIST datasets (histopathology, radiology, microscopy), the method reports the highest balanced accuracy on nine of ten dataset-ratio conditions versus training-dynamics and geometry-based baselines, with largest gains on imbalanced data, and releases code at the provided GitHub link.

Significance. If the central empirical claim holds after addressing the embedding assumption, the work offers a training-free, one-shot alternative to active learning that efficiently handles annotation budgets in medical imaging, particularly for imbalanced classes via global graph structure. Credit is due for the code release (reducing circularity risk), the reduction to sparse operations with only k and one kernel hyperparameter, and the focus on cross-class manifold coverage that per-class methods miss.

major comments (2)
  1. [Evaluation / Experimental Setup] Evaluation section: no validation is provided of whether the generic pretrained embeddings preserve local manifold structure on the MedMNIST datasets (e.g., neighborhood preservation metrics such as trustworthiness or continuity, or ablation against domain-specific encoders). This assumption is load-bearing for the central claim, because the k-NN graph and two-term kernel selections are only useful downstream if the embedding geometry aligns with the data manifold; distortion would render the reported balanced-accuracy gains unattributable to the coverage kernel.
  2. [Results] Results (and abstract): the superiority claim (highest balanced accuracy on 9/10 conditions) provides no statistical significance tests, run-to-run variance, or exact baseline implementation details (e.g., hyperparameter choices or training protocols for the dynamics-based comparators). This is load-bearing for verifying the performance edge, especially on imbalanced subsets where small differences could arise from implementation variance rather than the proposed kernel.
minor comments (2)
  1. [Methods] The single kernel hyperparameter is mentioned but its value, selection procedure, and sensitivity analysis are not detailed; adding this (with ablation) would improve reproducibility without altering the core contribution.
  2. [Method] Notation for the two-term coverage kernel (direct vs. two-hop terms) could be clarified with an explicit equation reference to distinguish it from the full spectral heat kernel.
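The neighborhood-preservation check requested in major comment 1 could look like the sketch below, which implements the standard trustworthiness score in plain Python. The choice of reference space (raw pixels, a domain-specific encoder, or something else) is left open by the report and is an assumption the user must make; the toy one-dimensional points are illustrative only. The formula is valid for `k < n / 2`.

```python
import math

def neighbor_order(points, i):
    """Indices of all other points, nearest to point i first."""
    return sorted((j for j in range(len(points)) if j != i),
                  key=lambda j: math.dist(points[i], points[j]))

def trustworthiness(reference, embedded, k):
    """1.0 means every k-neighbor in `embedded` is also a k-neighbor in `reference`."""
    n = len(reference)
    penalty = 0
    for i in range(n):
        # Rank of each point j in the reference space, as seen from point i (1 = nearest).
        rank = {j: r for r, j in enumerate(neighbor_order(reference, i), start=1)}
        for j in neighbor_order(embedded, i)[:k]:
            if rank[j] > k:  # neighbor in the embedding but not in the reference space
                penalty += rank[j] - k
    return 1.0 - 2.0 * penalty / (n * k * (2 * n - 3 * k - 1))

def continuity(reference, embedded, k):
    """Continuity is trustworthiness with the roles of the two spaces swapped."""
    return trustworthiness(embedded, reference, k)

reference = [(0,), (1,), (2,), (3,), (4,), (5,)]
faithful = list(reference)                          # geometry preserved exactly
scrambled = [(0,), (5,), (1,), (4,), (2,), (3,)]    # neighborhoods shuffled

print(trustworthiness(reference, faithful, k=2))    # 1.0
print(trustworthiness(reference, scrambled, k=2))   # below 1.0
```

A low score on the frozen embeddings would indicate that the k-NN graph links samples that are not close in the reference space, which is the failure mode the comment is concerned about.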

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below.

  1. Referee: [Evaluation / Experimental Setup] Evaluation section: no validation is provided of whether the generic pretrained embeddings preserve local manifold structure on the MedMNIST datasets (e.g., neighborhood preservation metrics such as trustworthiness or continuity, or ablation against domain-specific encoders). This assumption is load-bearing for the central claim, because the k-NN graph and two-term kernel selections are only useful downstream if the embedding geometry aligns with the data manifold; distortion would render the reported balanced-accuracy gains unattributable to the coverage kernel.

    Authors: We agree that the manuscript does not include explicit validation of manifold preservation in the embeddings. This is a valid observation. We will revise the evaluation section to add neighborhood preservation metrics (trustworthiness and continuity) on the MedMNIST embeddings and include an ablation against available domain-specific encoders to support the assumption. revision: yes

  2. Referee: [Results] Results (and abstract): the superiority claim (highest balanced accuracy on 9/10 conditions) provides no statistical significance tests, run-to-run variance, or exact baseline implementation details (e.g., hyperparameter choices or training protocols for the dynamics-based comparators). This is load-bearing for verifying the performance edge, especially on imbalanced subsets where small differences could arise from implementation variance rather than the proposed kernel.

    Authors: We agree that statistical tests, variance reporting, and detailed baseline protocols are necessary for robust claims. We will revise the results section and supplementary material to report means and standard deviations over multiple random seeds, add statistical significance tests (e.g., paired tests), and provide exact hyperparameter settings and training protocols for all baselines. revision: yes
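The seed-level reporting promised in this response could take the shape sketched below: per-method mean and standard deviation, plus a paired t statistic across matched seeds. The scores are made-up placeholders for illustration, not results from the paper, and the paired t-test is only one of the "paired tests" the response could mean.

```python
import math
import statistics

def paired_t_statistic(a, b):
    """t statistic for paired samples a[i], b[i]; degrees of freedom = len(a) - 1."""
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation of the differences
    return mean_diff / (sd_diff / math.sqrt(len(diffs)))

# Placeholder balanced accuracies over five seeds (hypothetical, NOT from the paper).
method = [0.70, 0.72, 0.71, 0.73, 0.69]
baseline = [0.68, 0.69, 0.70, 0.70, 0.68]

for name, scores in (("method", method), ("baseline", baseline)):
    print(f"{name}: {statistics.mean(scores):.3f} +/- {statistics.stdev(scores):.3f}")

t_stat = paired_t_statistic(method, baseline)
print(f"paired t = {t_stat:.3f} with {len(method) - 1} degrees of freedom")
```

Pairing by seed matters because both methods share the same data split and initialization per seed, so the test is run on per-seed differences rather than on two independent groups.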

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained.


The paper defines a one-shot selection procedure directly from k-NN graph construction on frozen embeddings followed by a two-term kernel and greedy facility location. No step reduces a claimed prediction or result to a fitted parameter or self-citation by construction; the method is algorithmic and evaluated empirically on external datasets without internal fitting loops that would force equivalence between inputs and outputs. The central claim rests on the empirical performance of this explicit construction rather than any definitional or self-referential reduction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The central claim rests on the quality of generic pretrained embeddings representing medical image manifolds and on the two-term kernel being a sufficient approximation for selection quality.

free parameters (2)
  • k (nearest neighbors)
    Hyperparameter controlling graph sparsity; chosen once per dataset but not fitted to the final accuracy metric.
  • single kernel hyperparameter
    Controls the two-term coverage kernel; the only tunable value mentioned.
axioms (1)
  • domain assumption: Embeddings from a pretrained foundation model preserve the local structure of the target medical image data manifold.
    The entire pipeline operates exclusively on these frozen embeddings without any domain-specific fine-tuning or validation of manifold fidelity.
invented entities (1)
  • two-term coverage kernel (no independent evidence)
    purpose: Approximates the full spectral heat kernel to capture direct and two-hop neighborhoods using only sparse operations.
    Newly defined in the paper as a computational shortcut; no independent evidence outside the selection task is provided.

pith-pipeline@v0.9.1-grok · 5790 in / 1422 out tokens · 32453 ms · 2026-06-26T12:47:31.551849+00:00 · methodology

