PHOEBI: An Open-World Benchmark for Bacterial Identification in Phase-Contrast Microscopy

Aaditya Baranwal; Md Jahid Hasan; Shruti Vyas

arxiv: 2606.22890 · v1 · pith:OEEC6ZD4new · submitted 2026-06-22 · 💻 cs.CV

Pith reviewed 2026-06-26 09:09 UTC · model grok-4.3

classification 💻 cs.CV

keywords phase-contrast microscopy · bacterial identification · open-world recognition · multi-label classification · benchmark dataset · anchor-based decoder · leave-combinations-out

The pith

Gradient-trained per-image aggregators for bacterial identification drop 0.39-0.57 F1 on unseen species combinations while anchor-based decoders do not.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PHOEBI, a dataset of 120,000 phase-contrast images spanning 40 combinations of six rod-shaped bacterial species, together with a leave-combinations-out (LCO) protocol that holds out entire mixtures to simulate open-world field samples. It reports that every tested gradient-trained aggregator suffers a large F1 collapse on the held-out split, and it locates the failure in the aggregation step rather than in the underlying image features. Linear probes across thirteen encoders show only a six-point F1 spread, which the paper reads as evidence that the representation is sound. Three lightweight anchor-based decoders are introduced that operate geometrically over a frozen tile-feature pool and achieve higher scores on the unseen combinations than on the in-distribution validation set.
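A leave-combinations-out split can be sketched in a few lines. The snippet below is an illustration only: the species codes come from the Figure 2 caption, but this page does not list which 40 combinations PHOEBI contains or which ones are held out, so the sketch enumerates all 63 non-empty subsets as a stand-in. The function names, the `n_holdout` value, and the choice to keep pure cultures in training are assumptions, not the paper's protocol.

```python
from itertools import combinations
import random

SPECIES = ["bs", "bt", "fj", "ka", "mx", "pf"]  # codes from the Figure 2 caption


def all_combinations(species):
    """Every non-empty subset of the species list, as frozensets."""
    return [frozenset(c)
            for k in range(1, len(species) + 1)
            for c in combinations(species, k)]


def lco_split(combos, n_holdout, seed=0):
    """Hold out whole combinations; pure cultures always stay in training (assumption)."""
    rng = random.Random(seed)
    # Sort for determinism: frozensets have no stable order of their own.
    candidates = sorted((c for c in combos if len(c) > 1), key=sorted)
    held_out = set(rng.sample(candidates, n_holdout))
    train = [c for c in combos if c not in held_out]
    return train, sorted(held_out, key=sorted)


combos = all_combinations(SPECIES)
train_combos, held_out_combos = lco_split(combos, n_holdout=8, seed=1337)
```

An image then goes to the held-out test set exactly when its label set is one of the held-out combinations, so no test mixture is seen during training even though every species is.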

Core claim

On the leave-combinations-out split, every gradient-trained per-image aggregator drops between 0.39 and 0.57 F1 relative to its in-distribution performance. This drop is attributed to the aggregator architecture itself rather than the visual representation, because linear probes of thirteen different encoders over the same features vary by only about six percentage points of F1. Three anchor-based decoders are proposed that capture per-species presence geometrically over a shared frozen tile-feature pool; these decoders score higher on held-out combinations than on in-distribution validation.

What carries the argument

Anchor-based decoders that capture per-species presence geometrically over a shared frozen tile-feature pool
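One way to picture an anchor-based decoder is prototype matching over frozen tile features. The sketch below is a hypothetical illustration, not the paper's PROTOMATCH or SIMPLEXUNMIX implementation: it assumes one anchor per species, computed as the mean of normalized pure-culture tile features, and scores presence as the maximum cosine similarity over an image's tiles. The feature dimension, noise level, threshold, and all function names are assumptions.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def fit_prototypes(pure_tiles):
    """pure_tiles maps species -> (n_tiles, d) frozen features from pure cultures."""
    names = sorted(pure_tiles)
    protos = np.stack([l2_normalize(pure_tiles[s]).mean(axis=0) for s in names])
    return names, l2_normalize(protos)


def presence_scores(tiles, protos):
    """Per-species score: best cosine similarity between any tile and that anchor."""
    sims = l2_normalize(tiles) @ protos.T  # shape (T, K)
    return sims.max(axis=0)


def predict(tiles, names, protos, threshold=0.5):
    """Return the set of species whose score clears the (assumed) threshold."""
    scores = presence_scores(tiles, protos)
    return {s for s, v in zip(names, scores) if v >= threshold}


# Toy data: each species is one axis of a 16-d space plus small noise.
rng = np.random.default_rng(0)
species = ["bs", "bt", "fj", "ka", "mx", "pf"]
basis = np.eye(16)[:6]
pure = {s: basis[i] + 0.05 * rng.normal(size=(50, 16)) for i, s in enumerate(species)}
names, protos = fit_prototypes(pure)

# A toy "mixture" image: 8 tiles that look like bs and 8 that look like fj.
tiles = np.vstack([basis[0] + 0.05 * rng.normal(size=(8, 16)),
                   basis[2] + 0.05 * rng.normal(size=(8, 16))])
predicted = predict(tiles, names, protos)
```

Nothing in this decoder is trained on mixtures, which is why a combination never seen before poses no special difficulty in the toy setting. Real tiles may contain several species at once, which this sketch does not model.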

If this is right

The performance gap between in-distribution and held-out mixtures is driven primarily by decoder design rather than by the choice of visual encoder.
A single frozen feature pool extracted from any of several standard encoders can support effective multi-label prediction for bacterial mixtures.
Models can be trained on catalogued mixtures and still identify species in practical samples that contain previously unseen combinations.
Geometric modeling of species presence allows higher accuracy on novel polymicrobial samples than on the mixtures seen during training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same LCO-style protocol could be applied to other multi-label microscopy tasks where the set of possible co-occurring objects is open.
Clinical or environmental pipelines could adopt the frozen-pool plus anchor-decoder pattern to reduce retraining costs when new species appear.
The six-point spread across encoders suggests that further gains are more likely to come from decoder innovation than from larger pretraining corpora.
If the geometric anchoring mechanism generalizes, it may reduce the need for end-to-end fine-tuning in other open-set recognition settings.

Load-bearing premise

The observed F1 drop on held-out combinations is caused by the aggregator architecture rather than by unmeasured differences in data distribution, label noise, or species visual similarity.

What would settle it

An experiment in which the anchor-based decoders also exhibit a comparable F1 drop on the LCO split, or in which the six-point linear-probe spread widens substantially when distribution shifts are controlled.

Figures

Figures reproduced from arXiv: 2606.22890 by Aaditya Baranwal, Md Jahid Hasan, Shruti Vyas.

**Figure 1.** The PHOEBI compositional collapse, and how a single frozen tile-feature pool closes it. Left: one six-species mixture. Centre: model collapse on the leave-combinations-out (LCO) split vs. PHOEBI decoders, evaluated under the identical protocol. Right: the same simplex residual unlocks open-set rejection and novel-class discovery without further training.

**Figure 2.** Pure-culture appearance of the six PHOEBI species. bs and bt are thin rods that overlap in width and density; ka is short, stocky, encapsulated and morphologically isolated; mx and fj are mid-length rods; pf is a short, slightly curved rod. Bacterial-length statistics in §3.1.

**Figure 3.** Data collection. Our four-step data collection approach for culture in suspension and Phase Contrast Microscopy (PCM) imaging.

Excerpt from §3.1, Phase-contrast Optical bEnchmark for Bacterial Identification (PHOEBI) Dataset: All forty cultures were cultured in a sterile lab environment to ensure label reliability and complete control over species composition; no existing public source provides the required combinatorial …

**Figure 4.** Combinatorial structure of the PHOEBI dataset. Each column represents one of 40 combinations of six species grouped by combination order; filled cells indicate species presence.

Excerpt from §3.2, Benchmark Protocol: We release two evaluation protocols. A random 80/10/10 image-level split with a fixed seed gives in-distribution closed-set characterization. The leave-combinations-out (LCO) split, around which the experimen…

**Figure 5.** Empirical evidence for Assumption H. (a) …

**Figure 6.** In-distribution characterisation. (a) Per-class F1 on the random test split; ka is easiest and bt hardest across all three decoders. (b) Per-sample F1 vs tile count; all curves are monotone and saturating, consistent with O(1/T) variance reduction under Assumption H.
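The O(1/T) behaviour mentioned in the Figure 6 caption can be checked numerically in a toy setting. This page does not state Assumption H, so the sketch assumes only that per-tile scores within an image are independent draws with a common variance; under that assumption the variance of the per-image mean falls as σ²/T. The function name and all parameter values are illustrative.

```python
import numpy as np


def variance_of_tile_mean(T, n_images=20000, sigma=1.0, seed=0):
    """Empirical variance of the mean of T i.i.d. tile scores (toy assumption)."""
    rng = np.random.default_rng(seed)
    scores = rng.normal(0.0, sigma, size=(n_images, T))
    return float(scores.mean(axis=1).var())


# With sigma = 1, T * variance should stay close to 1 for every T.
results = {T: variance_of_tile_mean(T) for T in (1, 4, 16)}
```

If tiles from the same image were strongly correlated, the product `T * variance` would grow with T instead of staying flat, and the saturating curves in panel (b) would level off earlier.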

**Figure 7.** Phase-contrast through a Bayer colour camera is true RGB with a systematic warm cast, …

**Figure 8.** Held-out F1 by combination order (mean ± std across seeds 1337–1339). The grey band marks the supervised baseline range [0.44, 0.61]. SIMPLEXUNMIX and PROTOMATCH exceed the supervised ceiling at every order ≥ 2.

**Figure 9.** Per-class precision–recall curves on the …

**Figure 10.** Residual-norm distributions for SIMPLEXUNMIX across the six LOOCV folds, known (blue) vs. unknown (red). The ka fold is the most striking: held-out ka tiles land in a lower-residual region than in-distribution tiles, producing anti-discriminative AUROC (0.066); ka features lie inside the convex hull spanned by the remaining five prototypes, so the simplex reconstructs them with small residual even though …
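The ka failure mode in Figure 10 can be reproduced in a toy geometry. The sketch below fits simplex weights (non-negative, summing to one) over a set of prototypes by projected gradient descent and reports the reconstruction residual. It is an assumption-laden illustration rather than the paper's SIMPLEXUNMIX code: the orthonormal toy prototypes, the solver, and the names `project_simplex` and `simplex_residual` are all hypothetical.

```python
import numpy as np


def project_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1} (sort-based method)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    tau = css[rho] / (rho + 1)
    return np.maximum(v - tau, 0.0)


def simplex_residual(x, protos, n_iter=200):
    """Minimise ||w @ protos - x|| over the simplex; return (weights, residual norm)."""
    K = protos.shape[0]
    w = np.full(K, 1.0 / K)
    # Step size 1/L, where L = 2 * sigma_max(protos)^2 bounds the gradient's Lipschitz constant.
    step = 1.0 / (2.0 * np.linalg.norm(protos, 2) ** 2)
    for _ in range(n_iter):
        grad = 2.0 * protos @ (w @ protos - x)
        w = project_simplex(w - step * grad)
    return w, float(np.linalg.norm(w @ protos - x))


protos = np.eye(6)[:5]                                  # five known prototypes in R^6
inside = np.array([0.4, 0.3, 0.3, 0.0, 0.0]) @ protos   # a point in their convex hull
outside = np.eye(6)[5]                                  # a direction none of them spans
_, r_in = simplex_residual(inside, protos)
_, r_out = simplex_residual(outside, protos)
```

In this toy, the point inside the hull is reconstructed with essentially zero residual while the orthogonal direction leaves a residual of about 1.10, so a residual threshold rejects only the second. A held-out class whose features fall inside the hull of the known prototypes would therefore look "known", which is the situation the caption describes for ka.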

**Figure 11.** Per-species reliability diagrams on the LCO held-out test set. None of the three decoders …

Original abstract

Optical microscopy enables rapid, label-free imaging of live bacteria and is the standard instrument for species identification across clinical, environmental, and industrial microbiology. Yet field samples are routinely polymicrobial and may contain organisms that were never seen during system training, and no computer-vision benchmark tests multi-label species identification from phase-contrast microscopy (PCM) of such mixtures. We introduce Phase-contrast Optical bEnchmark for Bacterial Identification ($\textbf{PHOEBI}$), a wet-lab-prepared dataset of $120{,}000$ PCM images covering $40$ combinations of six rod-shaped species, paired with a leave-combinations-out (LCO) evaluation protocol that holds out entire species combinations to mirror the practical scenario of a model trained on catalogued mixtures that must generalise to unseen ones. On LCO, every gradient-trained per-image aggregator we test drops $0.39$ to $0.57$ F1 from the in-distribution to the held-out split, a systematic open-world recognition failure in the aggregator, not the visual representation. A linear probe of thirteen different encoders over the same features spreads only about six percentage points of F1 across general-purpose and biomedical pretraining objectives, confirming the representation is sound. We propose three lightweight $\textit{anchor-based}$ decoders that capture per-species presence geometrically over a shared frozen tile-feature pool, scoring $\textit{higher}$ on held-out combinations than on in-distribution validation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PHOEBI supplies a useful new benchmark and LCO protocol for multi-label bacterial ID in phase-contrast images, but the claim that the performance drop is purely an aggregator problem rests on evidence that does not test the held-out case.

The main thing here is the dataset itself: 120,000 PCM images across 40 combinations of six rod-shaped species, built with a leave-combinations-out split that actually matches the real-world problem of unseen mixtures. That setup is new and fills a gap that standard ImageNet-style or single-species benchmarks miss.

The paper shows that common per-image aggregators lose 0.39–0.57 F1 when moving from in-distribution to held-out combinations. It also reports that a linear probe across thirteen encoders varies by only about six points. Those two observations are worth having on record.

The soft spot is the attribution. The linear-probe spread is presented as proof that the visual representation is sound and the failure sits in the aggregator. Nothing in the abstract indicates those probes were evaluated on the LCO held-out splits; they read as in-distribution only. If that is the case, the small spread only shows the encoders are comparable on seen mixtures and does not address whether the frozen tile features remain useful for unseen combinations. The three anchor-based decoders are offered as better, yet without a side-by-side showing that linear probes on the same features also fail on LCO while the new heads succeed, the isolation of the problem to architecture is not secured.

This is the sort of benchmark paper that groups working on clinical or industrial microbiology imaging would want to see. It deserves peer review because the dataset and protocol are concrete and address a documented practical need, even if the causal claim about aggregators versus features needs tighter ablations and clearer reporting on which splits the probes used.

Referee Report

1 major / 0 minor

Summary. The paper introduces PHOEBI, a dataset of 120,000 phase-contrast microscopy images spanning 40 combinations of six rod-shaped bacterial species, together with a leave-combinations-out (LCO) protocol that holds out entire species mixtures. It reports that every tested gradient-trained per-image aggregator drops 0.39–0.57 F1 on LCO splits relative to in-distribution performance, attributes the failure to aggregator architecture rather than the visual representation on the basis of a linear-probe spread of ~6 F1 points across 13 encoders, and proposes three lightweight anchor-based decoders that achieve higher F1 on held-out combinations than on in-distribution validation.

Significance. If the attribution of the performance drop to aggregator choice is secured, the work supplies a reproducible open-world benchmark and evaluation protocol for multi-label bacterial identification in PCM that directly mirrors clinical and environmental use cases. The empirical demonstration of systematic aggregator failure, the small linear-probe variation, and the proposed geometric decoders that improve on LCO constitute concrete, falsifiable contributions that could become a reference point for testing generalization in microbiology computer vision.

major comments (1)

[Abstract] The claim that the representation is sound because 'a linear probe of thirteen different encoders over the same features spreads only about six percentage points of F1' does not specify whether these probes were run on the LCO held-out splits. If the probes are reported only on in-distribution data, the 6-point spread demonstrates only that the encoders are comparable on seen combinations and does not test whether the frozen tile features remain discriminative for unseen species mixtures, leaving the isolation of failure to the aggregator unsupported.
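The comparison the referee asks for amounts to fitting one linear probe and scoring it on both splits. The sketch below shows that evaluation shape on synthetic data; it is not the paper's probe. Ridge regression stands in for the linear probe, image features are generated as a sum of hypothetical per-species directions plus noise, and the held-out mixtures are chosen arbitrarily. In this linear toy world the probe transfers to unseen combinations by construction, so the numbers say nothing about PHOEBI itself.

```python
import numpy as np
from itertools import combinations


def ridge_probe(X, Y, lam=1e-2):
    """Closed-form ridge regression with a bias column (stand-in for a linear probe)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return np.linalg.solve(Xb.T @ Xb + lam * np.eye(Xb.shape[1]), Xb.T @ Y)


def probe_predict(W, X, threshold=0.5):
    """Threshold the linear outputs independently per class."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return (Xb @ W >= threshold).astype(int)


def macro_f1(Y_true, Y_pred):
    """Unweighted mean of per-class F1 over the label columns."""
    tp = (Y_true * Y_pred).sum(axis=0)
    fp = ((1 - Y_true) * Y_pred).sum(axis=0)
    fn = (Y_true * (1 - Y_pred)).sum(axis=0)
    denom = 2 * tp + fp + fn
    return float(np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0).mean())


rng = np.random.default_rng(0)
K, d = 6, 32
anchors = rng.normal(size=(K, d))                     # hypothetical species directions
combos = [c for k in range(1, K + 1) for c in combinations(range(K), k)]
held_out = [c for c in combos if len(c) == 3][:5]     # arbitrary held-out mixtures
seen = [c for c in combos if c not in held_out]


def make_images(combo_list, n_per_combo):
    """Synthetic image features: sum of present species' directions plus noise."""
    Y = np.zeros((len(combo_list) * n_per_combo, K), dtype=int)
    for i, c in enumerate(combo_list):
        Y[i * n_per_combo:(i + 1) * n_per_combo, list(c)] = 1
    X = Y @ anchors + 0.05 * rng.normal(size=(len(Y), d))
    return X, Y


X_train, Y_train = make_images(seen, 20)
X_val, Y_val = make_images(seen, 5)                   # in-distribution validation
X_lco, Y_lco = make_images(held_out, 5)               # leave-combinations-out test

W = ridge_probe(X_train, Y_train)
f1_id = macro_f1(Y_val, probe_predict(W, X_val))
f1_lco = macro_f1(Y_lco, probe_predict(W, X_lco))
gap = f1_id - f1_lco
```

Reporting `f1_id`, `f1_lco`, and `gap` for each encoder, next to the same three numbers for the aggregators and the anchor-based decoders, would give the side-by-side that the comment says is missing.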

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and for identifying an ambiguity in the abstract. We respond to the major comment below.

Referee: [Abstract] The claim that the representation is sound because 'a linear probe of thirteen different encoders over the same features spreads only about six percentage points of F1' does not specify whether these probes were run on the LCO held-out splits. If the probes are reported only on in-distribution data, the 6-point spread demonstrates only that the encoders are comparable on seen combinations and does not test whether the frozen tile features remain discriminative for unseen species mixtures, leaving the isolation of failure to the aggregator unsupported.

Authors: We agree that the abstract does not specify the evaluation split. The linear probes were performed on the in-distribution validation set. This supports encoder comparability on seen combinations but does not directly demonstrate that the frozen tile features remain discriminative under the LCO protocol. To strengthen the attribution of failure to the aggregator, we will revise the abstract to qualify the claim and will add linear-probe results computed on the LCO held-out splits (both in the main text and supplementary material) in the revised manuscript.

Revision: yes

Circularity Check

0 steps flagged

No circularity; empirical benchmark comparisons are self-contained

The paper introduces a new dataset and LCO protocol, then reports direct performance numbers for gradient-trained aggregators and anchor-based decoders on the same held-out splits, together with a linear-probe comparison across encoders. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the derivation of the central claim. The isolation of failure to aggregator architecture rests on observed F1 differences rather than any definitional or self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; the evaluation protocol itself is the main methodological contribution.

pith-pipeline@v0.9.1-grok · 5796 in / 1115 out tokens · 34102 ms · 2026-06-26T09:09:52.012100+00:00 · methodology

Reference graph

Works this paper leans on

14 extracted references · 7 canonical work pages · 1 internal anchor

[1] José M. Bioucas-Dias, Antonio Plaza, Nicolas Dobigeon, Mario Parente, Qian Du, Paul Gader, and Jocelyn Chanussot. Hyperspectral unmixing overview: Geometrical, statistical, and sparse regression-based approaches. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 5(2):354–379.

[2] Richard J. Chen, Tong Ding, Ming Y. Lu, Drew F. K. Williamson, Guillaume Jaume, Andrew H. Song, Bowen Chen, Andrew Zhang, Daniel Shao, Muhammad Shaban, et al. Towards a general-purpose foundation model for computational pathology. Nature Medicine.

[3] Hervé Goëau, Vincent Espitalier, Pierre Bonnet, and Alexis Joly. Overview of PlantCLEF 2024: Multi-species plant identification in vegetation plot images. In Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.

[4] Alexis Joly, Lukáš Picek, Stefan Kahl, Hervé Goëau, Vincent Espitalier, Christophe Botella, et al. Overview of LifeCLEF 2024: Challenges on species distribution prediction and identification. In Experimental IR Meets Multilinguality, Multimodality, and Interaction (CLEF 2024), Lecture Notes in Computer Science. Springer, 2024.

[5] Shilong Liu, Tianhe Ren, Jiemin Chen, Zhaoyang Zeng, Hao Zhang, Feng Li, Hongyang Li, Jun Huang, Hang Su, Jun Zhu, and Lei Zhang. Query2label: A simple transformer way to multi-label classification. arXiv preprint arXiv:2107.10834.

[6] Lukáš Picek, Milan Šulc, and Jiří Matas. Overview of FungiCLEF 2024: Revisiting fungi species recognition beyond 0–1 cost. In Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.

[7] Sagar Vaze, Kai Han, Andrea Vedaldi, and Andrew Zisserman. Generalized category discovery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7492–7501.

[8] Sheng Zhang, Yanbo Xu, Naoto Usuyama, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, Cliff Wong, Matthew P. Lungren, Tristan Naumann, and Hoifung Poon. BiomedCLIP: A multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. arXiv preprint arXiv:2303.00915.

[9] Excerpt (Appendix A, Front-End and Decoder Implementation Details; A.1 Tile pipeline): For an image $I: \Omega \to \mathbb{R}^3$ we estimate a per-channel background $B_c = G_\sigma * I_c$ via a large-$\sigma$ Gaussian and form $\tilde{I}_c = I_c / (B_c / \bar{B})$, picking $\sigma = 64$ px so cellular structure (5 to 20 px) is preserved while the lamp gradient (hundreds of px) is captured. We sample $T = 16$ tiles of side $s = 224$ per image (...

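The background-flattening step quoted in excerpt [9] can be sketched with NumPy alone. The code below is a hedged illustration of $\tilde{I}_c = I_c / (B_c / \bar{B})$: it assumes $\bar{B}$ is the spatial mean of $B_c$, truncates the Gaussian at three standard deviations with reflect padding, and draws tile positions uniformly at random, none of which the truncated excerpt confirms. Function names are hypothetical; the defaults σ = 64, T = 16, and s = 224 are the values quoted in the excerpt, while the demo uses smaller values to stay fast.

```python
import numpy as np


def gaussian_blur(channel, sigma):
    """Separable Gaussian blur with reflect padding (NumPy only)."""
    radius = int(3 * sigma)
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (t / sigma) ** 2)
    kernel /= kernel.sum()
    out = channel.astype(np.float64)
    for axis in (0, 1):
        pad = [(0, 0), (0, 0)]
        pad[axis] = (radius, radius)
        padded = np.pad(out, pad, mode="reflect")
        # "valid" convolution on the padded signal returns the original length.
        out = np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="valid"), axis, padded)
    return out


def flatten_background(image, sigma=64.0):
    """Divide each channel by its blurred background, rescaled by the background mean."""
    out = np.empty(image.shape, dtype=np.float64)
    for c in range(image.shape[2]):
        background = gaussian_blur(image[..., c], sigma)
        out[..., c] = image[..., c] / (background / background.mean())
    return out


def sample_tiles(image, n_tiles=16, side=224, seed=0):
    """Crop n_tiles square tiles at uniformly random positions (assumed scheme)."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    ys = rng.integers(0, h - side + 1, size=n_tiles)
    xs = rng.integers(0, w - side + 1, size=n_tiles)
    return np.stack([image[y:y + side, x:x + side] for y, x in zip(ys, xs)])


# Demo on a synthetic lamp gradient that brightens from left to right.
h = w = 128
ramp = 100.0 * (1.0 + 0.5 * np.arange(w) / (w - 1))
image = np.repeat(np.tile(ramp, (h, 1))[..., None], 3, axis=2)
flat = flatten_background(image, sigma=8.0)
tiles = sample_tiles(flat, n_tiles=16, side=64, seed=0)
```

Because the blur scale is much larger than the stated cell size, dividing by the blurred image removes the slow illumination trend while leaving cell-scale contrast largely intact; in the demo, the only remaining variation sits near the image borders where padding affects the blur.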
[10] Excerpt: Cache reuse across folds keeps the cost of a full sweep at one feature-extraction pass; the per-fold inner loop is sub-second on a single GPU.

Algorithm 1: PHOEBI leave-one-out cross-validation (LOOCV) open-set + discovery sweep
Require: Train/val/test splits; tile config; K species
1: Extract tile features on train/val/test once (cache reuse across folds)
2: fo...

[11] Plot residue from the per-class precision–recall panels (axes: Recall, Precision; legend: A = simplex unmix, B = proto match). Panel annotations: bs base=0.42, A (0.56), B (0.50); bt base=0.33, A (0.32), B (0.35); fj base=0.47, A (0.47), B (0.60); ka base=0.47, A (0.82), B (0.91); mx base=0.47, A (0.55), B (0.41); pf base=0.45, A (0.49), B (0.47).

[12] Plot residue from the residual-density panels (x-axis: mean tile residual ‖r(x)‖; y-axis: density; legend: unknown (red)). Panels: held-out bs, held-out bt, held-out fj, held-out ka, held-out mx, ...

[13] Excerpt: The decoder-level evidence directly mirrors the cross-method LCO evidence and rules out an alternative explanation in which the gap-closing is a feature of frozen-backbone training in general; only the geometrically-anchored decoders close the gap.

Boundary-tile robustness check for H. A direct probe of whether H breaks at the field-of-view boundary: re-ru...

[14] Excerpt: … makes this ordering a property of inter-species geometry on phase-contrast bacteria rather than of DINOV2. Exact-match accuracy collapses to < 0.10 on quadruples and above for all decoders, a structural artifact of independent per-class thresholding (at F1 = 0.80 per class, six-class exact match is bounded by $0.80^6 \approx 0.26$); exact match is therefore a side ...
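
The bound quoted in excerpt [14] follows from a one-line product. As a worked step, and under the simplifying assumption (suggested by the excerpt's wording but not spelled out on this page) that each of the six per-class decisions is correct independently with probability 0.80:

```latex
\[
P(\text{exact match}) \;=\; \prod_{k=1}^{6} p_k \;=\; 0.80^{6} \;\approx\; 0.262
\]
```

Treating a per-class F1 of 0.80 as a per-class probability of being right is itself a heuristic, so the figure is best read as a rough ceiling rather than an exact limit.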
