pith. machine review for the scientific record.

arxiv: 2604.27538 · v1 · submitted 2026-04-30 · 💻 cs.CV

Recognition: unknown

Self-Supervised Learning of Plant Image Representations

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 09:36 UTC · model grok-4.3

classification 💻 cs.CV
keywords self-supervised learning · plant image recognition · data augmentation · fine-grained classification · few-shot learning · vision transformers · biodiversity monitoring

The pith

Domain-specific augmentations and plant-only data produce stronger self-supervised representations for fine-grained plant recognition than standard SSL practices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard self-supervised augmentations such as Gaussian blur, grayscale conversion, and solarization erase the subtle visual cues that distinguish one plant species from another. Replacing those with transformations like affine changes and posterization, while training exclusively on plant images from the iNaturalist 2021 Plantae subset, yields representations that transfer better to downstream tasks. These models match or exceed the performance of strong supervised baselines on few-shot plant recognition, and the gains hold for both base and large vision transformer architectures. The work matters because expert labels for plants are scarce, so reliable label-free pretraining could scale biodiversity monitoring. It demonstrates that both the choice of data and the choice of transformations must be adapted to the target domain rather than borrowed from coarse-grained natural-image pipelines.

Core claim

Commonly used augmentations in SSL pipelines are detrimental for plant images because they remove subtle discriminative cues essential for fine-grained recognition; alternative transformations, including affine and posterization, are better suited. Training SimDINOv2 on the iNaturalist 2021 Plantae subset produces significantly stronger representations than training on ImageNet-1K, and the resulting models achieve competitive, sometimes superior, performance relative to supervised baselines on downstream few-shot plant recognition tasks across both ViT-Base and ViT-Large.
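
For orientation, a minimal sketch of the DINO-style self-distillation that SimDINOv2 builds on (see Figure 2 below): a student network learns to match an exponential-moving-average teacher under cosine similarity across augmented views. The linear stand-in encoders, shapes, and momentum value are illustrative assumptions, not the paper's configuration, and SimDINOv2's simplifications (coding-rate regularization in place of DINO's prototype head) are omitted.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for the ViT backbones; sizes are illustrative only.
student = nn.Linear(384, 256)
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)

def distill_loss(student_emb, teacher_emb):
    # Negative cosine similarity; the teacher provides targets, no gradient.
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb.detach(), dim=-1)
    return -(s * t).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(momentum=0.996):
    # The teacher tracks the student as an exponential moving average.
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1.0 - momentum)

# Two augmented views of the same batch; each is distilled against the other.
view_a, view_b = torch.randn(8, 384), torch.randn(8, 384)
loss = distill_loss(student(view_a), teacher(view_b)) \
     + distill_loss(student(view_b), teacher(view_a))
loss.backward()
ema_update()
```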

What carries the argument

A domain-adapted SSL pipeline that substitutes destructive augmentations with affine and posterization transforms and restricts pretraining to a large plant-only image collection to preserve species-specific visual details.
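
In torchvision terms, the substitution might look like the sketch below; the parameter values (crop size, degrees, bits, probabilities) are illustrative guesses rather than the authors' settings.

```python
from torchvision import transforms

# Conventional SSL recipe (SimCLR/BYOL/DINO-style): blur, grayscale,
# and solarization can erase fine-grained cues such as leaf venation.
standard_ssl = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)),
    transforms.RandomSolarize(threshold=128, p=0.2),
    transforms.ToTensor(),
])

# Plant-adapted recipe in the paper's spirit: geometric jitter and
# posterization in place of the destructive photometric transforms.
plant_ssl = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandomAffine(degrees=15, translate=(0.1, 0.1), scale=(0.9, 1.1)),
    transforms.RandomPosterize(bits=4, p=0.3),
    transforms.ToTensor(),
])
```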

If this is right

  • Plant-specific SSL pretraining supplies better initial weights for few-shot species recognition than general ImageNet pretraining (a minimal probe of this is sketched after this list).
  • Avoiding blur, grayscale, and solarization preserves the fine visual details required to separate closely related plant species.
  • Domain-matched datasets can outperform larger but less relevant corpora when learning representations for specialized visual tasks.
  • The same augmentation and data choices improve results consistently for both ViT-Base and ViT-Large backbones.
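
A common way to test the first bullet is a nearest-class-mean probe over frozen features, in the spirit of SimpleShot (ref. 25 below): embed a few labeled support images per species, then classify each query by its closest class centroid. This is a minimal sketch assuming precomputed NumPy embeddings; the `embed` backbone named in the comment is hypothetical.

```python
import numpy as np

def nearest_class_mean(support_emb, support_labels, query_emb):
    """Few-shot probe over frozen features: assign each query image to
    the nearest class centroid of the labeled support embeddings."""
    classes = np.unique(support_labels)
    centroids = np.stack([support_emb[support_labels == c].mean(axis=0)
                          for c in classes])
    # L2-normalize so the dot product ranks by cosine similarity.
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
    queries = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    return classes[np.argmax(queries @ centroids.T, axis=1)]

# Hypothetical usage: `embed` is whichever frozen backbone is under test
# (plant-adapted SSL vs. ImageNet SSL), applied to support and query images.
# preds = nearest_class_mean(embed(support_x), support_y, embed(query_x))
```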

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same principle of replacing destructive augmentations may apply to other fine-grained biological domains such as insect or fungi identification.
  • Scaling the plant-only pretraining corpus beyond the 2021 iNaturalist subset could further reduce dependence on expert annotations for new species.
  • These results suggest that general-purpose pretraining corpora are suboptimal starting points whenever the target task involves subtle intra-class variation.

Load-bearing premise

The reported improvements stem specifically from the altered augmentations and the plant-only training set rather than from unstated differences in training schedule, hyperparameters, or evaluation protocols.

What would settle it

Retraining the identical architectures with the original augmentations and ImageNet-1K data under matched schedules and hyperparameters, then measuring whether their few-shot downstream accuracy merely matches, or falls below, that of the plant-adapted models.
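
Schematically, that decisive experiment is a 2×2 grid over augmentation recipe and pretraining corpus with every other protocol element pinned; a minimal sketch follows, in which every name and value is an illustrative assumption rather than the paper's actual hyperparameters.

```python
from itertools import product

# Everything except the two factors under study is pinned, so any
# accuracy gap is attributable to augmentation and data, not protocol.
fixed_protocol = dict(optimizer="adamw", lr=1e-3, schedule="cosine",
                      epochs=100, batch_size=1024, backbone="vit_base")

augmentations = ["standard_ssl", "plant_adapted"]   # blur/gray/solarize vs. affine/posterize
pretrain_corpora = ["imagenet_1k", "inat2021_plantae"]

for aug, corpus in product(augmentations, pretrain_corpora):
    run_config = {**fixed_protocol, "augmentation": aug, "pretrain_data": corpus}
    print(run_config)  # in practice: launch pretraining, then the few-shot probe
```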

Figures

Figures reproduced from arXiv: 2604.27538 by Alexis Joly, Hervé Goëau, Ilyass Moummad, Jean-Christophe Lombardo, Kawtar Zaher, Pierre Bonnet.

Figure 1
Figure 1: Comparison of general-purpose data augmentations from self-supervised visual learning frameworks (e.g., SimCLR, BYOL, DINO) with our proposed augmentations for fine-grained plant species recognition. Standard techniques such as Gaussian blur, grayscale, and solarize destroy crucial plant information, whereas posterize and affine transformations preserve distinguishing features while still allowing visual d…
Figure 2
Figure 2: The SimDINOv2 framework, showing both global and patch-level self-distillation using cosine similarity.
Figure 3
Figure 3: t-SNE visualization of feature embeddings on the Pl@ntNet dataset for different pretrained models. SimDINOv2 pretrained on ImageNet produces dispersed and overlapping class clusters, indicating poor separability of plant species. In contrast, SimDINOv2 pretrained on iNaturalist Plantae forms compact and well-separated clusters, comparable to supervised approaches BioCLIP and Pl@ntCLEF. This highlights th…
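
A Figure-3-style panel can be approximated with scikit-learn's t-SNE over frozen embeddings, one plot per pretrained backbone. A minimal sketch, assuming the feature arrays and integer species labels are precomputed; the perplexity and other settings are library defaults, not necessarily the authors' choices.

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_embedding_clusters(features, labels, title):
    """Project frozen (n_samples, dim) features to 2-D and color by species;
    tighter, better-separated clusters suggest more discriminative features."""
    coords = TSNE(n_components=2, perplexity=30, init="pca",
                  random_state=0).fit_transform(features)
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=4, cmap="tab20")
    plt.title(title)
    plt.show()

# Hypothetical usage, one call per backbone under comparison:
# plot_embedding_clusters(feats_imagenet, species_ids, "SimDINOv2 / ImageNet-1K")
# plot_embedding_clusters(feats_plantae, species_ids, "SimDINOv2 / iNat-2021 Plantae")
```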
Original abstract

Automated plant recognition plays a crucial role in biodiversity monitoring and conservation, yet current approaches rely heavily on supervised learning, which is limited by the availability of expert-labeled data. Self-supervised learning (SSL) offers a scalable alternative, but existing methods and training protocols are largely designed for coarse-grained visual tasks and may not transfer well to fine-grained domains such as plant species recognition. In this work, we investigate SSL for plant image representation learning. We show that commonly used augmentations in SSL pipelines - such as Gaussian blur, grayscale conversion, and solarization - are detrimental in the context of plant images, as they remove subtle discriminative cues essential for fine-grained recognition. We instead identify alternative transformations, including affine and posterization, that are better suited to this domain. We further demonstrate that training SimDINOv2 on the iNaturalist 2021 Plantae subset yields significantly stronger representations than training on ImageNet-1K, highlighting the importance of domain-specific data for SSL. Our findings are consistent across both ViT-Base and ViT-Large architectures. Moreover, our models achieve competitive performance and sometimes outperform strong supervised baselines Pl@ntCLEF and BioCLIP on downstream plant recognition tasks in few-shot settings. Overall, our results highlight the critical importance of domain-adapted augmentation strategies and dataset selection in self-supervised learning, and provide practical guidelines for building scalable models for biodiversity monitoring.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript presents an empirical study on adapting self-supervised learning (SSL) techniques, specifically SimDINOv2, for learning representations from plant images. Key findings include that standard augmentations like Gaussian blur, grayscale conversion, and solarization degrade performance on fine-grained plant tasks by eliminating subtle cues, while alternatives such as affine transformations and posterization are more effective. Training on the iNaturalist 2021 Plantae subset is shown to outperform training on ImageNet-1K, with models achieving competitive or better results than supervised methods like Pl@ntCLEF and BioCLIP in few-shot downstream plant recognition tasks. Results are reported consistently for ViT-Base and ViT-Large architectures.

Significance. If the results hold with proper controls, the work would be significant for computer vision and biodiversity monitoring by highlighting the need for domain-specific SSL adaptations in fine-grained recognition. The identification of detrimental augmentations and benefits of in-domain data offer practical guidelines, with cross-architecture consistency (ViT-Base/Large) as a strength. The empirical focus on few-shot tasks adds value for data-scarce settings.

major comments (3)
  1. [Abstract] The central claims of 'significantly stronger representations' from iNaturalist 2021 Plantae vs. ImageNet-1K training, and of competitive or superior results vs. Pl@ntCLEF/BioCLIP in few-shot settings, lack any quantitative metrics, effect sizes, or statistical test details, undermining the ability to assess the magnitude and reliability of the reported gains.
  2. [Experiments section (augmentation and dataset ablations)] The attribution of gains specifically to the proposed augmentations (affine/posterization) and the plant-only dataset is load-bearing, yet the manuscript provides no explicit confirmation that optimizer, learning-rate schedule, epochs, batch size, or other protocol elements were identical across conditions; without this, confounding cannot be excluded from the comparisons made in the abstract.
  3. [Few-shot evaluation subsection] Comparisons to the supervised baselines Pl@ntCLEF and BioCLIP require explicit details on whether identical data splits, pretraining protocols, and evaluation procedures were used, as any mismatch would prevent attributing superiority to the SSL adaptations.
minor comments (1)
  1. [Introduction] The motivation section could more clearly contrast SimDINOv2 with other recent SSL variants to aid readers.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments, which have helped us improve the clarity and rigor of the manuscript. We address each major comment below and have revised the paper accordingly to provide the requested details and confirmations.

Point-by-point responses
  1. Referee: [Abstract] The central claims of 'significantly stronger representations' from iNaturalist 2021 Plantae vs. ImageNet-1K training, and of competitive or superior results vs. Pl@ntCLEF/BioCLIP in few-shot settings, lack any quantitative metrics, effect sizes, or statistical test details, undermining the ability to assess the magnitude and reliability of the reported gains.

    Authors: We agree that the abstract would be strengthened by including specific quantitative support for the claims. In the revised version, we have updated the abstract to report key performance metrics (e.g., top-1 accuracy improvements on downstream few-shot tasks), approximate effect sizes relative to baselines, and a brief reference to the statistical significance tests detailed in the experiments section. These additions allow readers to immediately gauge the magnitude and reliability of the gains without altering the overall narrative. revision: yes

  2. Referee: [Experiments section (augmentation and dataset ablations)] The attribution of gains specifically to the proposed augmentations (affine/posterization) and the plant-only dataset is load-bearing, yet the manuscript provides no explicit confirmation that optimizer, learning-rate schedule, epochs, batch size, or other protocol elements were identical across conditions; without this, confounding cannot be excluded from the comparisons made in the abstract.

    Authors: All ablation studies were performed with an identical training protocol, including the same optimizer (AdamW), learning-rate schedule, total epochs, batch size, and other hyperparameters, as specified in the methods and implementation details sections. To eliminate any ambiguity, we have added an explicit statement in the experiments section confirming that the protocol was held constant across all augmentation and dataset conditions, thereby isolating the effects under study. revision: yes

  3. Referee: [Few-shot evaluation subsection] Comparisons to the supervised baselines Pl@ntCLEF and BioCLIP require explicit details on whether identical data splits, pretraining protocols, and evaluation procedures were used, as any mismatch would prevent attributing superiority to the SSL adaptations.

    Authors: We followed the exact data splits, few-shot sampling procedures, and evaluation protocols reported in the original Pl@ntCLEF and BioCLIP papers for fair comparison. We have now expanded the few-shot evaluation subsection to explicitly state this consistency, including references to the specific splits and evaluation code settings used, ensuring that any observed advantages can be attributed to the SSL adaptations rather than procedural differences. revision: yes

Circularity Check

0 steps flagged

No derivation chain present; purely empirical comparisons

Full rationale

The paper reports experimental results on augmentation choices and dataset selection for SSL on plant images, with performance measured against external benchmarks (ImageNet-1K, iNaturalist Plantae, Pl@ntCLEF, BioCLIP). No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the abstract or described claims. All central assertions rest on direct empirical comparisons that can be replicated or falsified independently, satisfying the criteria for a self-contained non-circular study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical study that inherits standard assumptions from the self-supervised learning literature (contrastive or distillation objectives can learn transferable features) and from computer-vision practice (vision transformers are suitable backbones). No new entities or free parameters are introduced beyond those already present in SimDINOv2 and DINOv2.

axioms (1)
  • domain assumption: Self-supervised objectives on unlabeled images can produce representations useful for downstream supervised tasks.
    Core premise of all SSL methods cited in the abstract.

pith-pipeline@v0.9.0 · 5566 in / 1370 out tokens · 63827 ms · 2026-05-07T09:36:33.299065+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

27 extracted references · 2 canonical work pages

  1. Balestriero, R., Ibrahim, M., Sobal, V., Morcos, A., Shekhar, S., Goldstein, T., Bordes, F., Bardes, A., Mialon, G., Tian, Y., et al.: A Cookbook of Self-Supervised Learning. arXiv preprint arXiv:2304.12210 (2023)
  2. Bardes, A., Ponce, J., LeCun, Y.: VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning. In: International Conference on Learning Representations (2022)
  3. Bonnet, P., Joly, A., Faton, J.M., Brown, S., Kimiti, D., Deneu, B., Servajean, M., Affouard, A., Lombardo, J.C., Mary, L., et al.: How citizen scientists contribute to monitor protected areas thanks to automatic plant identification tools. Ecological Solutions and Evidence 1(2), e12023 (2020)
  4. Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., Joulin, A.: Unsupervised Learning of Visual Features by Contrasting Cluster Assignments. In: Advances in Neural Information Processing Systems (NeurIPS) (2020)
  5. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging Properties in Self-Supervised Vision Transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021)
  6. Chai, A.Y.H., Jee, K.L.Z., Lee, S.H., Tay, F.S., Vandeputte, J., Goëau, H., Bonnet, P., Joly, A.: Deep-plant-disease dataset is all you need for plant disease identification. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 12578–12584. MM '25, Association for Computing Machinery, New York, NY, USA (2025)
  7. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A Simple Framework for Contrastive Learning of Visual Representations. In: International Conference on Machine Learning (ICML) (2020)
  8. Chen, X., He, K.: Exploring Simple Siamese Representation Learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021)
  9. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. pp. 248–255. IEEE (2009)
  10. Garcin, C., Joly, A., Bonnet, P., Affouard, A., Lombardo, J.C., Chouet, M., Servajean, M., Lorieul, T., Salmon, J.: Pl@ntNet-300K: a plant image dataset with high label ambiguity and a long-tailed distribution. In: Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (2021)
  11. Goëau, H., Bonnet, P., Joly, A.: Overview of PlantCLEF 2023: Image-based Plant Identification at Global Scale. In: Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2023). CEUR Workshop Proceedings, vol. 3497, pp. 1972–1981. CEUR-WS (Sep 2023)
  12. Goëau, H., Bonnet, P., Joly, A., Bakić, V., Barbe, J., Yahiaoui, I., Selmi, S., Carré, J., Barthélémy, D., Boujemaa, N., et al.: Pl@ntNet mobile app. In: Proceedings of the 21st ACM International Conference on Multimedia. pp. 423–424 (2013)
  13. Goëau, H., Espitalier, V., Bonnet, P., Joly, A.: Overview of PlantCLEF 2024: Multi-species plant identification in vegetation plot images. In: Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum. CEUR Workshop Proceedings (2024)
  14. Grill, J.B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Pires, B., Guo, Z., Azar, M.G., Piot, B., Kavukcuoglu, K., Munos, R., Valko, M.: Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning. In: Advances in Neural Information Processing Systems (NeurIPS) (2020)
  15. He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum Contrast for Unsupervised Visual Representation Learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
  16. iNaturalist 2021: iNaturalist 2021 competition dataset. https://github.com/voxel51/inaturalist-2021 (2021)
  17. iNaturalist community: iNaturalist. https://www.inaturalist.org (2025)
  18. Ma, Y., Derksen, H., Hong, W., Wright, J.: Segmentation of Multivariate Mixed Data via Lossy Data Coding and Compression. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(9), 1546–1562 (2007)
  19. Mäder, P., Boho, D., Rzanny, M., Seeland, M., Wittich, H.C., Deggelmann, A., Wäldchen, J.: The Flora Incognita app – Interactive plant species identification. Methods in Ecology and Evolution 12(7), 1335–1342 (2021)
  20. Moutakanni, T., Oquab, M., Szafraniec, M., Vakalopoulou, M., Bojanowski, P.: You Don't Need Domain-Specific Data Augmentations When Scaling Self-Supervised Learning. In: Advances in Neural Information Processing Systems. vol. 37, pp. 116106–116125 (2024)
  21. Observation International: ObsIdentify: Wildlife and plant identification app. https://observation.org/apps/obsidentify/ (2026)
  22. Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.Y., Li, S.W., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jégou, H., Bojanowski, P., LeCun, Y., Caron, M.: DINOv2: Learning Robust Visual Features without Supervision. Transactions on Machine Learning Research (TMLR) (2023)
  23. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning Transferable Visual Models From Natural Language Supervision. In: Proceedings of the 38th International Conference on Machine Learning (ICML) (2021)
  24. Stevens, S., Wu, J., Thompson, M.J., Campolongo, E.G., Song, C.H., Carlyn, D.E., Dong, L., Dahdul, W.M., Stewart, C., Berger-Wolf, T., et al.: BioCLIP: A Vision Foundation Model for the Tree of Life. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 19412–19424 (2024)
  25. Wang, Y., Chao, W.L., Weinberger, K.Q., Van Der Maaten, L.: SimpleShot: Revisiting Nearest-Neighbor Classification for Few-Shot Learning. arXiv preprint arXiv:1911.04623 (2019)
  26. Wu, Z., Zhang, J., Pai, D., Wang, X., Singh, C., Yang, J., Gao, J., Ma, Y.: Simplifying DINO via Coding Rate Regularization. In: Proceedings of the International Conference on Machine Learning (ICML) (2025)
  27. Zbontar, J., Jing, L., Misra, I., LeCun, Y., Deny, S.: Barlow Twins: Self-Supervised Learning via Redundancy Reduction. In: International Conference on Machine Learning (ICML) (2021)