arxiv: 2606.21838 · v1 · pith:IC3RJRNAnew · submitted 2026-06-20 · 💻 cs.CV · cs.LG

Beyond Flat Labels: Level-Restricted Contrastive Learning for Hierarchical Fine-Grained Vision Classification

Pith reviewed 2026-06-26 12:35 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords hierarchical classification · contrastive learning · zero-shot learning · fine-grained vision · taxonomic levels · hierarchical consistency

The pith

Restricting contrastive comparisons to same taxonomic levels resolves hierarchical inconsistency in vision classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that multimodal contrastive learning over hierarchical labels yields predictions that are inconsistent across levels, because comparing labels from different levels creates false negative pairs. Restricting comparisons to same-level categories and balancing optimization across levels addresses this, improving both consistency and accuracy from coarse to fine. This matters for tasks such as species identification, where predictions must respect the taxonomy. On iNaturalist 2021, the model trained on TreeOfLife-10M improves average accuracy across levels by 30.47% over the baseline.

Core claim

By restricting contrastive comparisons to categories within the same taxonomic level and adopting a group-balanced design, the proposed framework improves both hierarchical consistency and classification accuracy from coarse to fine granularity.

What carries the argument

Level-restricted contrastive comparisons with group-balanced sampling across taxonomic levels.
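
The mechanism can be made concrete with a small sketch. Assuming a CLIP-style cross-entropy over image-to-text similarities, the code computes the loss separately for each taxonomic level, so the softmax denominator for a level contains only labels from that level, and then averages the per-level losses with equal weight. The equal weighting stands in for the group-balanced design, whose exact form the abstract does not specify; the function names, the toy similarity scores, and the temperature value are all hypothetical.

```python
import math

def softmax_cross_entropy(logits, target_index):
    """Cross-entropy of one row of logits against the index of the positive label."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target_index]

def level_restricted_loss(similarities, targets, temperature=0.07):
    """Average per-level contrastive loss for a single image.

    similarities: dict mapping level name -> list of image-text similarity
                  scores, one per candidate label at that level only.
    targets:      dict mapping level name -> index of the correct label.
    Each level gets its own softmax, so labels from other levels never
    appear as negatives. Equal weighting across levels is an assumption.
    """
    per_level = {
        level: softmax_cross_entropy([s / temperature for s in scores], targets[level])
        for level, scores in similarities.items()
    }
    return sum(per_level.values()) / len(per_level), per_level

# Toy similarity scores for one image (hypothetical values).
sims = {
    "family":  [0.9, 0.1],
    "genus":   [0.8, 0.3, 0.2],
    "species": [0.7, 0.6, 0.1, 0.0],
}
gold = {"family": 0, "genus": 0, "species": 0}
loss, per_level = level_restricted_loss(sims, gold)
```

In this toy case the species term dominates the average because two species candidates score close together, while the family term is nearly zero.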

If this is right

  • Improved hierarchical consistency in both Euclidean and hyperbolic embedding spaces.
  • Higher average classification accuracy across levels on benchmarks like iNaturalist 2021.
  • Effective zero-shot performance for fine-grained hierarchical vision tasks.
  • Better optimization for each taxonomic level through group balancing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This restriction might apply to hierarchical tasks in other fields like natural language processing or medical imaging.
  • Models could achieve consistency without post-hoc corrections for hierarchy violations.
  • Training on large hierarchical datasets like TreeOfLife could become standard for such models.

Load-bearing premise

False negative labels from multi-level contrastive comparisons cause the inconsistency, and same-level restriction plus balancing is sufficient to resolve it.
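
To illustrate the premise, consider a small illustrative taxonomy. If labels from every level are pooled into one candidate set and only the image's label at one level is treated as the positive, the image's own labels at the other levels land among the negatives even though they describe it correctly. The sketch counts those false negatives for a pooled set and for a same-level set; the helper name and the label space are assumptions made for this example.

```python
# Root-to-leaf path for one image, one label per level (illustrative).
image_path = {"family": "Felidae", "genus": "Panthera", "species": "Panthera leo"}

# Candidate labels per level in a toy label space.
labels_by_level = {
    "family":  ["Felidae", "Canidae"],
    "genus":   ["Panthera", "Felis", "Canis"],
    "species": ["Panthera leo", "Panthera tigris", "Felis catus", "Canis lupus"],
}

def false_negatives(positive_level, candidates):
    """Labels treated as negatives that are in fact correct for the image."""
    positive = image_path[positive_level]
    correct = set(image_path.values())
    return [c for c in candidates if c != positive and c in correct]

# Pooled comparison: every label from every level is a candidate.
pooled = [label for level in labels_by_level.values() for label in level]

pooled_fn = false_negatives("species", pooled)
restricted_fn = false_negatives("species", labels_by_level["species"])
```

Here `pooled_fn` contains the image's own family and genus labels, while `restricted_fn` is empty.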

What would settle it

If a model trained without level restriction but with explicit false-negative correction achieves comparable consistency and accuracy, it would challenge the necessity of the restriction.

Figures

Figures reproduced from arXiv: 2606.21838 by Elizabeth G Campolongo, Hilmar Lapp, Jianyang Gu, Matthew J Thompson, Nathan Jacobs, Net Zhang, Srikumar Sastry, Tanya Berger-Wolf, Wei-Lun Chao, Yu Su, Zhiyuan Tao, Ziheng Zhang.

Figure 1: Level-restricted contrastive learning. For each image, we construct taxonomic text labels at all hierarchical levels and encode images and texts using CLIP encoders. Instead of comparing text labels from all levels jointly, we constrain the contrastive comparison to be within the same level. This removes cross-level false negatives and enables balanced supervision across hierarchy levels. three species cla…
Figure 2: t-SNE visualization of text embeddings. Our method …

Original abstract

Multimodal contrastive learning has enabled zero-shot visual classification by aligning images with textual categories. However, in hierarchically structured label spaces, existing methods often produce predictions that are inconsistent across taxonomic levels. For example, a model may predict a fine-grained category whose parent category contradicts its simultaneously predicted higher-level label. By analysis, the issue originates from false negative labels when contrastive comparison involves multiple taxonomic levels. To this end, we propose to restrict contrastive comparisons to categories within the same taxonomic level. In addition, we adopt a group-balanced design, ensuring each taxonomic level receives adequate optimization. As a result, the proposed framework improves both hierarchical consistency and classification accuracy from coarse to fine granularity. We train our model with TreeOfLife-10M based on BioCLIP and evaluate it across multiple hierarchical classification benchmarks, where the model demonstrates significantly improved hierarchical consistency in both Euclidean and hyperbolic spaces. Notably, on iNaturalist 2021 (iNat21), our method improves average accuracy across levels by 30.47% over the baseline, highlighting its effectiveness for hierarchical zero-shot classification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The paper identifies false negative labels arising from multi-level contrastive comparisons as the source of hierarchical inconsistency in zero-shot vision classification (e.g., a fine-grained prediction whose parent label contradicts the coarse prediction). It proposes restricting contrastive pairs to same-taxonomic-level categories together with a group-balanced sampling design to ensure per-level optimization. The resulting model, trained on TreeOfLife-10M from BioCLIP, is evaluated on multiple hierarchical benchmarks and reports substantially higher hierarchical consistency plus a 30.47% gain in average accuracy across levels on iNaturalist 2021, with results shown in both Euclidean and hyperbolic embedding spaces.

Significance. If the central mechanism is isolated and the gains are reproducible, the work offers a lightweight, architecture-agnostic modification to contrastive objectives that directly targets a practical failure mode in hierarchical zero-shot classification. The reported scale of improvement on a large-scale biological dataset (iNat21) and compatibility with both Euclidean and hyperbolic spaces would make the approach immediately usable for fine-grained taxonomic tasks; the absence of new hyperparameters is also a practical strength.

major comments (3)
  1. [Abstract / §3] Abstract and §3 (method): the claim that multi-level false negatives are the root cause of inconsistency is asserted by analysis but never isolated. No ablation removes cross-level negatives while retaining the multi-level label structure, nor compares against an alternative that keeps cross-level signal but mitigates false negatives differently; therefore the sufficiency of the level-restriction step remains untested.
  2. [Abstract / Evaluation] Abstract and Evaluation section: the 30.47% average-accuracy improvement on iNat21 is reported without error bars, multiple random seeds, or statistical significance tests. In addition, no quantitative definition or explicit metric for “hierarchical consistency” is supplied, so the magnitude of the consistency gain cannot be independently verified.
  3. [Abstract] Abstract: the paper does not examine whether discarding cross-level contrastive signal introduces new failure modes (e.g., reduced transfer between coarse and fine levels or degraded performance on datasets whose hierarchy is less balanced). The group-balanced design is presented as a remedy, but its interaction with the restriction is not ablated.
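
Because the report notes that no definition of hierarchical consistency is supplied, the following is only one plausible reading and not the paper's metric: the fraction of samples whose independently predicted labels at each level form a valid parent-child chain. The `parent` map, the level order, and the predictions are hypothetical.

```python
def is_consistent(prediction, parent, levels):
    """True if each predicted label's parent equals the prediction one level up.

    prediction: dict level -> predicted label
    parent:     dict child label -> parent label
    levels:     level names ordered from coarse to fine
    """
    return all(
        parent.get(prediction[fine]) == prediction[coarse]
        for coarse, fine in zip(levels, levels[1:])
    )

def consistency_rate(predictions, parent, levels):
    """Fraction of samples whose per-level predictions respect the taxonomy."""
    if not predictions:
        return 0.0
    return sum(is_consistent(p, parent, levels) for p in predictions) / len(predictions)

levels = ["family", "genus", "species"]
parent = {
    "Panthera": "Felidae", "Canis": "Canidae",
    "Panthera leo": "Panthera", "Canis lupus": "Canis",
}
predictions = [
    # Every level agrees with its parent.
    {"family": "Felidae", "genus": "Panthera", "species": "Panthera leo"},
    # Predicted family contradicts the parent of the predicted genus.
    {"family": "Canidae", "genus": "Panthera", "species": "Panthera leo"},
]
rate = consistency_rate(predictions, parent, levels)
```

A stricter variant could also require every level to be correct, not just mutually compatible; which of these the paper intends is not stated in the material available here.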

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each of the major comments point by point below, and we will incorporate revisions to strengthen the paper where appropriate.

Point-by-point responses
  1. Referee: [Abstract / §3] Abstract and §3 (method): the claim that multi-level false negatives are the root cause of inconsistency is asserted by analysis but never isolated. No ablation removes cross-level negatives while retaining the multi-level label structure, nor compares against an alternative that keeps cross-level signal but mitigates false negatives differently; therefore the sufficiency of the level-restriction step remains untested.

    Authors: The analysis presented in the paper identifies the false negatives as arising specifically from cross-level comparisons in the contrastive objective. Our level-restricted approach directly addresses this by design. However, we agree that an explicit ablation isolating this mechanism would be beneficial. In the revised manuscript, we will include an additional experiment that attempts to retain multi-level structure while mitigating false negatives differently (e.g., by excluding known hierarchical relations from the negative set) to better test the sufficiency of the restriction. revision: yes

  2. Referee: [Abstract / Evaluation] Abstract and Evaluation section: the 30.47% average-accuracy improvement on iNat21 is reported without error bars, multiple random seeds, or statistical significance tests. In addition, no quantitative definition or explicit metric for “hierarchical consistency” is supplied, so the magnitude of the consistency gain cannot be independently verified.

    Authors: We acknowledge these omissions. We will update the evaluation section to include a precise definition and formula for the hierarchical consistency metric. Additionally, we will conduct experiments with multiple random seeds, report standard deviations as error bars, and perform statistical significance tests for the reported improvements. revision: yes

  3. Referee: [Abstract] Abstract: the paper does not examine whether discarding cross-level contrastive signal introduces new failure modes (e.g., reduced transfer between coarse and fine levels or degraded performance on datasets whose hierarchy is less balanced). The group-balanced design is presented as a remedy, but its interaction with the restriction is not ablated.

    Authors: We agree that exploring potential drawbacks of discarding cross-level signals is important. In the revision, we will add a dedicated discussion on possible new failure modes and evaluate the method on additional hierarchical datasets with varying balance. We will also include an ablation study that isolates the contribution of the group-balanced sampling in conjunction with the level restriction. revision: yes
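
The alternative mentioned in the first response, keeping cross-level comparisons but excluding known hierarchical relations from the negative set, could look roughly like the sketch below. It is an assumption about how such a baseline might be built, not the authors' planned experiment: labels on the image's own root-to-leaf path, other than the positive, are masked out of the softmax denominator. The function name, scores, and temperature are hypothetical.

```python
import math

def masked_contrastive_loss(scores, positive, own_path, temperature=0.07):
    """Cross-entropy over pooled labels with the image's other correct labels masked.

    scores:   dict label -> image-text similarity, labels pooled over all levels
    positive: the label used as the positive for this term
    own_path: set of labels that are correct for the image at any level
    """
    kept = {
        label: s / temperature
        for label, s in scores.items()
        if label == positive or label not in own_path
    }
    peak = max(kept.values())
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in kept.values()))
    return log_norm - kept[positive]

scores = {
    "Felidae": 0.8, "Canidae": 0.1,            # family level
    "Panthera": 0.8, "Canis": 0.1,             # genus level
    "Panthera leo": 0.7, "Canis lupus": 0.0,   # species level
}
path = {"Felidae", "Panthera", "Panthera leo"}

masked = masked_contrastive_loss(scores, "Panthera leo", path)
# Passing an empty path keeps every label, i.e. the uncorrected pooled loss.
unmasked = masked_contrastive_loss(scores, "Panthera leo", own_path=set())
```

In this toy case the unmasked loss is much larger, because the image's own family and genus labels score highly yet are counted as negatives.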

Circularity Check

0 steps flagged

No circularity: empirical method with external validation

The paper proposes level-restricted contrastive learning plus group balancing to address false-negative issues in hierarchical labels, then reports accuracy and consistency gains on iNat21 and other benchmarks after training on TreeOfLife-10M. No equations, uniqueness theorems, or self-citations are invoked that reduce the claimed improvements to the inputs by construction. The central results are measured outcomes against baselines rather than algebraic identities or fitted quantities renamed as predictions. Self-citations, if present, are not load-bearing for the derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated. The method appears to rest on standard contrastive learning assumptions plus the new level-restriction rule.

pith-pipeline@v0.9.1-grok · 5771 in / 1186 out tokens · 23203 ms · 2026-06-26T12:35:09.710219+00:00 · methodology

Reference graph

Works this paper leans on

37 extracted references · 4 canonical work pages · 2 internal anchors

  1. [1]

    Emergent visual-semantic hierarchies in image-text representations

    Morris Alper and Hadar Averbuch-Elor. Emergent visual-semantic hierarchies in image-text representations. In European Conference on Computer Vision, pages 220–238. Springer, 2024.

  2. [2]

    Making better mistakes: Leveraging class hierarchies with deep networks

    Luca Bertinetto, Romain Mueller, Konstantinos Tertikas, Sina Samangooei, and Nicholas A Lord. Making better mistakes: Leveraging class hierarchies with deep networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12506–12515, 2020.

  3. [3]

    Metric spaces of non-positive curvature

    Martin R Bridson and André Haefliger. Metric spaces of non-positive curvature. Springer Science & Business Media,

  4. [4]

    Ohio supercomputer center,

    Ohio Supercomputer Center. Ohio supercomputer center,

  5. [5]

    Label relation graphs enhanced hierarchical residual network for hierarchical multi-granularity classification

    Jingzhou Chen, Peng Wang, Jian Liu, and Yuntao Qian. Label relation graphs enhanced hierarchical residual network for hierarchical multi-granularity classification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4858–4867, 2022.

  6. [6]

    Hyperbolic image-text representations

    Karan Desai, Maximilian Nickel, Tanmay Rajpurohit, Justin Johnson, and Shanmukha Ramakrishna Vedantam. Hyperbolic image-text representations. In International Conference on Machine Learning, pages 7694–7731. PMLR, 2023.

  7. [7]

    Multi-level supervised contrastive learning

    Naghmeh Ghanooni, Barbod Pajoum, Harshit Rawal, Sophie Fellenz, Vo Nguyen Le Duy, and Marius Kloft. Multi-level supervised contrastive learning. arXiv preprint arXiv:2502.02202, 2025.

  8. [8]

    BioCLIP 2: Emergent properties from scaling hierarchical contrastive learning

    Jianyang Gu, Samuel Stevens, Elizabeth G Campolongo, Matthew J Thompson, Net Zhang, Jiaman Wu, Andrei Kopanev, Zheda Mai, Alexander E. White, James Balhoff, Wasila Dahdul, Daniel Rubenstein, Hilmar Lapp, Tanya Berger-Wolf, Wei-Lun Chao, and Yu Su. BioCLIP 2: Emergent properties from scaling hierarchical contrastive learning. In The Thirty-ninth Annual Co...

  9. [9]

    Openclip

    Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip, 2021.

  10. [10]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR, 2021.

  11. [11]

    Hierarchical multi-granularity classification based on bidirectional knowledge transfer

    Juan Jiang, Jingmin Yang, Wenjie Zhang, and Hongbin Zhang. Hierarchical multi-granularity classification based on bidirectional knowledge transfer. Multimedia Systems, 30(4):207,

  12. [12]

    Evaluation measures for hierarchical classification: a unified view and novel approaches

    Aris Kosmopoulos, Ioannis Partalas, Eric Gaussier, Georgios Paliouras, and Ion Androutsopoulos. Evaluation measures for hierarchical classification: a unified view and novel approaches. Data Mining and Knowledge Discovery, 29(3):820–865, 2015.

  13. [13]

    Learning label hierarchy with supervised contrastive learning

    Ruixue Lian, William Sethares, and Junjie Hu. Learning label hierarchy with supervised contrastive learning. In Findings of the Association for Computational Linguistics: EACL 2024, pages 1569–1581, 2024.

  14. [14]

    Crypticbio: A large multimodal dataset for visually confusing species

    Georgiana Manolache, Gerard Schouten, and Joaquin Vanschoren. Crypticbio: A large multimodal dataset for visually confusing species. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025.

  15. [15]

    Poincaré embeddings for learning hierarchical representations

    Maximillian Nickel and Douwe Kiela. Poincaré embeddings for learning hierarchical representations. NeurIPS, 30, 2017.

  16. [16]

    Learning continuous hierarchies in the lorentz model of hyperbolic geometry

    Maximillian Nickel and Douwe Kiela. Learning continuous hierarchies in the lorentz model of hyperbolic geometry. In International conference on machine learning, pages 3779–

  17. [17]

    Compositional entailment learning for hyperbolic vision-language models

    Avik Pal, Max Van Spengler, Guido Maria D’Amely di Melendugno, Alessandro Flaborea, Fabio Galasso, and Pascal Mettes. Compositional entailment learning for hyperbolic vision-language models. arXiv preprint arXiv:2410.06912,

  18. [18]

    Visually consistent hierarchical image classification

    Seulki Park, Youren Zhang, Stella X. Yu, Sara Beery, and Jonathan Huang. Visually consistent hierarchical image classification. In The Thirteenth International Conference on Learning Representations, 2025.

  19. [19]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.

  20. [20]

    Accept the modality gap: An exploration in the hyperbolic space

    Sameera Ramasinghe, Violetta Shevchenko, Gil Avraham, and Ajanthan Thalaiyasingam. Accept the modality gap: An exploration in the hyperbolic space. In CVPR, pages 27263–27272, 2024.

  21. [21]

    Global and local entailment learning for natural world imagery

    Srikumar Sastry, Aayush Dhakal, Eric Xing, Subash Khanal, and Nathan Jacobs. Global and local entailment learning for natural world imagery. In ICCV. IEEE/CVF, 2025.

  22. [22]

    Taxabind: A unified embedding space for ecological applications

    Srikumar Sastry, Subash Khanal, Aayush Dhakal, Adeel Ahmad, and Nathan Jacobs. Taxabind: A unified embedding space for ecological applications. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1765–1774. IEEE, 2025.

  23. [23]

    A survey of hierarchical classification across different application domains

    Carlos N Silla Jr and Alex A Freitas. A survey of hierarchical classification across different application domains. Data mining and knowledge discovery, 22(1):31–72, 2011.

  24. [24]

    Learning structured representations with hyperbolic embeddings

    Aditya Sinha, Siqi Zeng, Makoto Yamada, and Han Zhao. Learning structured representations with hyperbolic embeddings. Advances in Neural Information Processing Systems, 37:91220–91259, 2024.

  25. [25]

    Bioclip (revision 7b4abf1)

    Samuel Stevens, Jiaman Wu, Matthew J. Thompson, Elizabeth G. Campolongo, Chan Hee Song, David Edward Carlyn, Li Dong, Wasila M. Dahdul, Charles Stewart, Tanya Berger-Wolf, Wei-Lun Chao, and Yu Su. Bioclip (revision 7b4abf1),

  26. [26]

    Rare species

    Samuel Stevens, Jiaman Wu, Matthew J Thompson, Elizabeth G Campolongo, Chan Hee Song, David Edward Carlyn, Li Dong, Wasila M Dahdul, Charles Stewart, Tanya Berger-Wolf, Wei-Lun Chao, and Yu Su. Rare species, 2023.

  27. [27]

    BioCLIP: A vision foundation model for the tree of life

    Samuel Stevens, Jiaman Wu, Matthew J Thompson, Elizabeth G Campolongo, Chan Hee Song, David Edward Carlyn, Li Dong, Wasila M Dahdul, Charles Stewart, Tanya Berger-Wolf, Wei-Lun Chao, and Yu Su. BioCLIP: A vision foundation model for the tree of life. In CVPR, pages 19412–19424,

  28. [28]

    TreeOfLife-10M (Revision ffa2a31)

    Samuel Stevens, Jiaman Wu, Matthew J Thompson, Elizabeth G Campolongo, Chan Hee Song, David Edward Carlyn, Li Dong, Wasila M Dahdul, Charles Stewart, Tanya Berger-Wolf, Wei-Lun Chao, and Yu Su. TreeOfLife-10M (Revision ffa2a31), 2026.

  29. [29]

    inat challenge 2021 - fgvc8

    Grant Van Horn and Oisin Mac Aodha. inat challenge 2021 - fgvc8, 2021.

  30. [30]

    Hierarchical multi-label classification networks

    Jonatas Wehrmann, Ricardo Cerri, and Rodrigo Barros. Hierarchical multi-label classification networks. In International conference on machine learning, pages 5075–5084. PMLR,

  31. [31]

    Hgclip: Exploring vision-language models with graph representations for hierarchical understanding

    Peng Xia, Xingtong Yu, Ming Hu, Lie Ju, Zhiyong Wang, Peibo Duan, and Zongyuan Ge. Hgclip: Exploring vision-language models with graph representations for hierarchical understanding. In Proceedings of the 31st International Conference on Computational Linguistics, pages 269–280, 2025.

  32. [32]

    Biotrove: A large curated image dataset enabling ai for biodiversity

    Chih-Hsuan Yang, Benjamin Feuer, Talukder Jubery, Zi Deng, Andre Nakkab, Md Zahid Hasan, Shivani Chiranjeevi, Kelly Marshall, Nirmal Baishnab, Asheesh Singh, et al. Biotrove: A large curated image dataset enabling ai for biodiversity. Advances in Neural Information Processing Systems, 37:102101–102120, 2024.

  33. [33]

    CoCa: Contrastive Captioners are Image-Text Foundation Models

    Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.

  34. [34]

    BioCAP: Exploiting synthetic captions beyond labels in biological foundation models

    Ziheng Zhang, Xinyue Ma, Arpita Chowdhury, Elizabeth G Campolongo, Matthew J Thompson, Net Zhang, Samuel Stevens, Hilmar Lapp, Tanya Berger-Wolf, Yu Su, Wei-Lun Chao, and Jianyang Gu. BioCAP: Exploiting synthetic captions beyond labels in biological foundation models. In The Fourteenth International Conference on Learning Representations, 2026.

  35. [35]

    B-CNN: Branch Convolutional Neural Network for Hierarchical Classification

    Xinqi Zhu and Michael Bain. B-cnn: branch convolutional neural network for hierarchical classification. arXiv preprint arXiv:1709.09890, 2017.

  36. [36]

    Hierarchical Image Classification

    Related Work. 5.1. Hierarchical Image Classification. Hierarchical image classification incorporates taxonomy structure into visual recognition [23]. Early approaches introduced hierarchical supervision into convolutional networks to model coarse-to-fine label dependencies [30, 35]. More recent methods integrate hierarchy into deep architectures throug...

  37. [37]

    Implementation Details

    Implementation Details. We initialize from the BioCLIP ViT-B/16 model [25, 27] and fine-tune it for 30 epochs using AdamW with batch size 4096, learning rate 10^-4, and weight decay 0.2. A single shared temperature parameter is used across all taxonomy levels. Taxonomic text labels are constructed using cumulative taxonomy prefixes; for example, a specie...
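
A rough sketch of the cumulative-prefix construction mentioned in the implementation details. The source text is truncated before its own example, so the rank names, the space separator, and the sample taxon are assumptions chosen for illustration and may differ from the paper's actual text format.

```python
def cumulative_prefix_labels(ranks, names, separator=" "):
    """Build one text label per level by joining all names down to that level.

    ranks: rank names ordered from coarse to fine
    names: the taxon's name at each rank, in the same order
    """
    if len(ranks) != len(names):
        raise ValueError("ranks and names must have the same length")
    return {
        rank: separator.join(names[: depth + 1])
        for depth, rank in enumerate(ranks)
    }

# Illustrative taxon; the seven-rank layout is an assumption.
ranks = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]
names = ["Animalia", "Chordata", "Mammalia", "Carnivora", "Felidae", "Panthera", "leo"]
labels = cumulative_prefix_labels(ranks, names)
```

Under these assumptions each finer label extends the coarser one, so the label for a level always begins with the label of the level above it.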