Recognition: unknown
OmniGCD: Abstracting Generalized Category Discovery for Modality Agnosticism
Pith reviewed 2026-05-10 11:15 UTC · model grok-4.3
The pith
A model trained only on synthetic data performs zero-shot category discovery across vision, text, audio, and remote sensing without fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OmniGCD routes inputs through modality-specific encoders, reduces their dimension to form a GCD latent space, and applies a transformer trained exclusively on synthetic data to convert that space into a form better suited for clustering at test time. Trained once on synthetic data, the system then carries out zero-shot generalized category discovery on sixteen datasets across four modalities, yielding higher accuracy on both known and novel classes than prior baselines.
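The pipeline described above can be sketched in a few lines. This is a minimal illustration, not OmniGCD's actual code: plain PCA stands in for the paper's dimension-reduction step, and an identity `transform` marks where the synthetically trained Transformer would act. All function names here are ours.

```python
# Sketch of the OmniGCD-style pipeline: frozen encoder features ->
# dimension reduction into a "GCD latent space" -> test-time transformation
# -> clustering over known + novel classes. Illustrative only.
import numpy as np

def pca_reduce(x: np.ndarray, dim: int = 2) -> np.ndarray:
    """Project (N, D) features onto their top `dim` principal components."""
    x = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return x @ vt[:dim].T

def kmeans(x: np.ndarray, k: int, iters: int = 50, seed: int = 0) -> np.ndarray:
    """Plain Lloyd's k-means; returns one cluster label per row."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((x[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = x[labels == j].mean(axis=0)
    return labels

def zero_shot_gcd(features: np.ndarray, n_clusters: int, transform=lambda z: z):
    """Cluster known + novel classes jointly, with no per-dataset fine-tuning.
    `transform` is the slot the synthetically trained Transformer would fill."""
    latent = pca_reduce(features)      # GCD latent space
    refined = transform(latent)        # test-time transformation
    return kmeans(refined, n_clusters)
```

The point of the design is the decoupling: the encoder and reducer are fixed per modality, while the transformation applied before clustering is trained once and reused everywhere.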
What carries the argument
The synthetically trained Transformer-based model that transforms the dimension-reduced GCD latent space at test time into a representation suitable for clustering
Load-bearing premise
A transformer trained exclusively on synthetic data can turn reduced-dimensional representations from real inputs of any of the four modalities into forms suitable for clustering with no further adaptation or fine-tuning
What would settle it
Evaluating the trained model on additional real datasets from one of the four modalities and checking whether known-class and novel-class accuracies exceed those of standard baselines in the zero-shot setting
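The accuracy comparison described above is conventionally scored with clustering accuracy under an optimal cluster-to-class matching, the standard GCD metric. A minimal sketch, with brute-force matching in place of the usual Hungarian algorithm (function name is ours, not the paper's):

```python
# Clustering accuracy with optimal cluster-to-class assignment.
# Brute force is fine for small class counts; real GCD evaluations
# use the Hungarian algorithm for the same matching.
from itertools import permutations
import numpy as np

def cluster_accuracy(y_true, y_pred) -> float:
    """Fraction of samples correct under the best cluster -> class relabeling."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = int(max(y_true.max(), y_pred.max())) + 1
    best = 0.0
    for perm in permutations(range(k)):
        acc = np.mean([perm[p] == t for t, p in zip(y_true, y_pred)])
        best = max(best, float(acc))
    return best
```

Because cluster indices are arbitrary, a clustering that perfectly separates the classes but swaps their labels still scores 1.0; the same metric is reported separately over known-class and novel-class subsets.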
Figures
Original abstract
Generalized Category Discovery (GCD) challenges methods to identify known and novel classes using partially labeled data, mirroring human category learning. Unlike prior GCD methods, which operate within a single modality and require dataset-specific fine-tuning, we propose a modality-agnostic GCD approach inspired by the human brain's abstract category formation. Our $\textbf{OmniGCD}$ leverages modality-specific encoders (e.g., vision, audio, text, remote sensing) to process inputs, followed by dimension reduction to construct a $\textbf{GCD latent space}$, which is transformed at test-time into a representation better suited for clustering using a novel synthetically trained Transformer-based model. To evaluate OmniGCD, we introduce a $\textbf{zero-shot GCD setting}$ where no dataset-specific fine-tuning is allowed, enabling modality-agnostic category discovery. $\textbf{Trained once on synthetic data}$, OmniGCD performs zero-shot GCD across 16 datasets spanning four modalities, improving classification accuracy for known and novel classes over baselines (average percentage point improvement of $\textbf{+6.2}$, $\textbf{+17.9}$, $\textbf{+1.5}$ and $\textbf{+12.7}$ for vision, text, audio and remote sensing). This highlights the importance of strong encoders while decoupling representation learning from category discovery. Improving modality-agnostic methods will propagate across modalities, enabling encoder development independent of GCD. Our work serves as a benchmark for future modality-agnostic GCD works, paving the way for scalable, human-inspired category discovery. All code is available $\href{https://github.com/Jordan-HS/OmniGCD}{here}$
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OmniGCD, a modality-agnostic approach to Generalized Category Discovery (GCD). Modality-specific encoders produce inputs that are dimension-reduced into a GCD latent space; a Transformer trained exclusively on synthetic data then transforms this space at test time into a representation better suited for clustering known and novel classes. The method is evaluated in a zero-shot GCD protocol (no dataset-specific fine-tuning) across 16 datasets spanning vision, text, audio, and remote sensing, reporting average accuracy gains of +6.2, +17.9, +1.5, and +12.7 percentage points respectively over baselines.
Significance. If the zero-shot synthetic-to-real transfer holds, the work would be significant for decoupling representation learning from category discovery and for establishing a benchmark that allows encoder development to proceed independently of GCD. The emphasis on abstract, human-inspired category formation across modalities could influence future multi-modal discovery methods.
major comments (2)
- [Abstract] Abstract and experimental results: the headline claim of average percentage-point gains (+6.2, +17.9, +1.5, +12.7) supplies no information on the concrete baseline methods, statistical significance, error bars, data splits, or controls for synthetic-data leakage; without these the support for the central empirical claim cannot be assessed.
- [Method] Method section (description of the synthetically trained Transformer): the central assumption that a single Transformer maps dimension-reduced real embeddings from four modalities into a clustering-friendly space without per-dataset adaptation lacks direct validation; experiments demonstrating that the learned mapping is not distribution-specific to the synthetic training data are required.
minor comments (1)
- [Abstract] The code repository link is provided, supporting reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below, providing clarifications from the full manuscript and indicating where revisions will strengthen the presentation.
Point-by-point responses
- Referee: [Abstract] Abstract and experimental results: the headline claim of average percentage-point gains (+6.2, +17.9, +1.5, +12.7) supplies no information on the concrete baseline methods, statistical significance, error bars, data splits, or controls for synthetic-data leakage; without these the support for the central empirical claim cannot be assessed.
Authors: We agree the abstract is concise and omits these specifics. The full manuscript details the baselines in Section 4.2 (standard GCD methods such as GCD, SimGCD, and ORCA adapted to the zero-shot protocol), reports mean accuracy with standard deviation over five runs using different random seeds, follows the canonical data splits and known/novel class partitions for each of the 16 datasets, and prevents synthetic-data leakage by generating purely synthetic samples via a controlled procedure with no real-data overlap (Section 3.1). We will revise the abstract to briefly reference these elements. revision: partial
- Referee: [Method] Method section (description of the synthetically trained Transformer): the central assumption that a single Transformer maps dimension-reduced real embeddings from four modalities into a clustering-friendly space without per-dataset adaptation lacks direct validation; experiments demonstrating that the learned mapping is not distribution-specific to the synthetic training data are required.
Authors: The zero-shot results across 16 real datasets from four modalities already provide strong indirect evidence that the mapping generalizes beyond the synthetic distribution, as overfitting to synthetic data would not yield consistent gains on diverse real embeddings. To supply the requested direct validation, we will add an ablation study in the revision that evaluates the Transformer on held-out synthetic distributions and on mixtures of synthetic and real embeddings, confirming the mapping is not distribution-specific. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper's derivation consists of an explicit pipeline (modality encoders to dimension-reduced GCD latent space, followed by a Transformer trained exclusively on synthetic data and applied zero-shot) whose central claim is an empirical performance gain on 16 held-out real datasets. No equations or steps reduce by construction to fitted inputs, self-definitions, or self-citation chains; the synthetic training set is distinct from evaluation data, the zero-shot protocol forbids per-dataset adaptation, and no load-bearing uniqueness theorems or ansatzes are invoked. The reported improvements are therefore falsifiable experimental outcomes rather than tautological restatements of the method's inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Modality-specific encoders followed by dimension reduction yield a GCD latent space that is amenable to clustering after synthetic Transformer transformation.
invented entities (2)
- GCD latent space: no independent evidence
- Synthetically trained Transformer-based model: no independent evidence
Reference graph
Works this paper leans on
- [1] Wenbin An, Wenkai Shi, Feng Tian, Haonan Lin, QianYing Wang, Yaqiang Wu, Mingxiang Cai, Luyan Wang, Yan Chen, Haiping Zhu, and Ping Chen. Generalized category discovery with large language models in the loop. In Findings of ACL, 2024.
- [2] F. Gregory Ashby and W. Todd Maddox. Human category learning. Annu. Rev. Psychol., 56(1):149–178, 2005.
- [3] Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15619–15629, 2023.
- [4] R. Bellman. Dynamic Programming. Princeton University Press, 1957.
- [5] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.
- [6] Íñigo Casanueva, Tadas Temčinas, Daniela Gerz, Matthew Henderson, and Ivan Vulić. Efficient intent detection with dual sentence encoders. In Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, 2020.
- [7] Fernando Julio Cendra, Bingchen Zhao, and Kai Han. PromptCCD: Learning gaussian mixture prompt pool for continual category discovery. In European Conference on Computer Vision, pages 188–205, 2024.
- [8] Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. Adv. Neural Inf. Process. Syst., 34:15084–15097, 2021.
- [9] Gong Cheng, Junwei Han, and Xiaoqiang Lu. Remote sensing image scene classification: Benchmark and state of the art. Proc. IEEE, 105(10):1865–1883, 2017.
- [10] Florent Chiaroni, Jose Dolz, Ziko Imtiaz Masud, Amar Mitiche, and Ismail Ben Ayed. Parametric information maximization for generalized category discovery. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1729–1739, 2023.
- [11] Sua Choi, Dahyun Kang, and Minsu Cho. Contrastive mean-shift learning for generalized category discovery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23094–23104, 2024.
- [12] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2009.
- [13] Julien Dirani and Liina Pylkkänen. MEG evidence that modality-independent conceptual representations contain semantic and visual features. J. Neurosci., 44(27), 2024.
- [14] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
- [15] Ruoyi Du, Dongliang Chang, Kongming Liang, Timothy Hospedales, Yi-Zhe Song, and Zhanyu Ma. On-the-fly category discovery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11691–11700, 2023.
- [16] David J. Freedman, Maximilian Riesenhuber, Tomaso Poggio, and Earl K. Miller. Categorical representation of visual stimuli in the primate prefrontal cortex. Science, 291(5502):312–316, 2001.
- [17] Lan-Zhe Guo, Yi-Ge Zhang, Zhi-Fan Wu, Jie-Jing Shao, and Yu-Feng Li. Robust semi-supervised learning when not all classes have labels. Adv. Neural Inf. Process. Syst., 35:3305–3317, 2022.
- [18] Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens., 2019.
- [19] Andrew G. Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, Quoc V. Le, and Hartwig Adam. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019.
- [20] Sangwon Jung, Tristan Dagobert, Jean-Michel Morel, and Gabriele Facciolo. A review of t-SNE. Image Processing On Line, 14:250–270, 2024.
- [21] Yaelan Jung, Bart Larsen, and Dirk B. Walther. Modality-independent coding of scene categories in prefrontal cortex. J. Neurosci., 38(26):5969–5981, 2018.
- [22] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
- [23] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3D object representations for fine-grained categorization. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 554–561, 2013.
- [24] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
- [25] Stefan Larson, Anish Mahendran, Joseph J. Peper, Christopher Clarke, Andrew Lee, Parker Hill, Jonathan K. Kummerfeld, Kevin Leach, Michael A. Laurenzano, Lingjia Tang, and Jason Mars. An evaluation dataset for intent classification and out-of-scope prediction. In Proceedings of EMNLP-IJCNLP, 2019.
- [26] Yizhi Li, Ruibin Yuan, Ge Zhang, Yinghao Ma, Xingran Chen, Hanzhi Yin, Chenghao Xiao, Chenghua Lin, Anton Ragni, Emmanouil Benetos, et al. MERT: Acoustic music understanding model with large-scale self-supervised training. In International Conference on Learning Representations, 2024.
- [27] Amy Lin, Jason Y. Zhang, Deva Ramanan, and Shubham Tulsiani. RelPose++: Recovering 6D poses from sparse-view observations. In 2024 Int. Conf. 3D Vis. (3DV), 2024.
- [28] Jiaming Liu, Yangqiming Wang, Tongze Zhang, Yulu Fan, Qinli Yang, and Junming Shao. Open-world semi-supervised novel class discovery. arXiv preprint arXiv:2305.13095, 2023.
- [29] Yuanpei Liu and Kai Han. DebGCD: Debiased learning with distribution guidance for generalized category discovery. In International Conference on Learning Representations, 2025.
- [30] Yuanpei Liu, Zhenqi He, and Kai Han. Hyperbolic category discovery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9891–9900, 2025.
- [31] S. Lloyd. Least squares quantization in PCM. IEEE Trans. Inf. Theory, 1982.
- [32] Shijie Ma, Fei Zhu, Zhun Zhong, Wenzhuo Liu, Xu-Yao Zhang, and Cheng-Lin Liu. Happy: A debiased learning framework for continual generalized category discovery. Adv. Neural Inf. Process. Syst., 37:50850–50875, 2024.
- [33] Shijie Ma, Fei Zhu, Zhun Zhong, Xu-Yao Zhang, and Cheng-Lin Liu. Active generalized category discovery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16890–16900, 2024.
- [34] S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft, 2013.
- [35] Leland McInnes, John Healy, Nathaniel Saul, and Lukas Grossberger. UMAP: Uniform manifold approximation and projection. J. Open Source Softw., 3(29):861, 2018.
- [36] Earl K. Miller and Jonathan D. Cohen. An integrative theory of prefrontal cortex function. Annu. Rev. Neurosci., 24(1):167–202, 2001.
- [37] Earl K. Miller, David J. Freedman, and Jonathan D. Wallis. The prefrontal cortex: Categories, concepts and cognition. Philos. Trans. R. Soc. Lond. B Biol. Sci., 357(1424):1123–1136, 2002.
- [38] Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Hervé Jégou, Julien Mairal, P… arXiv, 2023.
- [39] Jona Otholt, Christoph Meinel, and Haojin Yang. Guided cluster aggregation: A hierarchical approach to generalized category discovery. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2618–2627, 2024.
- [40] Karl Pearson. LIII. On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(11):559–572, 1901.
- [41] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI technical report, 2019.
- [42] Sarah Rastegar, Mohammadreza Salehi, Yuki M. Asano, Hazel Doughty, and Cees G. M. Snoek. SelEx: Self-expertise in fine-grained generalized category discovery. In European Conference on Computer Vision, pages 440–458, 2024.
- [43] Justin Salamon, Christopher Jacoby, and Juan Pablo Bello. A dataset and taxonomy for urban sound research. In ACM MM, pages 1041–1044, 2014.
- [44] Philippe Schwaller, Teodoro Laino, Théophile Gaudin, Peter Bolgar, Christopher A. Hunter, Costas Bekas, and Alpha A. Lee. Molecular transformer: A model for uncertainty-calibrated chemical reaction prediction. ACS Central Science, 5(9):1572–1583, 2019.
- [45] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025.
- [46] Samarth Sinha, Jason Y. Zhang, Andrea Tagliasacchi, Igor Gilitschenski, and David B. Lindell. SparsePose: Sparse-view camera pose regression and refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21349–21359, 2023.
- [47] Kiat Chuan Tan, Yulong Liu, Barbara Ambrose, Melissa Tulig, and Serge Belongie. The Herbarium Challenge 2019 dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019.
- [48] Hao Tang, Weiyao Wang, Pierre Gleize, and Matt Feiszli. ADen: Adaptive density representations for sparse-view camera pose estimation. In European Conference on Computer Vision, 2024.
- [49] Luyao Tang, Kunze Huang, Chaoqi Chen, Yuxuan Yuan, Chenxin Li, Xiaotong Tu, Xinghao Ding, and Yue Huang. Dissecting generalized category discovery: Multiplex consensus under self-deconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
- [50] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. J. Mach. Learn. Res., 9:2579–2605, 2008.
- [51] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Adv. Neural Inf. Process. Syst., 30, 2017.
- [52] Sagar Vaze, Kai Han, Andrea Vedaldi, and Andrew Zisserman. Generalized category discovery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
- [53] Sagar Vaze, Andrea Vedaldi, and Andrew Zisserman. No representation rules them all in category discovery. Adv. Neural Inf. Process. Syst., 36:19962–19989, 2023.
- [54] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
- [55] Enguang Wang, Zhimao Peng, Zhengyuan Xie, Fei Yang, Xialei Liu, and Ming-Ming Cheng. GET: Unlocking the multi-modal potential of CLIP for generalized category discovery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20296–20306, 2025.
- [56] Jianyuan Wang, Christian Rupprecht, and David Novotny. PoseDiffusion: Solving pose estimation via diffusion-aided bundle adjustment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9773–9783, 2023.
- [57] Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533, 2022.
- [58] Philippe Weinzaepfel, Vincent Leroy, Thomas Lucas, Romain Brégier, Yohann Cabon, Vaibhav Arora, Leonid Antsfeld, Boris Chidlovskii, Gabriela Csurka, and Jérôme Revaud. CroCo: Self-supervised pre-training for 3D vision tasks by cross-view completion. Adv. Neural Inf. Process. Syst., 35:3502–3516, 2022.
- [59] Xin Wen, Bingchen Zhao, and Xiaojuan Qi. Parametric classification for generalized category discovery: A baseline study. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16590–16600, 2023.
- [60] Julia Wilkins, Prem Seetharaman, Alison Wahl, and Bryan Pardo. VocalSet: A singing voice dataset, 2018.
- [61] Yanan Wu, Zhixiang Chi, Yang Wang, and Songhe Feng. MetaGCD: Learning to continually learn in generalized category discovery. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1655–1665, 2023.
- [62] Tian Xie, Jifan Zhang, Haoyue Bai, and Robert Nowak. Deep active learning in the open world. arXiv preprint arXiv:2411.06353, 2024.
- [63] Zhitong Xiong, Yi Wang, Fahong Zhang, Adam J. Stewart, Joëlle Hanna, Damian Borth, Ioannis Papoutsis, Bertrand Le Saux, Gustau Camps-Valls, and Xiao Xiang Zhu. Neural plasticity-inspired foundation model for observing the Earth crossing modalities. arXiv preprint arXiv:2403.15356, 2024.
- [64] Jiaming Xu, Peng Wang, Guanhua Tian, Bo Xu, Jun Zhao, Fangyuan Wang, and Hongwei Hao. Short text clustering via convolutional neural networks. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pages 62–69, 2015.
- [65] Wenjia Xu, Zijian Yu, Zhiwei Wei, Jiuniu Wang, and Mugen Peng. Generalized category discovery for remote sensing image scene classification. In IGARSS, pages 8731–…, 2024.
- [66] Muli Yang, Jie Yin, Yanan Gu, Cheng Deng, Hanwang Zhang, and Hongyuan Zhu. Consistent prompt tuning for generalized category discovery. IJCV, pages 1–28, 2025.
- [67] Yi Yang and Shawn Newsam. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the 18th SIGSPATIAL Int. Conf. Advances in Geographic Information Systems, pages 270–279, 2010.
- [68] Seongjun Yun, Minbyul Jeong, Raehyun Kim, Jaewoo Kang, and Hyunwoo J. Kim. Graph transformer networks. Adv. Neural Inf. Process. Syst., 32, 2019.
- [69] Bingchen Zhao, Xin Wen, and Kai Han. Learning semi-supervised gaussian mixture models for generalized category discovery. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16623–16633, 2023.
- [70] Haiyang Zheng, Nan Pu, Wenjing Li, Nicu Sebe, and Zhun Zhong. Prototypical hash encoding for on-the-fly fine-grained category discovery. Adv. Neural Inf. Process. Syst., 37:101428–101455, 2024.
- [71] Haiyang Zheng, Nan Pu, Wenjing Li, Nicu Sebe, and Zhun Zhong. Textual knowledge matters: Cross-modality co-teaching for generalized visual class discovery. In European Conference on Computer Vision, pages 41–58, 2024.
- [72] Yifan Zhou, Haoran Zhu, Yan Zhang, Shuo Liang, Yujing Wang, and Wen Yang. Generalized category discovery in aerial image classification via slot attention. Drones, 8(4):160, 2024.
- [73] Xiao Xiang Zhu, Jingliang Hu, Chunping Qiu, Yilei Shi, Jian Kang, Lichao Mou, Hossein Bagheri, Matthias Haberle, Yuansheng Hua, Rong Huang, Lloyd Hughes, Hao Li, Yao Sun, Guichen Zhang, Shiyao Han, Michael Schmitt, and Yuanyuan Wang. So2Sat LCZ42: A benchmark data set for the classification of global local climate zones [software and data sets]. IEEE Geosci. …
- [74] Henry Peng Zou, Siffi Singh, Yi Nian, Jianfeng He, Jason Cai, Saab Mansour, and Hang Su. GLEAN: Generalized category discovery with diverse and quality-enhanced LLM feedback. arXiv preprint arXiv:2502.18414, 2025.

Supplementary Material for "OmniGCD: Abstracting Generalized Category Discovery for Modality Agnosticism"
In our supplementary mater…
GCDformer Optimization of GCD Latent Spaces
The GCDformer in OmniGCD is trained exclusively on synthetic data to optimize the GCD latent space for clustering. In Figure 3, we present qualitative examples illustrating OmniGCD's ability to enhance GCD latent spaces. These examples comprise randomly selected synthetic GCD latent spaces from GCDformer's tra…

Details regarding OmniGCD's Zero-shot setup
In this section, we provide additional details on the sizes of the labeled and unlabeled datasets used by OmniGCD for the main results in Table 4 of the main paper. As noted, at test time, OmniGCD's input comprises both the labeled and unlabeled subsets. In the zero-shot GCD setting, the labeled subset is constru…

OmniGCD Dimension Ablations
In the main paper, we chose 2D data for GCDformer training, as low dimensionality ensures tractability in generating training data. In this section, we present ablations on training with higher-dimensional data (specifically, 32, 64, and 128 dimensions). Due to the O(N²k) time complexity of t-SNE [20], where N is the number of …

Additional Vision Encoders Ablations
Here, we present results using two additional pretrained vision encoders: MobileNetV3 [19] and DINOv3 [45]. These results are not merely to demonstrate that OmniGCD works with additional encoders, as the main results with multi-modal encoders already establish this, but serve two specific purposes: (1) MobileNetV3…

Further analysis of OmniGCD Zero-shot GCD performance
In this section, we provide a more granular analysis of OmniGCD's performance, along with standard deviations for the main results in Table 4 of the main paper. These results are presented in Table 13. For the granular analysis, we compare the individual contributions of t-SNE dimensionality redu…

Additional Results for Vision Encoder Fine-tuning
Here, we provide additional results for Table 5 in the main paper, focusing on OmniGCD's performance with a fine-tuned DINOv1 [5] encoder. As noted in the main paper, this vision encoder was fine-tuned using GCD's [52] supervised and self-supervised contrastive training methods. The results in Table 12 …