arxiv: 2606.24094 · v1 · pith:HCVPUUPYnew · submitted 2026-06-23 · 💻 cs.CV

Universal Guideline-Driven Image Clustering via a Hybrid LLM Agent

Pith reviewed 2026-06-26 01:17 UTC · model grok-4.3

classification 💻 cs.CV
keywords image clustering · LLM agent · textual guidelines · generative concept proxy · minimum spanning tree traversal · universal framework · computer vision

The pith

A hybrid LLM agent uses textual guidelines and generative proxies to cluster images across any scenario without task-specific training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a universal framework for image clustering that accepts textual guidelines describing the desired grouping rules. It replaces separate specialized models with one agent that extracts concept proxies to embed images in a guideline-aware way and uses minimum spanning tree traversal to decide when to invoke LLM reasoning for hard cases. The method is shown to handle shifts from broad categories to fine-grained ones, from global to local features, and from balanced to long-tail class sizes. A reader would care because current practice requires designing or fine-tuning a new model for each new clustering goal; this approach claims to remove that requirement.

Core claim

The Guideline-Driven Image Clustering Agent is presented as the first framework to unify image clustering across fundamentally different tasks by ingesting textual guidelines. Generative Concept Proxy Modeling extracts concept proxies to produce guideline-aware embeddings without any task-specific training, and LLM Traversal based on a Minimum Spanning Tree invokes LLM reasoning only for complex semantic judgments. Together, these components are claimed to generalize from general to fine-grained categorization, from global to local criteria, and from balanced to long-tail distributions, while outperforming specialized methods.

What carries the argument

Generative Concept Proxy Modeling: a multimodal LLM extracts guideline-aware textual descriptions (concept proxies) from each image, and those descriptions are encoded into embeddings, so that a single embedding pipeline serves many different clustering rules.
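
A minimal sketch of the idea, following the Figure 2 caption (describe each image under the guideline, then embed the description). Everything here is a stand-in: `describe` is a hypothetical stub for the multimodal LLM call, `embed` is a toy bag-of-words encoder rather than the text encoder the paper uses, and the three playing cards are made-up data. The point is only that changing the guideline changes which images end up close together.

```python
import math
from collections import Counter


def describe(image: dict, guideline: str) -> str:
    """Hypothetical stand-in for a multimodal LLM call.

    Returns only the attribute the guideline asks about. A real system
    would prompt a vision-language model with the image and the guideline.
    """
    key = "suit" if "suit" in guideline else "rank"
    return f"{key} {image[key]}"


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding (assumption, not the paper's encoder)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up images, represented by their ground-truth attributes.
cards = [
    {"suit": "hearts", "rank": "ace"},
    {"suit": "hearts", "rank": "king"},
    {"suit": "spades", "rank": "ace"},
]


def similarity(i: int, j: int, guideline: str) -> float:
    """Similarity of two cards in the guideline-aware embedding space."""
    return cosine(
        embed(describe(cards[i], guideline)),
        embed(describe(cards[j], guideline)),
    )


by_suit = "group the cards by suit"
by_rank = "group the cards by rank"
```

Under `by_suit` the two hearts are closer to each other than to the spade; under `by_rank` the two aces are. No component is retrained between the two guidelines, which is the property the paper's claim depends on.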

If this is right

  • Any new clustering goal can be addressed by writing a fresh textual guideline rather than collecting labels or retraining a model.
  • The same frozen components remain usable, with no retraining, when the balance of classes changes or when criteria shift from global appearance to local detail.
  • LLM calls occur only on selected edges of the minimum spanning tree, limiting expensive reasoning to the hardest decisions.
  • The framework can replace multiple narrow clustering pipelines in applications that encounter changing user-defined grouping rules.
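
The point about LLM calls on selected MST edges can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: the `low`/`high` distance band that decides which edges are "hard", the `ask_llm` callback, and the toy centroids are all hypothetical. The sketch builds a minimum spanning tree over initial cluster centroids with Kruskal's algorithm, merges or separates confidently on short and long edges, and consults the (stubbed) LLM only on edges inside the ambiguous band.

```python
from itertools import combinations


class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[ra] = rb
        return True


def mst_edges(points):
    """Kruskal's algorithm on the complete Euclidean graph over `points`."""
    candidates = sorted(
        (sum((a - b) ** 2 for a, b in zip(points[i], points[j])) ** 0.5, i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    uf = UnionFind(len(points))
    return [(d, i, j) for d, i, j in candidates if uf.union(i, j)]


def merge_clusters(points, ask_llm, low, high):
    """Merge along MST edges, calling ask_llm only when low <= dist <= high."""
    uf = UnionFind(len(points))
    calls = 0
    for dist, i, j in mst_edges(points):
        if dist < low:
            same = True           # confidently close: merge without the LLM
        elif dist > high:
            same = False          # confidently far: keep apart without the LLM
        else:
            calls += 1
            same = ask_llm(i, j)  # ambiguous edge: ask the stubbed LLM
        if same:
            uf.union(i, j)
    return [uf.find(k) for k in range(len(points))], calls


# Toy centroids of four initial clusters (made-up data).
centroids = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (5.0, 0.0)]
labels, calls = merge_clusters(
    centroids, ask_llm=lambda i, j: True, low=0.5, high=2.0
)
```

With these toy values the tree has three edges and only one falls in the ambiguous band, so a single LLM query replaces the six pairwise comparisons an exhaustive scheme would need.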

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the proxy extraction step proves robust, similar guideline-driven proxies might be tested on other embedding-based tasks such as retrieval or few-shot classification.
  • The selective use of LLM traversal suggests a broader pattern for hybrid systems that reserve expensive reasoning for graph edges where cheaper distances disagree with semantic cues.
  • Long-term deployment would require checking whether repeated guideline changes gradually degrade the fixed embedding space.

Load-bearing premise

That concept proxies generated from guidelines will produce embeddings effective for fundamentally different clustering rules without any task-specific training or fine-tuning.

What would settle it

Run the method on a new image dataset whose clustering rule (for example, grouping by subtle lighting direction) is described only in a guideline never used in its development and measure whether accuracy falls below a task-specific baseline trained directly on that rule.
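
To make such a comparison concrete, both the method and the task-specific baseline would need to be scored against the same ground-truth grouping. The sketch below uses normalized mutual information (NMI), a standard clustering metric, with arithmetic-mean normalization; the choice of metric and normalization is an assumption here, and the label lists in the usage note are made up.

```python
import math
from collections import Counter


def nmi(labels_a, labels_b):
    """Normalized mutual information between two flat clusterings."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    mutual_info = sum(
        (c / n) * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    h_a, h_b = entropy(count_a), entropy(count_b)
    if h_a == 0.0 and h_b == 0.0:
        return 1.0  # both partitions are a single cluster
    return mutual_info / ((h_a + h_b) / 2)
```

For example, `nmi([0, 0, 1, 1], [1, 1, 0, 0])` is 1 up to floating-point error, because the partitions agree up to relabeling, while `nmi([0, 0, 1, 1], [0, 1, 0, 1])` is 0, because the two partitions are independent. The test proposed above would count against the paper's claim if the agent's score on the unseen guideline fell below the baseline's.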

Figures

Figures reproduced from arXiv: 2606.24094 by Feng Jiang, Hehuan Ma, Junzhou Huang, Karim Bouyarmane, Kushal Kumar, Lucas Goncalves, Rob Barton, Vidit Bansal, Wenliang Zhong, Yuzhi Guo.

Figure 1: Overview of our Guideline-Driven Clustering Agent. We introduce the first universal clustering framework that handles diverse image clustering scenarios through textual guidelines, spanning from general to fine-grained tasks, from global to local criteria, and from balanced to long-tail distributions. Our training-free hybrid agent flexibly adapts across these diverse clustering requirements.
Figure 2: Overview of the Clustering Framework. Bottom left: Generative Concept Proxy Modeling extracts guideline-aware textual descriptions from images via multimodal LLM and encodes them into embeddings for efficient clustering; Right: MST-based LLM Traversal refines initial clusters by constructing a Minimum Spanning Tree and selectively querying the LLM for semantic merging decisions. Top left: For scenarios wi…
Figure 3: Examples from ABO-LC. Each item is represented by …
Figure 4: A disentanglement test case of cards grouped by suits using HDBSCAN. While using GME-Qwen with images only, the number criteria dominates the suit criteria because of the layout.
Figure 5: Comparison of LLM calls used for clustering and their …
Figure 6: Merging rates over four iterations. The running merging …
Figure 7: Raw sample examples from the original ABO dataset.
Figure 8: Left: the forwarding process of E5-Mistral. The input contains the instruction, the GCPM caption, and an EOS token, which are concatenated and fed to the LLM. The output EOS token embedding is used for representation. Right: for GME-QWen, the input includes the original image, which are processed to visual tokens by a visual encoder and a projector.
Figure 9: Guideline Generation Process for the Stanford Dogs dataset
Original abstract

Unifying image clustering across different clustering scenarios remains challenging due to fundamental gaps among tasks. We introduce a Guideline-Driven Image Clustering Agent, the first universal framework that bridges these gaps through textual guidelines. To incorporate complex guidelines without task-specific training, we propose Generative Concept Proxy Modeling, which generates guideline-aware embeddings via concept proxy extraction. For scenarios requiring automatic cluster discovery, we introduce LLM Traversal based on Minimum Spanning Tree that selectively applies LLM reasoning for complex semantic judgments. Our method generalizes across diverse clustering scenarios spanning from general to fine-grained categorization, from global to local criteria, and from balanced to long-tail distributions. Our framework consistently outperforms specialized methods across diverse clustering tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces a Guideline-Driven Image Clustering Agent as the first universal framework for image clustering that incorporates textual guidelines. It proposes Generative Concept Proxy Modeling to produce guideline-aware embeddings without task-specific training or fine-tuning, and LLM Traversal based on Minimum Spanning Tree for automatic cluster discovery in scenarios requiring semantic judgments. The method is claimed to generalize across scenarios spanning general to fine-grained categorization, global to local criteria, and balanced to long-tail distributions, with consistent outperformance over specialized methods.

Significance. If the empirical results hold, the work could be significant for offering a training-free, guideline-driven unification of image clustering tasks that have previously required specialized models. The hybrid LLM agent design, particularly the concept proxy extraction and MST-based traversal, represents a novel direction. However, the complete absence of quantitative results, baselines, datasets, or experimental protocol in the abstract prevents any assessment of whether the claimed generality across fundamentally different criteria is achieved.

major comments (2)
  1. [Abstract] The central claim that the framework 'consistently outperforms specialized methods across diverse clustering tasks' is presented with no quantitative results, baselines, experimental protocol, or tables. This claim is load-bearing for the paper's contribution, so the universality assertion is currently unverifiable.
  2. [Method] The description of Generative Concept Proxy Modeling supplies no formal guarantee, parameter-free derivation, or ablation showing that LLM-generated concept proxies preserve the necessary distinctions for local vs. global or balanced vs. long-tail criteria without task-specific adaptation. This bears directly on the skeptic's concern that the proxy mechanism may fail to transfer across guideline types.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by including at least one key performance metric or dataset name to support the outperformance claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The central claim that the framework 'consistently outperforms specialized methods across diverse clustering tasks' is presented with no quantitative results, baselines, experimental protocol, or tables. This claim is load-bearing for the paper's contribution, so the universality assertion is currently unverifiable.

    Authors: We agree that the abstract's strong claim would benefit from additional context to improve verifiability. In the revised version we will expand the abstract to include a concise statement of the experimental protocol (e.g., the range of datasets and guideline types evaluated) together with a high-level summary of the quantitative gains (average improvement margins over the strongest specialized baselines). The full tables, baselines, datasets, and protocol remain in the Experiments section; the abstract revision will simply make the universality claim traceable without exceeding typical length constraints. revision: yes

  2. Referee: [Method] The description of Generative Concept Proxy Modeling supplies no formal guarantee, parameter-free derivation, or ablation showing that LLM-generated concept proxies preserve the necessary distinctions for local vs. global or balanced vs. long-tail criteria without task-specific adaptation. This bears directly on the skeptic's concern that the proxy mechanism may fail to transfer across guideline types.

    Authors: Generative Concept Proxy Modeling is deliberately parameter-free by construction: it extracts concept proxies directly from a frozen LLM, with no gradient updates or task-specific fine-tuning. While a closed-form theoretical guarantee is difficult to obtain for black-box LLMs, the method's transferability is supported by the empirical results across the paper's diverse guideline regimes. To address the concern directly, we will add a targeted ablation study in the revision that isolates performance on local-versus-global and balanced-versus-long-tail guideline subsets, thereby providing concrete evidence of cross-criterion robustness. revision: partial

Circularity Check

0 steps flagged

No circularity detected in derivation chain

The paper introduces an empirical framework (Guideline-Driven Image Clustering Agent with Generative Concept Proxy Modeling and LLM Traversal) whose central claims rest on experimental outperformance across tasks rather than any mathematical derivation, equation, or parameter fit. No equations, self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text. The generality claim is presented as a design property of the method, not derived from prior self-work or ansatz smuggling. The result is therefore self-contained against external benchmarks and receives the default non-finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities can be extracted.

Reference graph

Works this paper leans on

99 extracted references · 7 linked inside Pith

  1. [1]

    The claude 3 model family: Opus, sonnet, haiku

    Anthropic. The claude 3 model family: Opus, sonnet, haiku. 2, 3

  2. [2]

    Entity-based cross- document coreferencing using the vector space model

    Amit Bagga and Breck Baldwin. Entity-based cross- document coreferencing using the vector space model. In COLING 1998 Volume 1: The 17th international conference on computational linguistics, 1998. 7

  3. [3]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 2, 3, 4, 5

  4. [4]

    Onegan: Simultaneous unsuper- vised learning of conditional image generation, foreground segmentation, and fine-grained clustering

    Yaniv Benny and Lior Wolf. Onegan: Simultaneous unsuper- vised learning of conditional image generation, foreground segmentation, and fine-grained clustering. InEuropean Con- ference on Computer Vision, pages 514–530. Springer, 2020. 2, 5, 7, 10

  5. [5]

    Ergun, Chen Wang, and Sam- son Zhou

    Vladimir Braverman, Jon C. Ergun, Chen Wang, and Sam- son Zhou. Learning-augmented hierarchical clustering. In Forty-second International Conference on Machine Learn- ing, 2025. 2

  6. [6]

    Density-based clustering based on hierarchical density esti- mates

    Ricardo JGB Campello, Davoud Moulavi, and J ¨org Sander. Density-based clustering based on hierarchical density esti- mates. InPacific-Asia conference on knowledge discovery and data mining, pages 160–172. Springer, 2013. 2, 4

  7. [7]

    Global-local dirichlet processes for clustering grouped data in the presence of group-specific idiosyncratic variables

    Arhit Chakrabarti, Yang Ni, Debdeep Pati, and Bani Mallick. Global-local dirichlet processes for clustering grouped data in the presence of group-specific idiosyncratic variables. In Forty-second International Conference on Machine Learn- ing, 2025. 2

  8. [8]

    Incremental clustering and dynamic information retrieval

    Moses Charikar, Chandra Chekuri, Tom ´as Feder, and Rajeev Motwani. Incremental clustering and dynamic information retrieval. InProceedings of the twenty-ninth annual ACM symposium on Theory of computing, pages 626–635, 1997. 5

  9. [9]

    A simple framework for contrastive learning of visual representations

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Ge- offrey Hinton. A simple framework for contrastive learning of visual representations. InInternational conference on ma- chine learning, pages 1597–1607. PmLR, 2020. 5, 7, 10

  10. [10]

    Infogan: Interpretable rep- resentation learning by information maximizing generative adversarial nets.Advances in neural information processing systems, 29, 2016

    Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable rep- resentation learning by information maximizing generative adversarial nets.Advances in neural information processing systems, 29, 2016. 5, 7, 10

  11. [11]

    Agent-centric personalized mul- tiple clustering with multi-modal llms.arXiv preprint arXiv:2503.22241, 2025

    Ziye Chen, Yiqun Duan, Riheng Zhu, Zhenbang Sun, and Mingming Gong. Agent-centric personalized mul- tiple clustering with multi-modal llms.arXiv preprint arXiv:2503.22241, 2025. 2, 7

  12. [12]

    Agent-centric personalized multiple clus- tering with multi-modal llms, 2025

    Ziye Chen, Yiqun Duan, Riheng Zhu, Zhenbang Sun, and Mingming Gong. Agent-centric personalized multiple clus- tering with multi-modal llms, 2025. 8

  13. [13]

    An analysis of single-layer networks in unsupervised feature learning

    Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 215–223. JMLR Workshop and Conference Proceedings, 2011. 5, 4

  14. [14]

    Abo: Dataset and benchmarks for real-world 3d object un- derstanding

    Jasmine Collins, Shubham Goel, Kenan Deng, Achlesh- war Luthra, Leon Xu, Erhan Gundogdu, Xi Zhang, Tomas F Yago Vicente, Thomas Dideriksen, Himanshu Arora, et al. Abo: Dataset and benchmarks for real-world 3d object un- derstanding. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21126– 21136, 2022. 5, 3

  15. [15]

    Nearest neighbor matching for deep clustering

    Zhiyuan Dang, Cheng Deng, Xu Yang, Kun Wei, and Heng Huang. Nearest neighbor matching for deep clustering. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 13693–13702, 2021. 5, 6, 9

  16. [16]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009. 5, 4

  17. [17]

    Information-theoretic generative clustering of documents, 2024

    Xin Du and Kumiko Tanaka-Ishii. Information-theoretic generative clustering of documents, 2024. 2, 9

  18. [18]

    A density-based algorithm for discovering clusters in large spatial databases with noise

    Martin Ester, Hans-Peter Kriegel, J ¨org Sander, Xiaowei Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. Inkdd, pages 226–231,

  19. [19]

    Mark: Multi-agent collabora- tion with ranking guidance for text-attributed graph cluster- ing

    Yiwei Fu, Yuxing Zhang, Chunchun Chen, JianwenMa Jian- wenMa, Quan Yuan, Rong-Cheng Tu, Xinli Huang, Wei Ye, Xiao Luo, and Minghua Deng. Mark: Multi-agent collabora- tion with ranking guidance for text-attributed graph cluster- ing. InFindings of the Association for Computational Lin- guistics: ACL 2025, pages 6057–6072, 2025. 2, 4, 8

  20. [20]

    Personalized clustering via targeted representation learning

    Xiwen Geng, Suyun Zhao, Yixin Yu, Borui Peng, Pan Du, Hong Chen, Cuiping Li, and Mengdie Wang. Personalized clustering via targeted representation learning. InProceed- ings of the AAAI Conference on Artificial Intelligence, pages 16790–16798, 2025. 2, 3

  21. [21]

    Cards image dataset-classification

    gpiosenka. Cards image dataset-classification. Kaggle dataset. Accessed: 2025-09-07. 3

  22. [22]

    Bootstrap your own latent-a new approach to self-supervised learning.Advances in neural information processing systems, 33:21271–21284, 2020

    Jean-Bastien Grill, Florian Strub, Florent Altch ´e, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Ghesh- laghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning.Advances in neural information processing systems, 33:21271–21284, 2020. 5, 6, 9

  23. [23]

    Improving image cluster- ing with multiple pretrained cnn feature extractors.arXiv preprint arXiv:1807.07760, 2018

    Joris Gu ´erin and Byron Boots. Improving image cluster- ing with multiple pretrained cnn feature extractors.arXiv preprint arXiv:1807.07760, 2018. 5, 7, 10

  24. [24]

    Task-aware clustering for prompting vision-language models

    Fusheng Hao, Fengxiang He, Fuxiang Wu, Tichao Wang, Chengqun Song, and Jun Cheng. Task-aware clustering for prompting vision-language models. InProceedings of the 9 Computer Vision and Pattern Recognition Conference, pages 14745–14755, 2025. 2

  25. [25]

    Algorithm as 136: A k-means clustering algorithm.Journal of the royal sta- tistical society

    John A Hartigan and Manchek A Wong. Algorithm as 136: A k-means clustering algorithm.Journal of the royal sta- tistical society. series c (applied statistics), 28(1):100–108,

  26. [26]

    Momentum contrast for unsupervised visual rep- resentation learning

    Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual rep- resentation learning. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 9729–9738, 2020. 5, 7, 10

  27. [27]

    Finding multiple stable clusterings.Knowledge and Infor- mation Systems, 51(3):991–1021, 2017

    Juhua Hu, Qi Qian, Jian Pei, Rong Jin, and Shenghuo Zhu. Finding multiple stable clusterings.Knowledge and Infor- mation Systems, 51(3):991–1021, 2017. 5, 7, 10

  28. [28]

    Deep se- mantic clustering by partition confidence maximisation

    Jiabo Huang, Shaogang Gong, and Xiatian Zhu. Deep se- mantic clustering by partition confidence maximisation. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 8849–8858, 2020. 5, 6, 9

  29. [29]

    Learning representation for clustering via prototype scattering and positive sampling.IEEE Transactions on Pat- tern Analysis and Machine Intelligence, 45(6):7509–7524,

    Zhizhong Huang, Jie Chen, Junping Zhang, and Hongming Shan. Learning representation for clustering via prototype scattering and positive sampling.IEEE Transactions on Pat- tern Analysis and Machine Intelligence, 45(6):7509–7524,

  30. [30]

    Invariant in- formation clustering for unsupervised image classification and segmentation

    Xu Ji, Joao F Henriques, and Andrea Vedaldi. Invariant in- formation clustering for unsupervised image classification and segmentation. InProceedings of the IEEE/CVF inter- national conference on computer vision, pages 9865–9874,

  31. [31]

    E5-v: Universal embeddings with multimodal large language models.arXiv preprint arXiv:2407.12580, 2024

    Ting Jiang, Minghui Song, Zihan Zhang, Haizhen Huang, Weiwei Deng, Feng Sun, Qi Zhang, Deqing Wang, and Fuzhen Zhuang. E5-v: Universal embeddings with multimodal large language models.arXiv preprint arXiv:2407.12580, 2024. 3

  32. [32]

    Vlm2vec: Training vision-language models for massive multimodal embedding tasks.arXiv preprint arXiv:2410.05160, 2024

    Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, and Wenhu Chen. Vlm2vec: Training vision-language models for massive multimodal embedding tasks.arXiv preprint arXiv:2410.05160, 2024. 3

  33. [33]

    Zerodl: Zero-shot distribution learning for text clustering via large language models.arXiv preprint arXiv:2406.13342, 2024

    Hwiyeol Jo, Hyunwoo Lee, Kang Min Yoo, and Tai- woo Park. Zerodl: Zero-shot distribution learning for text clustering via large language models.arXiv preprint arXiv:2406.13342, 2024. 2, 4, 8

  34. [34]

    Towards better-than-2 approximation for constrained corre- lation clustering

    Andreas Kalavas, Evangelos Kipouridis, and Nithin Varma. Towards better-than-2 approximation for constrained corre- lation clustering. InForty-second International Conference on Machine Learning, 2025. 2

  35. [35]

    Novel dataset for fine-grained image categorization

    Aditya Khosla, Nityananda Jayadevaprakash, Bangpeng Yao, and Li Fei-Fei. Novel dataset for fine-grained image categorization. InFirst Workshop on Fine-Grained Visual Categorization, IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, 2011. 3, 5

  36. [36]

    Contrastive fine-grained class clustering via generative adversarial networks.arXiv preprint arXiv:2112.14971, 2021

    Yunji Kim and Jung-Woo Ha. Contrastive fine-grained class clustering via generative adversarial networks.arXiv preprint arXiv:2112.14971, 2021. 2, 5, 7, 10

  37. [37]

    3d object representations for fine-grained categorization

    Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In Proceedings of the IEEE international conference on com- puter vision workshops, pages 554–561, 2013. 5

  38. [38]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. 5, 4

  39. [39]

    Image clustering con- ditioned on text criteria.arXiv preprint arXiv:2310.18297,

    Sehyun Kwon, Jaeseung Park, Minkyu Kim, Jaewoong Cho, Ernest K Ryu, and Kangwook Lee. Image clustering con- ditioned on text criteria.arXiv preprint arXiv:2310.18297,

  40. [40]

    Dual mutual infor- mation constraints for discriminative clustering

    Hongyu Li, Lefei Zhang, and Kehua Su. Dual mutual infor- mation constraints for discriminative clustering. InProceed- ings of the AAAI conference on artificial intelligence, pages 8571–8579, 2023. 6, 9

  41. [41]

    Prototypical contrastive learning of unsupervised representa- tions.arXiv preprint arXiv:2005.04966, 2020

    Junnan Li, Pan Zhou, Caiming Xiong, and Steven CH Hoi. Prototypical contrastive learning of unsupervised representa- tions.arXiv preprint arXiv:2005.04966, 2020. 5, 6, 9

  42. [42]

    Mixnmatch: Multifactor disentanglement and encoding for conditional image generation

    Yuheng Li, Krishna Kumar Singh, Utkarsh Ojha, and Yong Jae Lee. Mixnmatch: Multifactor disentanglement and encoding for conditional image generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8039–8048, 2020. 5, 7, 10

  43. [43]

    Contrastive clustering

    Yunfan Li, Peng Hu, Zitao Liu, Dezhong Peng, Joey Tianyi Zhou, and Xi Peng. Contrastive clustering. InProceedings of the AAAI conference on artificial intelligence, pages 8547– 8555, 2021. 5, 6, 9

  44. [44]

    Twin contrastive learning for online clustering.International Journal of Computer Vision, 130 (9):2205–2221, 2022

    Yunfan Li, Mouxing Yang, Dezhong Peng, Taihao Li, Jiantao Huang, and Xi Peng. Twin contrastive learning for online clustering.International Journal of Computer Vision, 130 (9):2205–2221, 2022. 5, 6, 9

  45. [45]

    Image clustering with external guidance

    Yunfan Li, Peng Hu, Dezhong Peng, Jiancheng Lv, Jianping Fan, and Xi Peng. Image clustering with external guidance. arXiv preprint arXiv:2310.11989, 2023. 2, 6, 7

  46. [46]

    Learning from sample stability for deep clustering

    Zhixin Li, Yuheng Jia, Junhui Hou, et al. Learning from sample stability for deep clustering. InForty-second Inter- national Conference on Machine Learning. 1, 2, 3, 5, 6

  47. [47]

    Spill: Domain-adaptive intent clustering based on selection and pooling with large language models.arXiv preprint arXiv:2503.15351, 2025

    I-Fan Lin, Faegheh Hasibi, and Suzan Verberne. Spill: Domain-adaptive intent clustering based on selection and pooling with large language models.arXiv preprint arXiv:2503.15351, 2025. 2, 4, 8

  48. [48]

    Mm-embed: Universal multimodal retrieval with multimodal llms.arXiv preprint arXiv:2411.02571, 2024

    Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, Jimmy Lin, Bryan Catanzaro, and Wei Ping. Mm-embed: Universal multimodal retrieval with multimodal llms.arXiv preprint arXiv:2411.02571, 2024. 3

  49. [49]

    Deepseek-v3 technical report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024. 7

  50. [50]

    Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023. 8

  51. [51]

    Interactive deep clustering via value mining

    Honglin Liu, Peng Hu, Changqing Zhang, Yunfan Li, and Xi Peng. Interactive deep clustering via value mining. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. 1, 2, 5, 6

  52. [52]

    Conditional representation learning for customized tasks

    Honglin Liu, Chao Sun, Peng Hu, Yunfan Li, and Xi Peng. Conditional representation learning for customized tasks. Advances in Neural Information Processing Systems, 38: 31706–31737, 2026. 5, 7

  53. [53]

    Llm-guided 10 semantic-aware clustering for topic modeling

    Jianghan Liu, Ziyu Shang, Wenjun Ke, Peng Wang, Zhizhao Luo, Jiajun Liu, Guozheng Li, and Yining Li. Llm-guided 10 semantic-aware clustering for topic modeling. InProceed- ings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 18420–18435, 2025. 2, 4, 8

  54. [54]

    Organizing unstructured im- age collections using natural language.arXiv preprint arXiv:2410.05217, 2024

    Mingxuan Liu, Zhun Zhong, Jun Li, Gianni Franchi, Sub- hankar Roy, and Elisa Ricci. Organizing unstructured im- age collections using natural language.arXiv preprint arXiv:2410.05217, 2024. 3

  55. [55]

    Lamra: Large multimodal model as your advanced retrieval assistant

    Yikun Liu, Yajie Zhang, Jiayin Cai, Xiaolong Jiang, Yao Hu, Jiangchao Yao, Yanfeng Wang, and Weidi Xie. Lamra: Large multimodal model as your advanced retrieval assistant. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 4015–4025, 2025. 3

  56. [56]

    Llm as dataset ana- lyst: Subpopulation structure discovery with large language model

    Yulin Luo, Ruichuan An, Bocheng Zou, Yiming Tang, Ji- aming Liu, and Shanghang Zhang. Llm as dataset ana- lyst: Subpopulation structure discovery with large language model. InEuropean Conference on Computer Vision, pages 235–252. Springer, 2024. 3, 5

  57. [57]

    Multivariate observations

    J MacQueen. Multivariate observations. InProceedings ofthe 5th Berkeley Symposium on Mathematical Statisticsand Probability, pages 281–297, 1967. 1

  58. [58]

    Divclust: Controlling diversity in deep clus- tering

    Ioannis Maniadis Metaxas, Georgios Tzimiropoulos, and Ioannis Patras. Divclust: Controlling diversity in deep clus- tering. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 3418–3428,

  59. [59]

    Deep embedded non- redundant clustering

    Lukas Miklautz, Dominik Mautz, Muzaffer Can Altinigneli, Christian B ¨ohm, and Claudia Plant. Deep embedded non- redundant clustering. InProceedings of the AAAI conference on artificial intelligence, pages 5174–5181, 2020. 7, 10

  60. [60]

    Segment any cell: A sam-based auto- prompting fine-tuning framework for nuclei segmentation

    Saiyang Na, Yuzhi Guo, Feng Jiang, Hehuan Ma, Jean Gao, and Junzhou Huang. Segment any cell: A sam-based auto- prompting fine-tuning framework for nuclei segmentation. IEEE Transactions on Neural Networks and Learning Sys- tems, 2025. 2

  61. [61]

    Forensic self-descriptions are all you need for zero-shot detection, open-set source attribution, and clustering of ai-generated images

    Tai D Nguyen, Aref Azizpour, and Matthew C Stamm. Forensic self-descriptions are all you need for zero-shot detection, open-set source attribution, and clustering of ai-generated images. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 3040–3050,

  62. [62]

    Automated flower classification over a large number of classes

    Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics and Image Processing, 2008. 3, 5

  63. [63]

    Spice: Semantic pseudo-labeling for image clustering

    Chuang Niu, Hongming Shan, and Ge Wang. Spice: Semantic pseudo-labeling for image clustering. IEEE Transactions on Image Processing, 31:7264–7278, 2022. 5, 6, 9

  64. [64]

    Rapid selection and ordering of in-context demonstrations via prompt embedding clustering

    Kha Pham, Hung Le, Man Ngo, and Truyen Tran. Rapid selection and ordering of in-context demonstrations via prompt embedding clustering. In The Thirteenth International Conference on Learning Representations, 2025. 2

  65. [65]

    Control-oriented clustering of visual latent representation

    Han Qi, Haocheng Yin, and Heng Yang. Control-oriented clustering of visual latent representation. arXiv preprint arXiv:2410.05063, 2024. 2

  66. [66]

    Stable cluster discrimination for deep clustering

    Qi Qian. Stable cluster discrimination for deep clustering. In Proceedings of the IEEE/CVF international conference on computer vision, pages 16645–16654, 2023. 1, 5, 6, 7, 9, 10

  67. [67]

    A diversified attention model for interpretable multiple clusterings

    Liangrui Ren, Guoxian Yu, Jun Wang, Lei Liu, Carlotta Domeniconi, and Xiangliang Zhang. A diversified attention model for interpretable multiple clusterings. IEEE Transactions on Knowledge and Data Engineering, 35(9):8852–8864, 2022. 5, 7, 10

  68. [68]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 5, 7, 10

  69. [69]

    You never cluster alone

    Yuming Shen, Ziyi Shen, Menghan Wang, Jie Qin, Philip Torr, and Ling Shao. You never cluster alone. Advances in Neural Information Processing Systems, 34:27734–27746,

  70. [70]

    Finegan: Unsupervised hierarchical disentanglement for fine-grained object generation and discovery

    Krishna Kumar Singh, Utkarsh Ojha, and Yong Jae Lee. Finegan: Unsupervised hierarchical disentanglement for fine-grained object generation and discovery. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6490–6499, 2019. 5, 7, 10

  71. [71]

    Fixmatch: Simplifying semi-supervised learning with consistency and confidence

    Kihyuk Sohn, David Berthelot, Nicholas Carlini, Zizhao Zhang, Han Zhang, Colin A Raffel, Ekin Dogus Cubuk, Alexey Kurakin, and Chun-Liang Li. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. Advances in neural information processing systems, 33:596–608, 2020. 5, 6, 9

  72. [72]

    One embedder, any task: Instruction-finetuned text embeddings

    Hongjin Su, Weijia Shi, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah A Smith, Luke Zettlemoyer, and Tao Yu. One embedder, any task: Instruction-finetuned text embeddings. arXiv preprint arXiv:2212.09741, 2022. 2, 5, 7

  73. [73]

    Clustering-friendly representation learning via instance discrimination and feature decorrelation

    Yaling Tao, Kentaro Takagi, and Kouta Nakata. Clustering-friendly representation learning via instance discrimination and feature decorrelation. arXiv preprint arXiv:2106.00131,

  74. [74]

    Mice: Mixture of contrastive experts for unsupervised image clustering

    Tsung Wei Tsai, Chongxuan Li, and Jun Zhu. Mice: Mixture of contrastive experts for unsupervised image clustering. In International conference on learning representations, 2020. 6, 9

  75. [75]

    Scan: Learning to classify images without labels

    Wouter Van Gansbeke, Simon Vandenhende, Stamatios Georgoulis, Marc Proesmans, and Luc Van Gool. Scan: Learning to classify images without labels. In European conference on computer vision, pages 268–285. Springer, 2020. 5, 6, 7, 9, 10

  76. [76]

    Large language models enable few-shot clustering

    Vijay Viswanathan, Kiril Gashteovski, Carolin Lawrence, Tongshuang Wu, and Graham Neubig. Large language models enable few-shot clustering, 2023. 2, 4, 8

  77. [77]

    Constrained k-means clustering with background knowledge

    Kiri Wagstaff, Claire Cardie, Seth Rogers, Stefan Schrödl, et al. Constrained k-means clustering with background knowledge. In ICML, pages 577–584, 2001. 5, 6, 9

  78. [78]

    C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The caltech-ucsd birds-200-2011 dataset. Technical report,

  79. [79]

    Bridge the gap between supervised and unsupervised learning for fine-grained classification

    Jiabao Wang, Yang Li, Xiu-Shen Wei, Hang Li, Zhuang Miao, and Rui Zhang. Bridge the gap between supervised and unsupervised learning for fine-grained classification. Information Sciences, 649:119653, 2023. 5, 7, 10

  80. [80]

    Improving text embeddings with large language models

    Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. Improving text embeddings with large language models. arXiv preprint arXiv:2401.00368, 2023. 2, 3, 5, 7
