GLEAN: Active Generalized Category Discovery with Diverse LLM Feedback
Pith reviewed 2026-05-23 01:50 UTC · model grok-4.3
The pith
GLEAN improves generalized category discovery by using three types of LLM feedback to correct errors on confusing instances and generate semantic descriptions for new clusters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GLEAN is a unified framework that actively learns from diverse and collaborative LLM feedback to improve instance-level contrastive features, generate category descriptions, and align uncertain instances with LLM-selected descriptions, thereby recognizing both known and novel categories more effectively than prior GCD methods.
What carries the argument
The GLEAN framework that integrates three specific LLM feedback signals for active correction and semantic labeling in generalized category discovery.
If this is right
- LLM feedback produces improved instance-level contrastive features
- Category descriptions generated by LLMs capture semantic meanings of clusters
- Uncertain instances are aligned with LLM-selected category descriptions
- Performance exceeds state-of-the-art models across datasets, metrics, and supervision settings
Where Pith is reading between the lines
- The method could lower the cost of deploying open-world systems in domains where human labeling is impractical
- Similar LLM feedback loops might transfer to other clustering or semi-supervised tasks that need interpretable outputs
- Performance gains may grow if future LLMs deliver higher-quality or less biased feedback without extra prompting
Load-bearing premise
LLM feedback is assumed to be sufficiently accurate, diverse, and free of systematic bias to reliably correct errors on confusing instances and produce semantically meaningful category descriptions.
What would settle it
A controlled ablation in which removing the LLM feedback components yields no performance gain over the non-LLM baseline, or human raters judge the generated category descriptions as inaccurate or unhelpful.
Original abstract
Generalized Category Discovery (GCD) is a practical and challenging open-world task that aims to recognize both known and novel categories in unlabeled data using limited labeled data from known categories. Due to the lack of supervision, previous GCD methods face significant challenges, such as difficulty in rectifying errors for confusing instances, and inability to effectively uncover and leverage the semantic meanings of discovered clusters. Therefore, additional annotations are usually required for real-world applicability. However, human annotation is extremely costly and inefficient. To address these issues, we propose GLEAN, a unified framework for generalized category discovery that actively learns from diverse and collaborative LLM feedback. Our approach leverages three different types of LLM feedback to: (1) improve instance-level contrastive features, (2) generate category descriptions, and (3) align uncertain instances with LLM-selected category descriptions. Extensive experiments demonstrate the superior performance of GLEAN over state-of-the-art models across diverse datasets, metrics, and supervision settings. Our code is available at https://github.com/amazon-science/Glean.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes GLEAN, a unified active learning framework for Generalized Category Discovery (GCD) that incorporates three distinct types of feedback from large language models: (1) to refine instance-level contrastive features, (2) to generate semantic category descriptions, and (3) to align uncertain instances with LLM-selected descriptions. The work claims that this approach addresses limitations of prior GCD methods in error rectification and semantic leverage, and reports superior performance over state-of-the-art baselines across multiple datasets, metrics, and supervision regimes.
Significance. If the reported gains prove robust and the LLM feedback mechanisms are shown to be reliable, the framework could meaningfully advance practical open-world recognition by reducing dependence on expensive human annotations. The explicit use of diverse, collaborative LLM signals to target specific GCD pain points (confusing instances and cluster semantics) represents a timely integration of LLM capabilities into semi-supervised discovery pipelines.
major comments (2)
- [Abstract (motivation paragraph) and §3 (Methods)] Abstract and method description: the central performance claims rest on the premise that LLM feedback is sufficiently accurate, diverse, and unbiased to correct errors on confusing instances and produce meaningful category descriptions, yet no quantification of LLM error rates, comparison to human annotations, or ablation under controlled feedback noise is provided. This assumption is load-bearing for attributing gains to the proposed framework rather than to LLM quality.
- [§4 (Experiments)] Experimental section: the claim of 'superior performance ... across diverse datasets, metrics, and supervision settings' cannot be evaluated for robustness without details on baselines, ablation studies, statistical significance, or sensitivity to LLM choice and prompt variations; the abstract alone provides no such evidence.
minor comments (1)
- The GitHub link is provided but no reproducibility checklist or hyperparameter details appear in the abstract; these should be expanded in the camera-ready version.
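The "ablation under controlled feedback noise" requested in the first major comment could be set up roughly as follows. This is a minimal sketch, not the authors' protocol. It assumes, as a simplification, that each piece of LLM feedback can be reduced to a category index per queried instance. The names `corrupt_feedback` and `observed_error_rate` are hypothetical, and the labels are synthetic.

```python
import random

def corrupt_feedback(labels, num_classes, noise_rate, seed=0):
    """Simulate unreliable feedback: with probability `noise_rate`, replace a
    label with a different class drawn uniformly at random.
    Assumption: feedback is a single category index per instance."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < noise_rate:
            alternatives = [c for c in range(num_classes) if c != y]
            noisy.append(rng.choice(alternatives))
        else:
            noisy.append(y)
    return noisy

def observed_error_rate(clean, noisy):
    """Fraction of positions where the noisy feedback disagrees with the clean one."""
    return sum(c != n for c, n in zip(clean, noisy)) / len(clean)

# Synthetic stand-in for feedback on 1000 queried instances over 5 categories.
clean = [i % 5 for i in range(1000)]
for rate in (0.0, 0.1, 0.3):
    noisy = corrupt_feedback(clean, num_classes=5, noise_rate=rate)
    print(rate, observed_error_rate(clean, noisy))
```

Training once per noise rate on the corrupted feedback and plotting the resulting accuracy against that rate would show how quickly the reported gains degrade as feedback quality drops.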
Simulated Author's Rebuttal
We thank the referee for the constructive comments and recognition of the timely aspects of integrating diverse LLM signals into GCD. We address each major comment below with clarifications and commit to revisions that directly target the raised concerns on evidence for LLM reliability and experimental robustness.
Point-by-point responses
- Referee: [Abstract (motivation paragraph) and §3 (Methods)] Abstract and method description: the central performance claims rest on the premise that LLM feedback is sufficiently accurate, diverse, and unbiased to correct errors on confusing instances and produce meaningful category descriptions, yet no quantification of LLM error rates, comparison to human annotations, or ablation under controlled feedback noise is provided. This assumption is load-bearing for attributing gains to the proposed framework rather than to LLM quality.
  Authors: We agree the manuscript does not provide explicit quantification of LLM error rates, direct human annotation comparisons, or controlled noise ablations. The framework relies on active selection and multi-type feedback to mitigate errors, with gains shown empirically across settings. To strengthen attribution, the revision will add a dedicated analysis subsection evaluating LLM feedback quality on sampled data, including error estimates and noise robustness tests.
  Revision: yes
- Referee: [§4 (Experiments)] Experimental section: the claim of 'superior performance ... across diverse datasets, metrics, and supervision settings' cannot be evaluated for robustness without details on baselines, ablation studies, statistical significance, or sensitivity to LLM choice and prompt variations; the abstract alone provides no such evidence.
  Authors: Section 4 already details comparisons to multiple SOTA baselines across datasets, metrics, and supervision regimes, with the full manuscript providing implementation specifics. To improve evaluability, the revision will expand this section with additional ablation results, statistical significance reporting (e.g., p-values), and sensitivity analyses to LLM variants and prompt changes.
  Revision: yes
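One way the promised p-values could be computed is a paired test across random seeds. The sketch below is an illustration, not the authors' procedure: it uses an exact sign-flip permutation test, chosen here because it needs no distributional assumptions and only the standard library. The function name is hypothetical, and the accuracy values are made-up placeholders, not results from the paper.

```python
from itertools import product

def paired_sign_flip_pvalue(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired differences.
    Null hypothesis: the two methods are exchangeable within each seed, so
    each paired difference is equally likely to be positive or negative."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # Enumerate all 2^n sign assignments; feasible for a handful of seeds.
    for signs in product((1, -1), repeat=len(diffs)):
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:  # small tolerance for float round-off
            extreme += 1
        total += 1
    return extreme / total

# Placeholder accuracies over five seeds (illustrative only).
method_acc = [71.0, 72.5, 70.0, 73.0, 71.5]
baseline_acc = [69.0, 70.0, 69.5, 70.5, 70.0]
p_value = paired_sign_flip_pvalue(method_acc, baseline_acc)
print(p_value)  # 0.0625: only the all-plus and all-minus assignments reach the observed sum
```

With five seeds the smallest attainable two-sided p-value is 2/32 = 0.0625, so a claim of significance at the 0.05 level needs at least six paired runs under this test.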
Circularity Check
No circularity: empirical claims rest on external experiments, not self-referential definitions or fits
Full rationale
The paper introduces GLEAN as a framework that applies three types of LLM feedback to improve contrastive features, generate descriptions, and align instances in generalized category discovery. Performance superiority is asserted via extensive experiments on diverse datasets rather than any derivation that reduces to fitted parameters or self-citations by construction. No equations, uniqueness theorems, or ansatzes are presented in the provided text that would create self-definitional loops or rename known results. The central motivation (LLM feedback correcting errors) is treated as an assumption whose validity is tested empirically, not presupposed in the method definition itself. This is the common case of a self-contained empirical contribution.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We mine ambiguous instances based on entropy... H_i = -∑_k p_ik log p_ik ... Neighborhood Contrastive Learning... L_ncl ... Pseudo Category Alignment L_align"
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "K-Means clustering... Student’s t-distribution... LLM feedback types"
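The entropy-based mining quoted in the first passage above, H_i = -∑ p_ik log p_ik, can be sketched as follows. The quoted passages mention K-Means and a Student's t-distribution but do not spell out how they are combined. The sketch therefore assumes, as is common in deep clustering, that the soft assignments p_ik come from a Student's t kernel over distances to cluster centroids. All function names, the `alpha` parameter, and the toy data are hypothetical.

```python
import math

def soft_assign(x, centroids, alpha=1.0):
    """Soft assignment of one embedding to each centroid using a Student's t
    kernel (assumed form; the paper's exact definition is not quoted here)."""
    weights = []
    for c in centroids:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, c))
        weights.append((1.0 + sq_dist / alpha) ** (-(alpha + 1.0) / 2.0))
    total = sum(weights)
    return [w / total for w in weights]

def entropy(p):
    """H = -sum_k p_k log p_k, skipping zero-probability terms."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0.0)

def mine_ambiguous(embeddings, centroids, budget):
    """Return indices of the `budget` highest-entropy instances and all scores."""
    scores = [entropy(soft_assign(x, centroids)) for x in embeddings]
    ranked = sorted(range(len(embeddings)), key=lambda i: scores[i], reverse=True)
    return ranked[:budget], scores

# Toy example: two centroids and three 2-D embeddings.
centroids = [[0.0, 0.0], [4.0, 0.0]]
embeddings = [[0.1, 0.0], [2.0, 0.0], [3.9, 0.1]]
selected, scores = mine_ambiguous(embeddings, centroids, budget=1)
print(selected)  # [1]: the midpoint is equidistant from both centroids
```

The midpoint gets assignment (0.5, 0.5) and therefore the maximum two-class entropy, log 2, which is why it is the first instance that would be sent to the LLM under a budget of one query.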
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- OmniGCD: Abstracting Generalized Category Discovery for Modality Agnosticism
  OmniGCD trains a Transformer once on synthetic data to enable zero-shot generalized category discovery across 16 datasets in four modalities without any dataset-specific fine-tuning.