GLEAN: Active Generalized Category Discovery with Diverse LLM Feedback

Hang Su; Henry Peng Zou; Jason Cai; Jianfeng He; Saab Mansour; Siffi Singh; Yi Nian

arxiv: 2502.18414 · v2 · submitted 2025-02-25 · 💻 cs.CL · cs.LG

Henry Peng Zou, Siffi Singh, Yi Nian, Jianfeng He, Jason Cai, Saab Mansour, Hang Su

Pith reviewed 2026-05-23 01:50 UTC · model grok-4.3

classification 💻 cs.CL cs.LG

keywords Generalized Category Discovery · LLM Feedback · Active Learning · Novel Category Recognition · Contrastive Features · Category Descriptions · Semi-supervised Learning

The pith

GLEAN improves generalized category discovery by using three types of LLM feedback to correct errors on confusing instances and generate semantic descriptions for new clusters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to solve generalized category discovery, where models must identify both known and novel categories from mostly unlabeled data after seeing limited labels on known ones. Earlier approaches struggle to fix mistakes on hard examples and to extract usable meaning from discovered clusters, often requiring costly extra human labels to work in practice. GLEAN tackles this by actively soliciting three distinct forms of feedback from large language models: refining instance contrastive features, producing category descriptions, and matching uncertain cases to those descriptions. Experiments across multiple datasets and supervision levels show consistent gains over prior methods. A reader would care if the approach can reduce dependence on human annotation while still delivering usable open-world recognition.

Core claim

GLEAN is a unified framework that actively learns from diverse and collaborative LLM feedback to improve instance-level contrastive features, generate category descriptions, and align uncertain instances with LLM-selected descriptions, thereby recognizing both known and novel categories more effectively than prior GCD methods.

What carries the argument

The GLEAN framework that integrates three specific LLM feedback signals for active correction and semantic labeling in generalized category discovery.

If this is right

LLM feedback produces improved instance-level contrastive features
Category descriptions generated by LLMs capture semantic meanings of clusters
Uncertain instances are aligned with LLM-selected category descriptions
Performance exceeds state-of-the-art models across datasets, metrics, and supervision settings

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could lower the cost of deploying open-world systems in domains where human labeling is impractical
Similar LLM feedback loops might transfer to other clustering or semi-supervised tasks that need interpretable outputs
Performance gains may grow if future LLMs deliver higher-quality or less biased feedback without extra prompting

Load-bearing premise

LLM feedback is assumed to be sufficiently accurate, diverse, and free of systematic bias to reliably correct errors on confusing instances and produce semantically meaningful category descriptions.

What would settle it

A controlled ablation in which removing the LLM feedback components yields no performance gain over the non-LLM baseline, or human raters judge the generated category descriptions as inaccurate or unhelpful.

Original abstract

Generalized Category Discovery (GCD) is a practical and challenging open-world task that aims to recognize both known and novel categories in unlabeled data using limited labeled data from known categories. Due to the lack of supervision, previous GCD methods face significant challenges, such as difficulty in rectifying errors for confusing instances, and inability to effectively uncover and leverage the semantic meanings of discovered clusters. Therefore, additional annotations are usually required for real-world applicability. However, human annotation is extremely costly and inefficient. To address these issues, we propose GLEAN, a unified framework for generalized category discovery that actively learns from diverse and collaborative LLM feedback. Our approach leverages three different types of LLM feedback to: (1) improve instance-level contrastive features, (2) generate category descriptions, and (3) align uncertain instances with LLM-selected category descriptions. Extensive experiments demonstrate the superior performance of GLEAN over state-of-the-art models across diverse datasets, metrics, and supervision settings. Our code is available at https://github.com/amazon-science/Glean.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

GLEAN puts three LLM feedback signals into one active GCD pipeline but treats LLM accuracy as given without checks or ablations.

The main takeaway is that this paper offers a single framework, GLEAN, that feeds LLM outputs into three places: contrastive feature updates, category description generation, and alignment of uncertain points to those descriptions. The goal is to reduce the human annotation burden that usually follows standard GCD methods when they hit confusing instances or vague clusters. That combination of signals is presented as new for this setting, and the abstract frames it as a practical fix for open-world recognition with limited labels from known classes. Releasing the code is a plus for anyone who wants to test the pipeline directly. The experiments are described as covering multiple datasets, metrics, and supervision levels, which at least shows an attempt to demonstrate broader applicability.

The soft spot is the missing validation of the LLM feedback itself. Nothing in the provided abstract quantifies how often the LLM gets the features, descriptions, or alignments wrong on the target data, compares those outputs to human judgments, or runs controlled noise tests. If the LLM signals contain consistent biases or hallucinations, the claimed gains over prior GCD baselines would not be cleanly attributable to the framework. That assumption sits at the center of the method, so its absence is noticeable even on a first read.

The work is aimed at people already working on generalized category discovery or active learning who are looking for ways to bring in LLMs without full supervision. A reader in that niche could extract the three-signal design and try it on their own data.

It is worth sending for peer review because the problem is real and the proposed structure is concrete enough to evaluate, though any referee would need to see the missing LLM reliability checks and ablations before the performance claims can be taken at face value.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes GLEAN, a unified active learning framework for Generalized Category Discovery (GCD) that incorporates three distinct types of feedback from large language models: (1) to refine instance-level contrastive features, (2) to generate semantic category descriptions, and (3) to align uncertain instances with LLM-selected descriptions. The work claims that this approach addresses limitations of prior GCD methods in error rectification and semantic leverage, and reports superior performance over state-of-the-art baselines across multiple datasets, metrics, and supervision regimes.

Significance. If the reported gains prove robust and the LLM feedback mechanisms are shown to be reliable, the framework could meaningfully advance practical open-world recognition by reducing dependence on expensive human annotations. The explicit use of diverse, collaborative LLM signals to target specific GCD pain points (confusing instances and cluster semantics) represents a timely integration of LLM capabilities into semi-supervised discovery pipelines.

major comments (2)

[Abstract (motivation paragraph) and §3 (Methods)] Abstract and method description: the central performance claims rest on the premise that LLM feedback is sufficiently accurate, diverse, and unbiased to correct errors on confusing instances and produce meaningful category descriptions, yet no quantification of LLM error rates, comparison to human annotations, or ablation under controlled feedback noise is provided. This assumption is load-bearing for attributing gains to the proposed framework rather than to LLM quality.
[§4 (Experiments)] Experimental section: the claim of 'superior performance ... across diverse datasets, metrics, and supervision settings' cannot be evaluated for robustness without details on baselines, ablation studies, statistical significance, or sensitivity to LLM choice and prompt variations; the abstract alone provides no such evidence.
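
The controlled-noise ablation requested in the first major comment could be set up along the following lines. This is a minimal sketch, not code from the paper or its repository: it assumes each piece of LLM feedback can be reduced to a discrete choice among `num_classes` options, and the name `corrupt_feedback` and its parameters are hypothetical.

```python
import random


def corrupt_feedback(labels, num_classes, noise_rate, seed=0):
    """Replace each label with a different class with probability noise_rate.

    Hypothetical helper: `labels` stands in for discrete LLM feedback
    (for example, the index of a selected category description).
    Requires num_classes >= 2 so that an alternative class always exists.
    """
    rng = random.Random(seed)  # local RNG so runs are reproducible
    noisy = []
    for y in labels:
        if rng.random() < noise_rate:
            alternatives = [c for c in range(num_classes) if c != y]
            noisy.append(rng.choice(alternatives))
        else:
            noisy.append(y)
    return noisy


# Placeholder feedback, not data from the paper.
clean = [0, 1, 2, 1, 0, 2, 2, 1]
for rate in (0.0, 0.5, 1.0):
    noisy = corrupt_feedback(clean, num_classes=3, noise_rate=rate)
    flipped = sum(a != b for a, b in zip(clean, noisy))
    print(f"noise_rate={rate:.1f} flipped={flipped}/{len(clean)}")
```

Sweeping `noise_rate` and re-running the downstream pipeline at each level would show how quickly performance degrades as feedback quality drops, which is the attribution evidence the comment asks for.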

minor comments (1)

The GitHub link is provided but no reproducibility checklist or hyperparameter details appear in the abstract; these should be expanded in the camera-ready version.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and recognition of the timely aspects of integrating diverse LLM signals into GCD. We address each major comment below with clarifications and commit to revisions that directly target the raised concerns on evidence for LLM reliability and experimental robustness.

Referee: [Abstract (motivation paragraph) and §3 (Methods)] Abstract and method description: the central performance claims rest on the premise that LLM feedback is sufficiently accurate, diverse, and unbiased to correct errors on confusing instances and produce meaningful category descriptions, yet no quantification of LLM error rates, comparison to human annotations, or ablation under controlled feedback noise is provided. This assumption is load-bearing for attributing gains to the proposed framework rather than to LLM quality.

Authors: We agree the manuscript does not provide explicit quantification of LLM error rates, direct human annotation comparisons, or controlled noise ablations. The framework relies on active selection and multi-type feedback to mitigate errors, with gains shown empirically across settings. To strengthen attribution, the revision will add a dedicated analysis subsection evaluating LLM feedback quality on sampled data, including error estimates and noise robustness tests.
Revision: yes

Referee: [§4 (Experiments)] Experimental section: the claim of 'superior performance ... across diverse datasets, metrics, and supervision settings' cannot be evaluated for robustness without details on baselines, ablation studies, statistical significance, or sensitivity to LLM choice and prompt variations; the abstract alone provides no such evidence.

Authors: Section 4 already details comparisons to multiple SOTA baselines across datasets, metrics, and supervision regimes, with the full manuscript providing implementation specifics. To improve evaluability, the revision will expand this section with additional ablation results, statistical significance reporting (e.g., p-values), and sensitivity analyses to LLM variants and prompt changes.
Revision: yes
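
The statistical significance reporting mentioned in this response could take several forms; the text does not say which test would be used. One simple option, sketched here as an assumption, is an exact paired sign-flip permutation test over per-seed scores. The function name and the accuracy values are hypothetical placeholders, not results from the paper.

```python
from itertools import product


def paired_sign_flip_pvalue(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired differences.

    Under the null hypothesis the sign of each paired difference is
    arbitrary, so every one of the 2**n sign patterns is enumerated.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    at_least_as_extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:  # small tolerance for float rounding
            at_least_as_extreme += 1
        total += 1
    return at_least_as_extreme / total


# Placeholder per-seed accuracies (five seeds), made up for illustration.
method_acc = [0.71, 0.70, 0.72, 0.69, 0.73]
baseline_acc = [0.68, 0.69, 0.70, 0.66, 0.71]
p_value = paired_sign_flip_pvalue(method_acc, baseline_acc)
print(f"p = {p_value:.4f}")
```

With only five seeds the smallest attainable two-sided p-value is 2/32, which illustrates why the number of runs matters when significance is reported.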

Circularity Check

0 steps flagged

No circularity: empirical claims rest on external experiments, not self-referential definitions or fits

The paper introduces GLEAN as a framework that applies three types of LLM feedback to improve contrastive features, generate descriptions, and align instances in generalized category discovery. Performance superiority is asserted via extensive experiments on diverse datasets rather than any derivation that reduces to fitted parameters or self-citations by construction. No equations, uniqueness theorems, or ansatzes are presented in the provided text that would create self-definitional loops or rename known results. The central motivation (LLM feedback correcting errors) is treated as an assumption whose validity is tested empirically, not presupposed in the method definition itself. This is the common case of a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no free parameters, axioms, or invented entities can be extracted or audited.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Each tag gives the relation between the paper passage and the cited Recognition theorem.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: We mine ambiguous instances based on entropy... H_i = -∑_k p_ik log p_ik ... Neighborhood Contrastive Learning... L_ncl ... Pseudo Category Alignment L_align

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

Paper passage: K-Means clustering... Student’s t-distribution... LLM feedback types
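
The passages quoted above mention entropy-based mining of ambiguous instances, K-Means clustering, and a Student's t-distribution, but the excerpts are elided and do not show how the pieces fit together. The sketch below is one plausible reading, stated as an assumption: soft cluster assignments p_ik come from a Student's t kernel over distances to cluster centroids (as is common in deep clustering), and the highest-entropy instances are picked for LLM feedback. The function names, the `alpha` and `budget` parameters, and the toy data are all hypothetical.

```python
import math


def soft_assign(z, centroids, alpha=1.0):
    """Soft assignment of embedding z to each centroid via a Student's t kernel."""
    weights = []
    for mu in centroids:
        sq_dist = sum((a - b) ** 2 for a, b in zip(z, mu))
        weights.append((1.0 + sq_dist / alpha) ** (-(alpha + 1.0) / 2.0))
    total = sum(weights)
    return [w / total for w in weights]


def entropy(p):
    """H = -sum_k p_k log p_k, treating 0 log 0 as 0."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0.0)


def mine_ambiguous(embeddings, centroids, budget, alpha=1.0):
    """Return indices of the `budget` instances with the highest entropy."""
    scores = [entropy(soft_assign(z, centroids, alpha)) for z in embeddings]
    ranked = sorted(range(len(embeddings)), key=lambda i: scores[i], reverse=True)
    return ranked[:budget]


# Toy 2-D example: centroids would normally come from K-Means.
centroids = [[0.0, 0.0], [4.0, 0.0]]
embeddings = [[0.1, 0.0], [2.0, 0.0], [3.9, 0.1]]
picked = mine_ambiguous(embeddings, centroids, budget=1)
print(picked)  # the point equidistant from both centroids is selected
```

The middle point sits exactly between the two centroids, so its assignment is 50/50 and its entropy reaches the two-class maximum of log 2, which is why it is the one selected.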

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

OmniGCD: Abstracting Generalized Category Discovery for Modality Agnosticism
cs.CV · 2026-04 · unverdicted · novelty 7.0

OmniGCD trains a Transformer once on synthetic data to enable zero-shot generalized category discovery across 16 datasets in four modalities without any dataset-specific fine-tuning.