arxiv: 2502.18414 · v2 · submitted 2025-02-25 · 💻 cs.CL · cs.LG

GLEAN: Active Generalized Category Discovery with Diverse LLM Feedback

Pith reviewed 2026-05-23 01:50 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords Generalized Category Discovery · LLM Feedback · Active Learning · Novel Category Recognition · Contrastive Features · Category Descriptions · Semi-supervised Learning

The pith

GLEAN improves generalized category discovery by using three types of LLM feedback to correct errors on confusing instances and generate semantic descriptions for new clusters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to solve generalized category discovery, where models must identify both known and novel categories from mostly unlabeled data after seeing limited labels on known ones. Earlier approaches struggle to fix mistakes on hard examples and to extract usable meaning from discovered clusters, often requiring costly extra human labels to work in practice. GLEAN tackles this by actively soliciting three distinct forms of feedback from large language models: refining instance contrastive features, producing category descriptions, and matching uncertain cases to those descriptions. Experiments across multiple datasets and supervision levels show consistent gains over prior methods. A reader would care if the approach can reduce dependence on human annotation while still delivering usable open-world recognition.
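
One way to make "actively soliciting feedback on uncertain cases" concrete is a selection step that ranks instances by how ambiguous their cluster assignment is. The sketch below is an illustration only: the abstract does not state GLEAN's selection criterion, so the entropy score, the function names `entropy` and `select_uncertain`, the `budget` parameter, and the toy probabilities are all assumptions introduced here.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_uncertain(assignments, budget):
    """Return the indices of the `budget` instances with the highest entropy.

    `assignments` is a list of soft cluster-assignment distributions,
    one per instance. This is a hypothetical selection rule, not the
    criterion used in the paper.
    """
    ranked = sorted(
        range(len(assignments)),
        key=lambda i: entropy(assignments[i]),
        reverse=True,
    )
    return ranked[:budget]


# Toy soft assignments for four instances over three clusters (made up).
soft_assignments = [
    [0.98, 0.01, 0.01],  # confident
    [0.34, 0.33, 0.33],  # nearly uniform, most ambiguous
    [0.70, 0.20, 0.10],
    [0.50, 0.49, 0.01],  # torn between two clusters
]

query_indices = select_uncertain(soft_assignments, budget=2)
print(query_indices)  # -> [1, 2]
```

Only the selected instances would be sent to the LLM under this scheme, which is what keeps the query budget small relative to the size of the unlabeled pool.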

Core claim

GLEAN is a unified framework that actively learns from diverse and collaborative LLM feedback to improve instance-level contrastive features, generate category descriptions, and align uncertain instances with LLM-selected descriptions, thereby recognizing both known and novel categories more effectively than prior GCD methods.

What carries the argument

The GLEAN framework that integrates three specific LLM feedback signals for active correction and semantic labeling in generalized category discovery.

If this is right

  • LLM feedback produces improved instance-level contrastive features
  • Category descriptions generated by LLMs capture semantic meanings of clusters
  • Uncertain instances are aligned with LLM-selected category descriptions
  • Performance exceeds state-of-the-art models across datasets, metrics, and supervision settings

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could lower the cost of deploying open-world systems in domains where human labeling is impractical
  • Similar LLM feedback loops might transfer to other clustering or semi-supervised tasks that need interpretable outputs
  • Performance gains may grow if future LLMs deliver higher-quality or less biased feedback without extra prompting

Load-bearing premise

LLM feedback is assumed to be sufficiently accurate, diverse, and free of systematic bias to reliably correct errors on confusing instances and produce semantically meaningful category descriptions.

What would settle it

A controlled ablation in which removing the LLM feedback components yields no performance gain over the non-LLM baseline, or human raters judge the generated category descriptions as inaccurate or unhelpful.

Original abstract

Generalized Category Discovery (GCD) is a practical and challenging open-world task that aims to recognize both known and novel categories in unlabeled data using limited labeled data from known categories. Due to the lack of supervision, previous GCD methods face significant challenges, such as difficulty in rectifying errors for confusing instances, and inability to effectively uncover and leverage the semantic meanings of discovered clusters. Therefore, additional annotations are usually required for real-world applicability. However, human annotation is extremely costly and inefficient. To address these issues, we propose GLEAN, a unified framework for generalized category discovery that actively learns from diverse and collaborative LLM feedback. Our approach leverages three different types of LLM feedback to: (1) improve instance-level contrastive features, (2) generate category descriptions, and (3) align uncertain instances with LLM-selected category descriptions. Extensive experiments demonstrate the superior performance of GLEAN over state-of-the-art models across diverse datasets, metrics, and supervision settings. Our code is available at https://github.com/amazon-science/Glean.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes GLEAN, a unified active learning framework for Generalized Category Discovery (GCD) that incorporates three distinct types of feedback from large language models: (1) to refine instance-level contrastive features, (2) to generate semantic category descriptions, and (3) to align uncertain instances with LLM-selected descriptions. The work claims that this approach addresses limitations of prior GCD methods in error rectification and semantic leverage, and reports superior performance over state-of-the-art baselines across multiple datasets, metrics, and supervision regimes.

Significance. If the reported gains prove robust and the LLM feedback mechanisms are shown to be reliable, the framework could meaningfully advance practical open-world recognition by reducing dependence on expensive human annotations. The explicit use of diverse, collaborative LLM signals to target specific GCD pain points (confusing instances and cluster semantics) represents a timely integration of LLM capabilities into semi-supervised discovery pipelines.

major comments (2)
  1. [Abstract (motivation paragraph) and §3 (Methods)] Abstract and method description: the central performance claims rest on the premise that LLM feedback is sufficiently accurate, diverse, and unbiased to correct errors on confusing instances and produce meaningful category descriptions, yet no quantification of LLM error rates, comparison to human annotations, or ablation under controlled feedback noise is provided. This assumption is load-bearing for attributing gains to the proposed framework rather than to LLM quality.
  2. [§4 (Experiments)] Experimental section: the claim of 'superior performance ... across diverse datasets, metrics, and supervision settings' cannot be evaluated for robustness without details on baselines, ablation studies, statistical significance, or sensitivity to LLM choice and prompt variations; the abstract alone provides no such evidence.
minor comments (1)
  1. The GitHub link is provided but no reproducibility checklist or hyperparameter details appear in the abstract; these should be expanded in the camera-ready version.
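
The "ablation under controlled feedback noise" requested in major comment 1 could be set up roughly as follows. This is a minimal sketch under stated assumptions: feedback is modelled as one integer choice per queried instance, `corrupt_feedback` and `agreement` are hypothetical helpers, and the synthetic labels stand in for real LLM answers. Nothing here comes from the paper's code.

```python
import random


def corrupt_feedback(labels, num_classes, noise_rate, seed=0):
    """Replace each label, with probability `noise_rate`, by a different
    label drawn uniformly from the remaining classes."""
    rng = random.Random(seed)  # seeded so a given noise level is reproducible
    noisy = []
    for label in labels:
        if rng.random() < noise_rate:
            alternatives = [c for c in range(num_classes) if c != label]
            noisy.append(rng.choice(alternatives))
        else:
            noisy.append(label)
    return noisy


def agreement(reference, candidate):
    """Fraction of positions where the two label lists agree."""
    return sum(r == c for r, c in zip(reference, candidate)) / len(reference)


# Synthetic "clean" feedback over four classes (placeholder for LLM answers).
clean = [i % 4 for i in range(1000)]

for rate in (0.0, 0.1, 0.3):
    noisy = corrupt_feedback(clean, num_classes=4, noise_rate=rate)
    # In a real ablation, the downstream model would be retrained on `noisy`
    # and its clustering accuracy recorded at each noise level.
    print(rate, agreement(clean, noisy))
```

Plotting downstream accuracy against `noise_rate` would show how quickly the reported gains erode as feedback quality drops, which is the attribution question the referee raises.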

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and recognition of the timely aspects of integrating diverse LLM signals into GCD. We address each major comment below with clarifications and commit to revisions that directly target the raised concerns on evidence for LLM reliability and experimental robustness.

Point-by-point responses
  1. Referee: [Abstract (motivation paragraph) and §3 (Methods)] Abstract and method description: the central performance claims rest on the premise that LLM feedback is sufficiently accurate, diverse, and unbiased to correct errors on confusing instances and produce meaningful category descriptions, yet no quantification of LLM error rates, comparison to human annotations, or ablation under controlled feedback noise is provided. This assumption is load-bearing for attributing gains to the proposed framework rather than to LLM quality.

    Authors: We agree the manuscript does not provide explicit quantification of LLM error rates, direct human annotation comparisons, or controlled noise ablations. The framework relies on active selection and multi-type feedback to mitigate errors, with gains shown empirically across settings. To strengthen attribution, the revision will add a dedicated analysis subsection evaluating LLM feedback quality on sampled data, including error estimates and noise robustness tests. revision: yes

  2. Referee: [§4 (Experiments)] Experimental section: the claim of 'superior performance ... across diverse datasets, metrics, and supervision settings' cannot be evaluated for robustness without details on baselines, ablation studies, statistical significance, or sensitivity to LLM choice and prompt variations; the abstract alone provides no such evidence.

    Authors: Section 4 already details comparisons to multiple SOTA baselines across datasets, metrics, and supervision regimes, with the full manuscript providing implementation specifics. To improve evaluability, the revision will expand this section with additional ablation results, statistical significance reporting (e.g., p-values), and sensitivity analyses to LLM variants and prompt changes. revision: yes
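
For the statistical significance reporting promised in response 2, one option that needs few assumptions is an exact paired sign-flip permutation test over per-seed scores. The sketch below assumes both methods are run with the same seeds; the function name `paired_sign_flip_pvalue` is hypothetical, and the score lists are placeholder numbers chosen for illustration, not results from the paper.

```python
from itertools import product


def paired_sign_flip_pvalue(scores_a, scores_b):
    """Exact two-sided paired permutation test on the summed difference.

    Under the null hypothesis the sign of each paired difference is
    arbitrary, so every sign pattern is enumerated (2**n patterns) and
    the fraction at least as extreme as the observed sum is returned.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return extreme / total


# Placeholder accuracies over five seeds (illustrative values only).
method_scores = [0.71, 0.73, 0.70, 0.74, 0.72]
baseline_scores = [0.68, 0.70, 0.69, 0.70, 0.69]

p_value = paired_sign_flip_pvalue(method_scores, baseline_scores)
print(p_value)  # -> 0.0625
```

With five seeds the smallest attainable two-sided p-value is 2/32 = 0.0625, so a report built on this test would need more seeds before it could claim significance at the 0.05 level.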

Circularity Check

0 steps flagged

No circularity: empirical claims rest on external experiments, not self-referential definitions or fits

full rationale

The paper introduces GLEAN as a framework that applies three types of LLM feedback to improve contrastive features, generate descriptions, and align instances in generalized category discovery. Performance superiority is asserted via extensive experiments on diverse datasets rather than any derivation that reduces to fitted parameters or self-citations by construction. No equations, uniqueness theorems, or ansatzes are presented in the provided text that would create self-definitional loops or rename known results. The central motivation (LLM feedback correcting errors) is treated as an assumption whose validity is tested empirically, not presupposed in the method definition itself. This is the common case of a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be extracted or audited.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OmniGCD: Abstracting Generalized Category Discovery for Modality Agnosticism

    cs.CV · 2026-04 · unverdicted · novelty 7.0

    OmniGCD trains a Transformer once on synthetic data to enable zero-shot generalized category discovery across 16 datasets in four modalities without any dataset-specific fine-tuning.