arxiv: 2605.27913 · v1 · pith:W74AOBBBnew · submitted 2026-05-27 · 💻 cs.LG

Where LLM Annotators Fail: Label-Free Learning on Graphs with LLMs

Pith reviewed 2026-06-29 14:23 UTC · model grok-4.3

classification 💻 cs.LG
keywords LLM annotation noise · label-free graph learning · cluster-conditional noise · pseudo-label correction · node classification · graph neural networks

The pith

LLM labels on graphs show reliability that varies by feature-space cluster as well as class.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that errors in LLM annotations of graph nodes are region-dependent within classes: the same class can be labeled reliably in one feature-space cluster and poorly in another. It proposes CANE to estimate these cluster-specific noise rates without ground-truth labels. The estimates then guide which pseudo-labels to trust and which to correct when training GNNs. The method outperforms the strongest label-free baselines, with the largest gains on datasets with pronounced cluster-conditional noise.

Core claim

LLM annotation errors are both class-dependent and region-dependent. CANE estimates cluster-conditional LLM reliability without ground truth labels and uses this to decide which pseudo-labels to trust and which to correct.
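To make the trust-or-correct idea concrete, here is a minimal sketch of one way a per-cluster noise estimate could drive that decision. It is not confirmed as CANE's rule: it applies the standard Bayes posterior under a class-conditional noise model within a single cluster. The names `posterior_true_class`, `decide`, and `trust_threshold`, the per-cluster class prior, and all numbers are assumptions made for illustration.

```python
from typing import List, Tuple


def posterior_true_class(prior: List[float], T: List[List[float]], observed: int) -> List[float]:
    """Posterior over the true class given the LLM label, within one cluster.

    prior[i]  : assumed P(true class = i) in this cluster
    T[i][j]   : assumed P(LLM label = j | true class = i) in this cluster
    """
    joint = [prior[i] * T[i][observed] for i in range(len(prior))]
    total = sum(joint)
    if total == 0.0:
        raise ValueError("observed label has zero probability under this model")
    return [v / total for v in joint]


def decide(prior: List[float], T: List[List[float]], observed: int,
           trust_threshold: float = 0.7) -> Tuple[str, int]:
    """Return ("trust" | "correct" | "drop", label) for one pseudo-label."""
    post = posterior_true_class(prior, T, observed)
    best = max(range(len(post)), key=post.__getitem__)
    if post[best] < trust_threshold:
        return ("drop", observed)   # no class is confident enough
    if best == observed:
        return ("trust", observed)  # the LLM label is the most probable class
    return ("correct", best)        # another class is more probable


# Hypothetical two-class clusters; every number below is made up.
T_reliable = [[0.95, 0.05], [0.05, 0.95]]
T_unreliable = [[0.60, 0.40], [0.90, 0.10]]  # LLM says class 0 even for true class 1

print(decide([0.5, 0.5], T_reliable, observed=0))    # ('trust', 0)
print(decide([0.2, 0.8], T_unreliable, observed=0))  # ('correct', 1)
print(decide([0.5, 0.5], T_unreliable, observed=0))  # ('drop', 0)
```

The same LLM label leads to three different outcomes depending on the cluster's transition matrix and class mix, which is the behavior a single global or class-level noise rate cannot express.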

What carries the argument

Cluster-Aware Noise Estimation (CANE), which partitions nodes into feature-space clusters and estimates per-cluster noise rates from the data distribution alone, without ground-truth labels.

If this is right

  • Improved accuracy in node classification using noisy LLM labels on graphs.
  • Greater gains on datasets where LLM reliability varies across clusters within classes.
  • Applicability across different GNN architectures as a label-free framework.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This suggests that cluster detection could be a general tool for handling heterogeneous annotation quality in semi-supervised learning.
  • Future work could test whether the same cluster structure appears in other modalities like text or images with LLM labels.

Load-bearing premise

The clusters found in feature space correspond to distinct regions of LLM labeling reliability, making it possible to estimate noise rates without any true labels.
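The premise can be checked on a benchmark where ground truth is available for diagnosis. The sketch below computes LLM accuracy per (class, cluster) cell and the within-class gap between the best and worst cluster. It is a diagnostic that needs true labels, so it is not part of a label-free pipeline; the function names and the toy data are assumptions, and cluster ids are taken as given from any clustering of node features.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence, Tuple


def llm_accuracy_by_cell(true_labels: Sequence[Hashable],
                         llm_labels: Sequence[Hashable],
                         cluster_ids: Sequence[Hashable]) -> Dict[Tuple, float]:
    """Map (true class, cluster) -> fraction of nodes the LLM labeled correctly."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for y, y_llm, k in zip(true_labels, llm_labels, cluster_ids):
        counts[(y, k)] += 1
        hits[(y, k)] += int(y == y_llm)
    return {cell: hits[cell] / counts[cell] for cell in counts}


def within_class_gap(cell_accuracy: Dict[Tuple, float]) -> Dict[Hashable, float]:
    """Map class -> (max - min) LLM accuracy across the clusters containing it."""
    per_class = defaultdict(list)
    for (y, _cluster), acc in cell_accuracy.items():
        per_class[y].append(acc)
    return {y: max(accs) - min(accs) for y, accs in per_class.items()}


# Toy data, made up: class 0 is split across clusters 0 and 1.
true_labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
llm_labels  = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3]

cells = llm_accuracy_by_cell(true_labels, llm_labels, cluster_ids)
gaps = within_class_gap(cells)
print(cells[(0, 0)], cells[(0, 1)])  # 1.0 0.25
print(gaps)                          # {0: 0.75, 1: 0.0}
```

In the toy data the class-level accuracy for class 0 is 5/8, which hides a perfect cluster and a cluster at 25%. A gap near zero for every class would mean the clusters do not separate regions of differing reliability.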

What would settle it

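Two observations would undercut the cluster-conditional approach: CANE's gains disappearing on datasets where LLM errors vary strongly across clusters within a class, or the same gains appearing on a graph dataset where LLM errors are uniform across clusters within each class. In either case the improvement could not be attributed to cluster-conditional noise estimation.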

Figures

Figures reproduced from arXiv: 2605.27913 by Chuxu Zhang, Jiatan Huang, Safal Thapaliya.

Figure 1: A single class-level accuracy hides large gaps between clusters of the same class. (a) The class-level view gives each DBLP class one average LLM accuracy. (b) Splitting one class into feature-space clusters reveals wide within-class variation in LLM accuracy. (c) Low-to-high cluster reliability gaps are consistently observed across different graph benchmarks.
Figure 2: Illustration of CANE pipeline. (1) A representative seed set is selected, which is used to (2) estimate a transition matrix T_c. The estimated T_c then guides (3) pseudo-label expansion and (4) cluster-conditional iterative label correction, after which (5) a final GNN is trained on the refined labels to produce node predictions.
Figure 3: Component ablation: mean accuracy change (pp) from removing each method component.
Figure 4: Probe-budget ablation: test accuracy vs. ρ, the share of the budget spent estimating T_c (GCN).
Figure 5: Sensitivity to the number of clusters K ∈ {C, 1.5C, 2C, 3C}. Accompanying text: CANE partitions each graph into K = 2C clusters to condition T_c, trading finer noise resolution against fewer annotations.
Figure 6: Budget sensitivity (GCN, 5-seed mean over the five datasets): test accuracy of CANE and LOCLE from 25% to 100% of the full annotation budget. Accompanying text (§3.6, Cost and Efficiency): In label-free graph learning, the dominant cost is the LLM API, not GNN training. All pipelines use the same annotator and query budget (Appendix G), so their annotation token cost is identical.
Figure 7: Per-cell deviation of true per-cluster LLM accuracy from what a global …
Figure 8: Extended diagnostic. (a) Per-cell deviation across six benchmarks including ogbn-arxiv and the synthetic control (superseded by the 5-dataset version in …
Figure 9: Zero-shot LLM annotation prompt used to obtain node pseudo-labels.
Original abstract

Node classification on graphs often requires labeled nodes, yet obtaining labels at graph scale is expensive. When node attributes contain semantic content, such as paper abstracts, web pages, or product descriptions, large language models (LLMs) can provide low-cost supervision by annotating a small subset of nodes. However, these LLM-generated labels are noisy, and existing label-free graph learning methods usually treat this noise as either global or class-conditional. We find that LLM annotation errors are not only class-dependent but also region-dependent: within the same class, reliability can vary sharply across feature-space clusters. In light of this, we propose Cluster-Aware Noise Estimation (CANE), a label-free learning framework that estimates cluster-conditional LLM reliability without ground truth labels, and uses this estimate to decide which pseudo-labels to trust, and which labels to correct. Across various graph benchmarks and GNN backbones, CANE improves over the strongest label-free baselines, with the largest gains on datasets exhibiting stronger cluster-conditional noise.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript argues that LLM-generated pseudo-labels for node classification on graphs exhibit noise that is both class-dependent and region-dependent within the same class, as revealed by feature-space clusters. It proposes Cluster-Aware Noise Estimation (CANE), a label-free framework that estimates per-cluster LLM reliability parameters without ground-truth labels and leverages these to selectively trust or correct pseudo-labels before training GNNs. Experiments across graph benchmarks and backbones show gains over prior label-free methods, largest where cluster-conditional noise is pronounced.

Significance. If the cluster-conditional estimation is valid, CANE offers a practical way to improve label-free graph learning when semantic node attributes allow LLM annotation. The empirical results on datasets with stronger region-dependent noise patterns provide concrete evidence of utility and highlight an under-explored structure in LLM supervision errors. The work supplies reproducible experimental comparisons that can serve as a baseline for future label-efficient graph methods.

major comments (1)
  1. [§3] §3 (CANE estimation procedure): the central claim that cluster-conditional noise rates are recoverable from the observed distribution of LLM pseudo-labels alone is not accompanied by an identifiability argument or set of sufficient conditions (e.g., on class priors per cluster, conditional independence, or within-cluster label consistency). Without such analysis the recovered reliability parameters used for trust/correction decisions rest on an unverified assumption, directly affecting the validity of the subsequent learning pipeline.
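The identifiability concern can be illustrated with a small numeric check. The sketch below assumes a per-cluster model in which the observed LLM label distribution is the class prior pushed through a noise transition matrix; that model and the helper name `observed_label_distribution` are assumptions for illustration, not the paper's stated formulation. Two different explanations produce exactly the same observed distribution.

```python
from typing import List


def observed_label_distribution(prior: List[float], T: List[List[float]]) -> List[float]:
    """q[j] = sum_i prior[i] * T[i][j]: distribution of LLM labels in one cluster."""
    n = len(prior)
    return [sum(prior[i] * T[i][j] for i in range(n)) for j in range(n)]


# Explanation A: 70/30 class mix, LLM wrong 10% of the time on each class.
q_a = observed_label_distribution([0.7, 0.3], [[0.9, 0.1], [0.1, 0.9]])

# Explanation B: 66/34 class mix, LLM never wrong.
q_b = observed_label_distribution([0.66, 0.34], [[1.0, 0.0], [0.0, 1.0]])

# Both give (0.66, 0.34) up to floating-point rounding.
same = all(abs(a - b) < 1e-12 for a, b in zip(q_a, q_b))
print(same)  # True
```

From label frequencies alone, a 10% noise rate and a 0% noise rate are indistinguishable in this cluster. Some extra structure, such as the conditions the referee lists, is needed before the estimated reliabilities can be trusted.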
minor comments (2)
  1. [Abstract, §4] The abstract and §4 could more explicitly state the clustering algorithm, number of clusters chosen, and any hyper-parameters controlling the noise estimation step so that the method is fully reproducible from the text.
  2. [Figure 2] Figure 2 (or equivalent visualization of cluster-conditional error rates) would benefit from error bars or multiple random seeds to convey variability in the estimated reliabilities.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the identifiability of the CANE procedure. We respond to the major comment below.

  1. Referee: [§3] §3 (CANE estimation procedure): the central claim that cluster-conditional noise rates are recoverable from the observed distribution of LLM pseudo-labels alone is not accompanied by an identifiability argument or set of sufficient conditions (e.g., on class priors per cluster, conditional independence, or within-cluster label consistency). Without such analysis the recovered reliability parameters used for trust/correction decisions rest on an unverified assumption, directly affecting the validity of the subsequent learning pipeline.

    Authors: We agree that §3 presents the CANE estimation procedure without an explicit identifiability analysis. The method models the observed pseudo-label distribution within each cluster as arising from a mixture of true-class distributions corrupted by cluster-specific noise rates, and recovers the parameters by maximum-likelihood estimation (via an EM-style procedure) under the modeling assumptions that (i) clusters are homogeneous with respect to LLM annotation behavior and (ii) the feature-space clustering induces regions where class-conditional label distributions are sufficiently distinct. While the empirical results across benchmarks provide supporting evidence, the referee is correct that sufficient conditions (such as variation in class priors across clusters or conditional independence of pseudo-labels given cluster and true class) are not formally stated. We will revise the manuscript to add a dedicated paragraph in §3 that articulates these assumptions and the conditions under which the cluster-conditional noise rates are identifiable from the observed label distribution alone. revision: yes
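The mixture model described in the response can be written out as follows. The notation is assumed here for illustration, since the page does not reproduce the paper's equations: $\pi_k(i)$ is the true-class prior in cluster $k$, $T_k(i,j)$ is the probability that the LLM outputs label $j$ for a node of true class $i$ in cluster $k$, and $q_k(j)$ is the observed pseudo-label frequency.

```latex
% Observed pseudo-label distribution in cluster k, for C classes
q_k(j) \;=\; \sum_{i=1}^{C} \pi_k(i)\, T_k(i,j), \qquad j = 1,\dots,C

% Parameter count per cluster
\underbrace{(C-1)}_{\text{free entries of } \pi_k}
\;+\;
\underbrace{C\,(C-1)}_{\text{free entries of } T_k}
\;=\; C^2 - 1
\quad\text{unknowns, against}\quad
C-1 \;\text{ independent observed frequencies } q_k(j).
```

Because the unknowns outnumber the observed frequencies for any $C \ge 2$, the label distribution of a single cluster cannot determine $\pi_k$ and $T_k$ by itself. This is why additional conditions, such as those named in the response, have to be stated for the estimate to be identifiable.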

Circularity Check

0 steps flagged

No circularity detected in derivation chain


The provided abstract and description present CANE as a new label-free framework that estimates cluster-conditional noise rates from the observed distribution of LLM pseudo-labels and feature clusters, then applies those estimates to trust/correct decisions. No equations, self-citations, or definitional steps are visible that would reduce the estimation to a tautology or to a fitted input renamed as a prediction. The identifiability concern raised by the referee is a question of modeling assumptions rather than a demonstrated reduction of the claimed result to its own inputs by construction. The derivation is therefore treated as self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on free parameters, axioms, or new entities introduced; full methods would be required to identify any.
