arxiv: 2511.12069 · v3 · submitted 2025-11-15 · 💻 cs.SE · stat.ME

A Code Smell Refactoring Approach using GNNs

Pith reviewed 2026-05-17 22:24 UTC · model grok-4.3

classification 💻 cs.SE stat.ME
keywords code smells · refactoring · graph neural networks · GNN · software maintainability · long method · large class · feature envy

The pith

Graph neural networks on class and method graphs refactor long methods, large classes, and feature envy more effectively than prior techniques.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that modeling source code as graphs at the class and method levels and feeding those graphs to graph neural networks produces more accurate refactoring suggestions for three common code smells than metrics-based, rule-based, or other deep learning methods. A sympathetic reader would care because code smells reduce maintainability over time, and a data-driven graph approach could cut the manual effort needed to keep large codebases healthy. The work supports the claim with a semi-automated dataset builder and experiments using GCN, GraphSAGE, and GAT on both graph classification and node classification tasks.

Core claim

The authors show that by constructing class-level and method-level input graphs and applying graph neural networks to graph classification and node classification tasks, their approach delivers higher refactoring performance for long method, large class, and feature envy smells than traditional and state-of-the-art deep learning baselines, as measured on a large semi-automatically generated dataset.

What carries the argument

Class-level and method-level code graphs processed by graph neural networks (GCN, GraphSAGE, GAT) to perform classification tasks that decide refactoring operations.
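
To make the two task types concrete, here is a minimal sketch of a single GCN layer (the normalized propagation rule from Kipf and Welling) applied to a toy graph. It is not the authors' implementation: the four-node graph, the two-dimensional node features, and the weight matrix are all made up for illustration, and this page does not specify the paper's actual node or edge features. Per-node embeddings are what a node classification head would consume; the mean-pooled vector is what a graph classification head would consume.

```python
import math


def normalized_adjacency(num_nodes, edges):
    """Return D^-1/2 (A + I) D^-1/2 for an undirected graph as a dense matrix."""
    a = [[0.0] * num_nodes for _ in range(num_nodes)]
    for i in range(num_nodes):
        a[i][i] = 1.0  # add self-loops so each node keeps its own features
    for u, v in edges:
        a[u][v] = 1.0
        a[v][u] = 1.0
    deg = [sum(row) for row in a]
    return [
        [a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_nodes)]
        for i in range(num_nodes)
    ]


def matmul(x, y):
    """Plain dense matrix product for lists of lists."""
    return [
        [sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]


def gcn_layer(a_hat, h, w):
    """One GCN layer: ReLU(A_hat @ H @ W)."""
    z = matmul(matmul(a_hat, h), w)
    return [[max(0.0, value) for value in row] for row in z]


def mean_readout(h):
    """Graph-level embedding: the average of all node embeddings."""
    return [sum(row[j] for row in h) / len(h) for j in range(len(h[0]))]


# Hypothetical method-level graph: 4 nodes connected in a chain.
edges = [(0, 1), (1, 2), (2, 3)]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 1.0]]  # made-up features
weights = [[0.5, -0.2], [0.1, 0.3]]  # untrained, illustrative values only

a_hat = normalized_adjacency(4, edges)
node_embeddings = gcn_layer(a_hat, features, weights)  # input to a node classifier
graph_embedding = mean_readout(node_embeddings)        # input to a graph classifier
```

GraphSAGE and GAT differ in how neighbor information is aggregated (sampled aggregation and attention weights, respectively), but the split between per-node outputs and a pooled graph-level output stays the same.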

If this is right

  • Refactoring tools can reduce reliance on manually defined heuristics and thresholds.
  • Automated detection and correction of design flaws becomes more reliable at scale.
  • Software evolution benefits from data-driven suggestions that address maintainability issues earlier.
  • Development environments could integrate graph-based models to handle larger projects with less human oversight.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same graph encoding might be tested on additional code smells such as data class or shotgun surgery to check broader applicability.
  • Combining these GNN outputs with existing static analysis tools could create hybrid systems that verify suggested refactorings before they are applied.
  • If the performance edge holds on diverse languages, the method could shift refactoring practice toward structural graph representations rather than metric thresholds alone.

Load-bearing premise

The semi-automated dataset generation produces training examples that are representative of real-world code smells and the chosen class-level and method-level graph encodings preserve the information needed for accurate refactoring decisions.

What would settle it

Evaluating the trained GNN models on an independently collected and manually labeled set of real-world code examples from open-source projects and finding that their refactoring precision or recall falls below the best baseline methods.
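
The comparison described above reduces to computing precision, recall, and F1 for each method against the same manually labeled ground truth. The sketch below shows that calculation; the label vectors are toy values invented for illustration and are not results from the paper, and the names `model_pred` and `baseline_pred` are hypothetical.

```python
def precision_recall_f1(y_true, y_pred):
    """Binary precision, recall, and F1, where label 1 means 'refactoring needed'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Toy, made-up labels: manual ground truth plus two sets of predictions.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
model_pred = [1, 1, 0, 0, 0, 1, 1, 0]     # hypothetical GNN output
baseline_pred = [1, 0, 0, 0, 1, 1, 1, 0]  # hypothetical baseline output

model_scores = precision_recall_f1(y_true, model_pred)
baseline_scores = precision_recall_f1(y_true, baseline_pred)

# The claim would be undermined if the model fell below the baseline here.
model_beats_baseline = model_scores[2] > baseline_scores[2]
```

On a real held-out set the same comparison would also need a significance test or confidence intervals, since a small F1 gap on a few hundred samples can easily be noise.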

Original abstract

Code smell is a great challenge in software refactoring, which indicates latent design or implementation flaws that may degrade the software maintainability and evolution. Over the past decades, a variety of refactoring approaches have been proposed, which can be broadly classified into metrics-based, rule-based, and machine learning-based approaches. Recent years, deep learning-based approaches have also attracted widespread attention. However, existing techniques exhibit various limitations. Metrics- and rule-based approaches rely heavily on manually defined heuristics and thresholds, whereas deep learning-based approaches are often constrained by dataset availability and model design. In this study, we proposed a graph-based deep learning approach for code smell refactoring. Specifically, we designed two types of input graphs (class-level and method-level) and employed both graph classification and node classification tasks to address the refactoring of three representative code smells: long method, large class, and feature envy. In our experiment, we propose a semi-automated dataset generation approach that could generate a large-scale dataset with minimal manual effort. We implemented the proposed approach with three classical GNN (graph neural network) architectures: GCN, GraphSAGE, and GAT, and evaluated its performance against both traditional and state-of-the-art deep learning approaches. The results demonstrate that proposed approach achieves superior refactoring performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a graph-based deep learning approach for code smell refactoring. It defines class-level and method-level graph encodings and applies graph classification and node classification tasks using GCN, GraphSAGE, and GAT to refactor long method, large class, and feature envy smells. A semi-automated dataset generation process is introduced to produce a large-scale training set with minimal manual effort, and the approach is evaluated against traditional and state-of-the-art deep learning baselines, with the abstract claiming superior refactoring performance.

Significance. If the empirical results hold after proper validation, the work could contribute to automated refactoring by showing how graph neural networks capture structural dependencies better than metric- or rule-based methods. The semi-automated dataset generation and dual graph encodings represent practical strengths for scalability and reproducibility if they are shown to avoid label bias.

major comments (2)
  1. [Abstract] The central claim that the proposed GNN approach 'achieves superior refactoring performance' for long method, large class, and feature envy is unsupported in the abstract: it reports no quantitative metrics (precision, recall, F1, accuracy), statistical tests, error bars, baseline implementation details, or dataset statistics. Because acceptance rests on this claim, the omission leaves the primary result unverifiable.
  2. [Dataset generation] Dataset generation section: the semi-automated labeling process must be shown to produce refactoring targets independent of existing rule-based or metric-based detectors. If positive/negative examples or target refactorings (e.g., extract-method locations, move-method targets) are derived from the same heuristics the paper criticizes, then both the GNN models and the traditional baselines are effectively evaluated against the same underlying rules, undermining claims of genuine improvement.
minor comments (2)
  1. Clarify the exact node and edge features used in the class-level versus method-level graphs and provide a small illustrative example or figure.
  2. Add a table summarizing dataset size, class balance, and train/validation/test splits.
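
As a rough illustration of what the table requested in minor comment 2 could contain, the sketch below performs a stratified train/validation/test split and tabulates size and class balance per split. The 30/70 label distribution, the 80/10/10 fractions, and the function names are assumptions chosen for the example; the paper's actual dataset statistics are not given on this page.

```python
import random
from collections import Counter


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split sample indices so each split keeps roughly the same class balance."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    splits = {"train": [], "validation": [], "test": []}
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_train = int(len(idxs) * fractions[0])
        n_val = int(len(idxs) * fractions[1])
        splits["train"] += idxs[:n_train]
        splits["validation"] += idxs[n_train:n_train + n_val]
        splits["test"] += idxs[n_train + n_val:]  # remainder goes to test
    return splits


def summarize(labels, splits):
    """Return (split name, size, positives, negatives) rows for a summary table."""
    rows = []
    for name, idxs in splits.items():
        counts = Counter(labels[i] for i in idxs)
        rows.append((name, len(idxs), counts.get(1, 0), counts.get(0, 0)))
    return rows


# Made-up dataset: 30 smelly samples (label 1) and 70 clean samples (label 0).
labels = [1] * 30 + [0] * 70
splits = stratified_split(labels)
rows = summarize(labels, splits)

print(f"{'split':<12}{'size':>6}{'pos':>6}{'neg':>6}")
for name, size, pos, neg in rows:
    print(f"{name:<12}{size:>6}{pos:>6}{neg:>6}")
```

Reporting the split per smell type (long method, large class, feature envy) rather than in aggregate would make class imbalance visible where it matters most.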

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve verifiability and clarity.

Point-by-point responses
  1. Referee: [Abstract] The central claim that the proposed GNN approach 'achieves superior refactoring performance' for long method, large class, and feature envy is unsupported in the abstract: it reports no quantitative metrics (precision, recall, F1, accuracy), statistical tests, error bars, baseline implementation details, or dataset statistics. Because acceptance rests on this claim, the omission leaves the primary result unverifiable.

    Authors: We agree that the abstract should include quantitative support for the central claim. The experimental results section reports F1 scores, precision, recall, accuracy, statistical significance tests, and comparisons to baselines for the GCN, GraphSAGE, and GAT models on the three smells, along with dataset statistics. We will revise the abstract to concisely summarize these key metrics and the evaluation setup so the primary result is verifiable from the abstract itself.

    Revision: yes

  2. Referee: [Dataset generation] Dataset generation section: the semi-automated labeling process must be shown to produce refactoring targets independent of existing rule-based or metric-based detectors. If positive/negative examples or target refactorings (e.g., extract-method locations, move-method targets) are derived from the same heuristics the paper criticizes, then both the GNN models and the traditional baselines are effectively evaluated against the same underlying rules, undermining claims of genuine improvement.

    Authors: We acknowledge the need to demonstrate independence. The semi-automated process identifies refactoring targets from historical commits in open-source repositories (actual extract-method and move-method changes) rather than applying the rule-based or metric-based detectors critiqued in the paper, with a small manually validated subset for quality control. To address the concern explicitly, we will expand the dataset generation section with details on data sources, the independence from criticized heuristics, dataset statistics, and a discussion of potential label biases.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: derivation relies on standard GNNs and external baselines

Full rationale

The paper presents a GNN-based refactoring method using class- and method-level graphs with GCN/GraphSAGE/GAT, trained on a semi-automated dataset and evaluated against traditional and SOTA baselines. No equations, fitted parameters renamed as predictions, self-definitional steps, or load-bearing self-citations appear in the abstract or described approach. The central performance claims rest on comparative experiments against independent methods rather than reducing to the paper's own inputs by construction. The dataset generation is framed as addressing availability constraints without evidence of tautological label derivation in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that standard GNN layers can learn refactoring decisions from the constructed graphs and that the semi-automated labeling process yields sufficiently accurate ground truth; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)
  • [domain assumption] Graph representations of classes and methods preserve the structural information required to decide refactorings for the three target smells.
    Invoked when the authors state that class-level and method-level graphs are used for classification tasks.
  • [domain assumption] Semi-automated dataset generation produces labels that are reliable enough for supervised training.
    Stated as enabling large-scale training with minimal manual effort.

pith-pipeline@v0.9.0 · 5519 in / 1385 out tokens · 29091 ms · 2026-05-17T22:24:53.553755+00:00 · methodology
