A Code Smell Refactoring Approach using GNNs
Pith reviewed 2026-05-17 22:24 UTC · model grok-4.3
The pith
Graph neural networks on class and method graphs refactor long methods, large classes, and feature envy more effectively than prior techniques.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors show that by constructing class-level and method-level input graphs and applying graph neural networks to graph classification and node classification tasks, their approach delivers higher refactoring performance for long method, large class, and feature envy smells than traditional and state-of-the-art deep learning baselines, as measured on a large semi-automatically generated dataset.
What carries the argument
Class-level and method-level code graphs processed by graph neural networks (GCN, GraphSAGE, GAT) to perform classification tasks that decide refactoring operations.
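To make the mechanism concrete, here is a minimal pure-Python sketch of one GCN propagation step, using the standard rule ReLU(D^-1/2 (A + I) D^-1/2 H W), followed by a mean readout. It is an illustration only: the three-node graph, the feature vectors, and the untrained weights are hypothetical and are not the paper's encodings or models. Per-node outputs correspond to a node classification setting, and the pooled vector to a graph classification setting.

```python
import math

def normalized_adjacency(num_nodes, edges):
    """Return D^-1/2 (A + I) D^-1/2 for an undirected graph as dense lists."""
    a = [[0.0] * num_nodes for _ in range(num_nodes)]
    for i in range(num_nodes):
        a[i][i] = 1.0  # self-loop so each node keeps its own features
    for u, v in edges:
        a[u][v] = 1.0
        a[v][u] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_nodes)]
            for i in range(num_nodes)]

def matmul(x, y):
    """Plain dense matrix product for lists of lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

def gcn_layer(a_hat, features, weights):
    """One GCN step: ReLU(A_hat @ H @ W)."""
    z = matmul(matmul(a_hat, features), weights)
    return [[max(0.0, v) for v in row] for row in z]

def mean_readout(node_embeddings):
    """Average node embeddings into a single graph-level vector."""
    n = len(node_embeddings)
    return [sum(col) / n for col in zip(*node_embeddings)]

# Hypothetical method-level graph: three nodes connected in a chain 0-1-2.
edges = [(0, 1), (1, 2)]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up node features
weights = [[0.5, -0.5], [0.25, 0.75]]            # untrained, illustrative

a_hat = normalized_adjacency(3, edges)
node_emb = gcn_layer(a_hat, features, weights)  # input to node classification
graph_emb = mean_readout(node_emb)              # input to graph classification
```

GraphSAGE and GAT replace the fixed normalized aggregation with sampled-neighbor aggregation and learned attention weights respectively; those variants are not shown here.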
If this is right
- Refactoring tools can reduce reliance on manually defined heuristics and thresholds.
- Automated detection and correction of design flaws becomes more reliable at scale.
- Software evolution benefits from data-driven suggestions that address maintainability issues earlier.
- Development environments could integrate graph-based models to handle larger projects with less human oversight.
Where Pith is reading between the lines
- The same graph encoding might be tested on additional code smells such as god class or shotgun surgery to check broader applicability.
- Combining these GNN outputs with existing static analysis tools could create hybrid systems that verify suggested refactorings before they are applied.
- If the performance edge holds on diverse languages, the method could shift refactoring practice toward structural graph representations rather than metric thresholds alone.
Load-bearing premise
The semi-automated dataset generation produces training examples that are representative of real-world code smells and the chosen class-level and method-level graph encodings preserve the information needed for accurate refactoring decisions.
What would settle it
Evaluating the trained GNN models on an independently collected and manually labeled set of real-world code examples from open-source projects and finding that their refactoring precision or recall falls below the best baseline methods.
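The check described above can be sketched as a small comparison script. This is a hedged illustration, not the paper's evaluation code: the function names are hypothetical, and the label lists are toy values invented for the example rather than results from the paper or any real project.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def falls_below_baseline(y_true, model_pred, baseline_pred):
    """True if the model's precision or recall is lower than the baseline's."""
    mp, mr, _ = precision_recall_f1(y_true, model_pred)
    bp, br, _ = precision_recall_f1(y_true, baseline_pred)
    return mp < bp or mr < br

# Toy labels only: 1 = refactoring opportunity, 0 = none.
manual_labels = [1, 1, 1, 0, 0, 0, 1, 0]   # stand-in for a manually labeled set
gnn_pred      = [1, 1, 0, 0, 0, 1, 1, 0]   # stand-in for GNN output
baseline_pred = [1, 0, 0, 0, 1, 1, 1, 0]   # stand-in for the best baseline

gnn_scores = precision_recall_f1(manual_labels, gnn_pred)
baseline_scores = precision_recall_f1(manual_labels, baseline_pred)
claim_refuted = falls_below_baseline(manual_labels, gnn_pred, baseline_pred)
```

A real test would also need confidence intervals or a significance test across projects, since a single point comparison on a small labeled set can flip by chance.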
Original abstract
Code smells pose a major challenge in software refactoring: they indicate latent design or implementation flaws that may degrade software maintainability and evolution. Over the past decades, a variety of refactoring approaches have been proposed, which can be broadly classified into metrics-based, rule-based, and machine learning-based approaches. In recent years, deep learning-based approaches have also attracted widespread attention. However, existing techniques exhibit various limitations. Metrics- and rule-based approaches rely heavily on manually defined heuristics and thresholds, whereas deep learning-based approaches are often constrained by dataset availability and model design. In this study, we propose a graph-based deep learning approach for code smell refactoring. Specifically, we design two types of input graphs (class-level and method-level) and employ both graph classification and node classification tasks to address the refactoring of three representative code smells: long method, large class, and feature envy. For our experiments, we propose a semi-automated dataset generation approach that can generate a large-scale dataset with minimal manual effort. We implement the proposed approach with three classical graph neural network (GNN) architectures, GCN, GraphSAGE, and GAT, and evaluate its performance against both traditional and state-of-the-art deep learning approaches. The results demonstrate that the proposed approach achieves superior refactoring performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a graph-based deep learning approach for code smell refactoring. It defines class-level and method-level graph encodings and applies graph classification and node classification tasks using GCN, GraphSAGE, and GAT to refactor long method, large class, and feature envy smells. A semi-automated dataset generation process is introduced to produce a large-scale training set with minimal manual effort, and the approach is evaluated against traditional and state-of-the-art deep learning baselines, with the abstract claiming superior refactoring performance.
Significance. If the empirical results hold after proper validation, the work could contribute to automated refactoring by showing how graph neural networks capture structural dependencies better than metric- or rule-based methods. The semi-automated dataset generation and dual graph encodings represent practical strengths for scalability and reproducibility if they are shown to avoid label bias.
major comments (2)
- [Abstract] Abstract: the central claim that the proposed GNN approach 'achieves superior refactoring performance' for long method, large class, and feature envy is unsupported because no quantitative metrics (precision, recall, F1, accuracy), statistical tests, error bars, baseline implementation details, or dataset statistics are reported. This absence makes the primary result unverifiable and load-bearing for acceptance.
- [Dataset generation] Dataset generation section: the semi-automated labeling process must be shown to produce refactoring targets independent of existing rule-based or metric-based detectors. If positive/negative examples or target refactorings (e.g., extract-method locations, move-method targets) are derived from the same heuristics the paper criticizes, then both the GNN models and the traditional baselines are effectively evaluated against the same underlying rules, undermining claims of genuine improvement.
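One way to probe the independence concern in the second major comment is to measure how closely the dataset labels agree with a rule-based detector. The sketch below computes Cohen's kappa between two binary label lists. Everything concrete in it is an assumption made for illustration: the 30-line threshold, the method sizes, and the commit-derived labels are invented and do not come from the paper.

```python
def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two equal-length label lists."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
                   for c in categories)
    if expected == 1.0:
        return 1.0  # both lists contain one identical category only
    return (observed - expected) / (1.0 - expected)

# Hypothetical heuristic: flag a method as "long" above 30 lines of code.
LOC_THRESHOLD = 30
method_loc = [12, 45, 80, 20, 33, 9]                      # toy method sizes
heuristic_labels = [1 if loc > LOC_THRESHOLD else 0 for loc in method_loc]
commit_labels = [0, 1, 0, 0, 1, 1]                        # toy commit-derived labels

kappa = cohens_kappa(commit_labels, heuristic_labels)
```

A kappa close to 1 would suggest the labels largely reproduce the heuristic, which is the situation the referee warns about; a low value alone does not prove the labels are correct, only that they are not a copy of that detector.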
minor comments (2)
- Clarify the exact node and edge features used in the class-level versus method-level graphs and provide a small illustrative example or figure.
- Add a table summarizing dataset size, class balance, and train/validation/test splits.
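The summary table requested in the second minor comment could be produced along these lines. This is a sketch under stated assumptions: the 80/10/10 ratios, the seed, and the 30/70 toy label distribution are hypothetical choices for illustration, not the paper's dataset or split.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split example indices into train/validation/test, per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    splits = {"train": [], "validation": [], "test": []}
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = int(len(indices) * ratios[0])
        n_val = int(len(indices) * ratios[1])
        splits["train"] += indices[:n_train]
        splits["validation"] += indices[n_train:n_train + n_val]
        splits["test"] += indices[n_train + n_val:]  # remainder goes to test
    return splits

def summary_rows(labels, splits):
    """Rows of (split name, size, positive count, negative count)."""
    rows = []
    for name, indices in splits.items():
        counts = Counter(labels[i] for i in indices)
        rows.append((name, len(indices), counts.get(1, 0), counts.get(0, 0)))
    return rows

# Toy dataset: 30 smelly (1) and 70 clean (0) examples.
labels = [1] * 30 + [0] * 70
splits = stratified_split(labels)
rows = summary_rows(labels, splits)
for name, size, positives, negatives in rows:
    print(f"{name:<12}{size:>6}{positives:>6}{negatives:>6}")
```

Splitting per class keeps the positive rate roughly equal across the three sets. For code data, splitting by project rather than by example may be the safer choice, because methods from the same repository can leak structure between train and test.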
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve verifiability and clarity.
- Referee: [Abstract] Abstract: the central claim that the proposed GNN approach 'achieves superior refactoring performance' for long method, large class, and feature envy is unsupported because no quantitative metrics (precision, recall, F1, accuracy), statistical tests, error bars, baseline implementation details, or dataset statistics are reported. This absence makes the primary result unverifiable and load-bearing for acceptance.
  Authors: We agree that the abstract should include quantitative support for the central claim. The experimental results section reports F1 scores, precision, recall, accuracy, statistical significance tests, and comparisons to baselines for the GCN, GraphSAGE, and GAT models on the three smells, along with dataset statistics. We will revise the abstract to concisely summarize these key metrics and the evaluation setup so the primary result is verifiable from the abstract itself.
  Revision: yes
- Referee: [Dataset generation] Dataset generation section: the semi-automated labeling process must be shown to produce refactoring targets independent of existing rule-based or metric-based detectors. If positive/negative examples or target refactorings (e.g., extract-method locations, move-method targets) are derived from the same heuristics the paper criticizes, then both the GNN models and the traditional baselines are effectively evaluated against the same underlying rules, undermining claims of genuine improvement.
  Authors: We acknowledge the need to demonstrate independence. The semi-automated process identifies refactoring targets from historical commits in open-source repositories (actual extract-method and move-method changes) rather than applying the rule-based or metric-based detectors critiqued in the paper, with a small manually validated subset for quality control. To address the concern explicitly, we will expand the dataset generation section with details on data sources, the independence from criticized heuristics, dataset statistics, and a discussion of potential label biases.
  Revision: yes
Circularity Check
No circularity: derivation relies on standard GNNs and external baselines
The paper presents a GNN-based refactoring method using class- and method-level graphs with GCN/GraphSAGE/GAT, trained on a semi-automated dataset and evaluated against traditional and SOTA baselines. No equations, fitted parameters renamed as predictions, self-definitional steps, or load-bearing self-citations appear in the abstract or described approach. The central performance claims rest on comparative experiments against independent methods rather than reducing to the paper's own inputs by construction. The dataset generation is framed as addressing availability constraints without evidence of tautological label derivation in the provided text.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Graph representations of classes and methods preserve the structural information required to decide refactorings for the three target smells.
- domain assumption: Semi-automated dataset generation produces labels that are reliable enough for supervised training.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction [unclear]
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: class-level and method-level input graphs... GCN, GraphSAGE, GAT... node classification for extract opportunities
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.