Fine-Tuning Code Language Models to Detect Cross-Language Bugs
Pith reviewed 2026-05-19 02:38 UTC · model grok-4.3
The pith
Fine-tuning code language models on a dataset of cross-language bugs enables better detection of errors from interactions between different programming languages.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We constructed a CLB dataset covering Python-C/C++, Java-C/C++, and Python-Java combinations along with nine interaction types, then fine-tuned 13 CodeLMs to classify cross-language code as containing bugs or not. All 13 models showed performance gains after fine-tuning, with UniXcoder-base reaching the highest F1 score of 0.7407. Models fine-tuned on single-language bug data performed poorly on CLB detection, indicating that cross-language bugs differ from single-language ones. Larger fine-tuning datasets improved results, longer token sequences did not necessarily help, and code comments produced mixed effects across models. Smaller CodeLMs tended to perform better than larger ones in this experimental setup.
What carries the argument
The custom CLB dataset built from three programming-language pairs and nine interaction types, which is used to fine-tune CodeLMs for binary classification of cross-language code snippets as buggy or non-buggy.
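To make the setup concrete, here is a minimal sketch of how one labeled sample in such a dataset might be represented and flattened into a single classifier input. The class name `CLBSample`, its field names, the `[SEP]` separator string, and the free-form `interaction_type` tag are illustrative assumptions; the page does not give the paper's actual schema or the names of the nine interaction types.

```python
from dataclasses import dataclass
from typing import Tuple

# The three language combinations named in the paper.
LANGUAGE_PAIRS = ("Python-C/C++", "Java-C/C++", "Python-Java")


@dataclass(frozen=True)
class CLBSample:
    """One cross-language snippet with a binary label (hypothetical schema)."""
    language_pair: str     # must be one of LANGUAGE_PAIRS
    interaction_type: str  # free-form tag; the nine real type names are not listed here
    host_code: str         # code on the calling side
    guest_code: str        # code on the called side
    is_buggy: bool         # True means the snippet contains a cross-language bug

    def __post_init__(self):
        # Reject language pairs outside the three studied combinations.
        if self.language_pair not in LANGUAGE_PAIRS:
            raise ValueError(f"unknown language pair: {self.language_pair}")


def to_model_input(sample: CLBSample, sep: str = " [SEP] ") -> Tuple[str, int]:
    """Join both sides into one text and map the label to 0 (clean) or 1 (buggy)."""
    text = sample.host_code.strip() + sep + sample.guest_code.strip()
    return text, int(sample.is_buggy)
```

Keeping both sides of the language boundary in one input reflects the idea that the bug lives in the interaction rather than in either snippet alone; how the paper actually concatenates or tokenizes the two sides is not stated here.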
If this is right
- Larger fine-tuning datasets produce significantly higher detection performance.
- Longer token sequence lengths do not necessarily raise model performance.
- Code comments can raise or lower performance depending on the particular CodeLM.
- Models fine-tuned only on single-language bugs remain ineffective for cross-language cases.
- Smaller CodeLMs can reach higher performance than larger ones under the tested conditions.
Where Pith is reading between the lines
- Teams maintaining multilingual projects could add fine-tuned CodeLMs to existing test suites to catch interaction bugs before release.
- The same fine-tuning recipe could be applied to language pairs not included in the three combinations studied here.
- General code-analysis tools may need separate training paths for inter-language issues rather than reusing single-language models.
Load-bearing premise
The custom CLB dataset with its three language combinations and nine interaction types accurately represents real-world cross-language bugs and the fine-tuning gains will generalize to other models and data.
What would settle it
Testing the fine-tuned models on an independent collection of cross-language bugs drawn from real open-source multilingual projects and measuring whether the F1 scores remain near 0.74 or drop sharply.
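The check described above can be sketched as a small scoring routine. This is a minimal sketch, assuming that the reported F1 is the standard binary F1 with "buggy" as the positive class (label 1); the function names `binary_f1` and `f1_drop` are hypothetical, and the labels in the usage lines are made-up toy values, not results from the paper.

```python
def binary_f1(y_true, y_pred):
    """Return (precision, recall, F1) for the positive class (1 = buggy)."""
    if len(y_true) != len(y_pred):
        raise ValueError("label and prediction lists must have equal length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def f1_drop(reported_f1, y_true, y_pred):
    """Gap between a reported F1 and the F1 measured on independent data."""
    return reported_f1 - binary_f1(y_true, y_pred)[2]


# Toy usage with invented labels: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
precision, recall, f1 = binary_f1(y_true, y_pred)
gap = f1_drop(0.7407, y_true, y_pred)
```

A gap near zero on mined real-world bugs would support the dataset as a proxy; a large positive gap would suggest the fine-tuned models learned features specific to the constructed data.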
Original abstract
Multilingual programming, which involves using multiple programming languages (PLs) in a single project, is increasingly common due to its benefits. However, it introduces cross-language bugs (CLBs), which arise from interactions between different PLs and are difficult to detect by single-language bug detection tools. This paper investigates the potential of pre-trained code language models (CodeLMs) in CLB detection. We developed CLCFinder, a cross-language code identification tool, and constructed a CLB dataset involving three PL combinations (Python-C/C++, Java-C/C++, and Python-Java) with nine interaction types. We fine-tuned 13 CodeLMs on this dataset and evaluated their performance, analyzing the effects of dataset size, token sequence length, and code comments. Results show that all 13 CodeLMs exhibited varying degrees of performance improvement after fine-tuning, with UniXcoder-base achieving the best F1 score (0.7407). Notably, within our experimental setup, small CodeLMs tended to performe better than large ones. CodeLMs fine-tuned on single-language bug datasets performed poorly on CLB detection, demonstrating the distinction between CLBs and single-language bugs. Additionally, increasing the fine-tuning dataset size significantly improved performance, while longer token sequences did not necessarily improve the model performance. The impact of code comments varied across models. Some fine-tuned CodeLMs' performance was improved, while others showed degraded performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates fine-tuning pre-trained CodeLMs for detecting cross-language bugs (CLBs) that arise from interactions between different programming languages in multilingual projects. It introduces CLCFinder and a custom dataset spanning three PL pairs (Python-C/C++, Java-C/C++, Python-Java) with nine interaction types, then fine-tunes and evaluates 13 CodeLMs. Key results include performance gains for all models after fine-tuning (best F1 0.7407 for UniXcoder-base), better results for smaller models than larger ones in the setup, poor transfer from single-language bug fine-tuning, and analyses of dataset size, token length, and comment effects.
Significance. If the empirical findings hold, the work provides evidence that CodeLMs can be adapted specifically for CLB detection and that CLBs differ meaningfully from single-language bugs. The use of 13 models plus controlled variations in data size and sequence length offers reproducible insights into practical factors affecting performance. These contributions could inform tool development for increasingly common polyglot codebases.
major comments (2)
- [§3 and §4] §3 (Dataset Construction) and §4 (Experiments): The central claim that fine-tuning yields consistent CLB detection gains (and that single-language fine-tuning fails to transfer) depends on the nine interaction types forming a faithful proxy for real-world cross-language mismatches. The manuscript provides no validation against mined polyglot bugs with developer-reported fixes or analysis of whether the templates/mutations introduce detectable artifacts rather than genuine type-coercion or API-boundary errors.
- [§4] §4 (Results and Analysis): The observation that small CodeLMs outperform large ones and the reported F1 scores lack statistical significance tests, confidence intervals, or error analysis across the 13 models. Without these, it is unclear whether the gains and size trend are robust or sensitive to the specific train/test splits and hyperparameter choices.
minor comments (3)
- [Abstract] Abstract: 'performe better' is a typo.
- [§5] §5 (Discussion): The varying impact of code comments is reported but not illustrated with concrete examples of how comments interact with the nine interaction types.
- [Related Work] Adding references to prior work on cross-language static analysis and polyglot bug detection tools would help situate the novelty of CLCFinder.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. We respond to each major comment below, indicating planned changes to the manuscript where appropriate.
- Referee: [§3 and §4] §3 (Dataset Construction) and §4 (Experiments): The central claim that fine-tuning yields consistent CLB detection gains (and that single-language fine-tuning fails to transfer) depends on the nine interaction types forming a faithful proxy for real-world cross-language mismatches. The manuscript provides no validation against mined polyglot bugs with developer-reported fixes or analysis of whether the templates/mutations introduce detectable artifacts rather than genuine type-coercion or API-boundary errors.
  Authors: We agree that direct validation against mined real-world polyglot bugs would provide stronger evidence for the proxy quality of our dataset. The nine interaction types were selected based on documented cross-language mismatch patterns in the multilingual software engineering literature, and the mutation templates were crafted to target type coercion and API boundary issues. We did not, however, mine or compare against developer-reported fixes from polyglot repositories. In the revision we will insert a new limitations paragraph in §3 that explicitly discusses the synthetic construction, possible template-induced artifacts, and the distinction from naturally occurring bugs, while outlining future work on mined CLB validation. This addition clarifies the scope of our claims without changing the reported experimental results.
  Revision: partial
- Referee: [§4] §4 (Results and Analysis): The observation that small CodeLMs outperform large ones and the reported F1 scores lack statistical significance tests, confidence intervals, or error analysis across the 13 models. Without these, it is unclear whether the gains and size trend are robust or sensitive to the specific train/test splits and hyperparameter choices.
  Authors: We concur that the absence of statistical tests and confidence intervals limits the strength of the size-trend and performance-gain claims. The current version reports only point estimates of F1. In the revised manuscript we will add bootstrap confidence intervals for all 13 models, apply McNemar’s test to assess whether fine-tuning improvements and the small-vs-large model differences are statistically significant, and include a brief error-analysis subsection that categorizes misclassifications for the best-performing model (UniXcoder-base). These additions will be placed in §4 and will use the same train/test splits already described.
  Revision: yes
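The two statistical checks named in the response above can be illustrated with the standard library alone. This is a minimal sketch, not the authors' procedure: it assumes a percentile bootstrap over test items for the F1 interval and the exact (binomial) two-sided form of McNemar's test on paired predictions. The function names, the resample count, and the seed are illustrative choices.

```python
import math
import random


def f1_score(y_true, y_pred):
    """Binary F1 for the positive class (1 = buggy); 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_ci(y_true, y_pred, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for F1, resampling test items with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar p-value computed from the discordant pairs."""
    # Items where exactly one of the two models is correct.
    a_only = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a == t and b != t)
    b_only = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a != t and b == t)
    n = a_only + b_only
    if n == 0:
        return 1.0  # the models never disagree in correctness
    k = min(a_only, b_only)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test and capped at 1.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

McNemar's test compares two models on the same test items, so it fits the "before vs. after fine-tuning" and "small vs. large" comparisons only when both models are scored on an identical split, which matches the stated plan to reuse the existing train/test splits.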
Circularity Check
No circularity: empirical fine-tuning evaluation on held-out data
The paper is a standard empirical ML study: it constructs a CLB dataset from three PL pairs and nine interaction types, fine-tunes 13 CodeLMs, and reports F1 scores on held-out splits. No equations, derivations, or fitted parameters are redefined as predictions. No self-citations serve as load-bearing premises for uniqueness or ansatzes. All reported improvements (e.g., UniXcoder-base F1 0.7407) are direct measurements against external test data rather than quantities forced by the training procedure itself. The central claims remain falsifiable by re-running the fine-tuning on independently collected polyglot bug data.
Axiom & Free-Parameter Ledger
free parameters (1)
- fine-tuning hyperparameters
axioms (1)
- domain assumption: The nine interaction types in the constructed dataset capture the essential cross-language bugs that occur in real multilingual projects.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We developed CLCFinder... constructed a CLB dataset involving three PL combinations... nine interaction types... fine-tuned 13 CodeLMs... UniXcoder-base achieving the best F1 score (0.7407)"
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "small CodeLMs tended to perform better than large ones... CodeLMs fine-tuned on single-language bug datasets performed poorly on CLB detection"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.