arxiv: 2507.21954 · v2 · submitted 2025-07-29 · 💻 cs.SE · cs.AI

Fine-Tuning Code Language Models to Detect Cross-Language Bugs

Zengyang Li, Yimeng Li, Binbin Huang, Peng Liang, Ran Mo, Hui Liu, Yutao Ma

Pith reviewed 2026-05-19 02:38 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords cross-language bugs · code language models · fine-tuning · bug detection · multilingual programming · Python · Java · C/C++

The pith

Fine-tuning code language models on a dataset of cross-language bugs enables better detection of errors from interactions between different programming languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether pre-trained code language models can be adapted to find bugs that appear only when code written in multiple languages runs together in one project. It builds a new dataset covering three language pairs and nine interaction types, then fine-tunes thirteen existing models on examples labeled as buggy or clean. Every model improves after this step, and models that were instead fine-tuned only on bugs from one language do poorly on the mixed-language cases. Dataset size helps results while longer code sequences do not always help, and the effect of comments varies by model. The effort addresses a gap because multilingual projects are now common yet most bug detectors still treat each language in isolation.

Core claim

We constructed a CLB dataset covering Python-C/C++, Java-C/C++, and Python-Java combinations along with nine interaction types, then fine-tuned 13 CodeLMs to classify cross-language code as containing bugs or not. All 13 models showed performance gains after fine-tuning, with UniXcoder-base reaching the highest F1 score of 0.7407. Models fine-tuned on single-language bug data performed poorly on CLB detection, indicating that cross-language bugs differ from single-language ones. Larger fine-tuning datasets improved results, longer token sequences did not necessarily help, and code comments produced mixed effects across models. Smaller CodeLMs tended to perform better than larger ones in the experimental setup.

What carries the argument

The custom CLB dataset built from three programming-language pairs and nine interaction types, which is used to fine-tune CodeLMs for binary classification of cross-language code snippets as buggy or non-buggy.
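
To make the binary-classification framing concrete, the sketch below shows one way a labeled sample could be represented. The `ClbSample` class, its field names, and the `label_counts` helper are hypothetical and are not taken from the paper or its replication package. Only the three language pairs come from the text above, and the nine interaction types are left as free-form placeholder strings because they are not listed here.

```python
from collections import Counter
from dataclasses import dataclass

# The three language pairs named in the paper.
LANGUAGE_PAIRS = ("Python-C/C++", "Java-C/C++", "Python-Java")


@dataclass(frozen=True)
class ClbSample:
    """One cross-language code snippet with a binary label (hypothetical schema)."""

    code: str
    language_pair: str     # must be one of LANGUAGE_PAIRS
    interaction_type: str  # placeholder; the nine real type names are not given here
    is_buggy: bool         # True = buggy, False = clean

    def __post_init__(self):
        if self.language_pair not in LANGUAGE_PAIRS:
            raise ValueError(f"unknown language pair: {self.language_pair}")


def label_counts(samples):
    """Count samples per (language pair, label) to check class balance."""
    counts = Counter()
    for sample in samples:
        counts[(sample.language_pair, sample.is_buggy)] += 1
    return counts
```

A per-pair label count like this is a quick way to spot class imbalance before fine-tuning, since imbalance directly affects how an F1 score should be read.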

If this is right

Larger fine-tuning datasets produce significantly higher detection performance.
Longer token sequence lengths do not necessarily raise model performance.
Code comments can raise or lower performance depending on the particular CodeLM.
Models fine-tuned only on single-language bugs remain ineffective for cross-language cases.
Smaller CodeLMs can reach higher performance than larger ones under the tested conditions.
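
Two of the factors listed above, dataset size and token sequence length, are experimental knobs. The sketch below illustrates one plausible way to vary them: a class-preserving subsample of the training set, and front truncation of a token sequence. Both function names are hypothetical, and the paper's actual sampling and truncation procedures may differ.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, fraction, seed=0):
    """Return sorted indices of a subsample that keeps each class's share.

    Assumption: dataset-size experiments draw a fixed fraction per class.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    chosen = []
    for label in sorted(by_class):
        indices = by_class[label]
        # Keep at least one example of every class.
        k = max(1, round(len(indices) * fraction))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


def truncate_tokens(token_ids, max_length):
    """Keep at most max_length tokens from the front of the sequence."""
    return token_ids[:max_length]
```

Fixing the seed keeps the subsamples reproducible, so differences between runs can be attributed to the size change rather than to which examples happened to be drawn.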

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Teams maintaining multilingual projects could add fine-tuned CodeLMs to existing test suites to catch interaction bugs before release.
The same fine-tuning recipe could be applied to language pairs not included in the three combinations studied here.
General code-analysis tools may need separate training paths for inter-language issues rather than reusing single-language models.

Load-bearing premise

The custom CLB dataset with its three language combinations and nine interaction types accurately represents real-world cross-language bugs and the fine-tuning gains will generalize to other models and data.

What would settle it

Testing the fine-tuned models on an independent collection of cross-language bugs drawn from real open-source multilingual projects and measuring whether the F1 scores remain near 0.74 or drop sharply.
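
The check described above comes down to recomputing F1 on the independent collection and comparing it with the reported value. A minimal sketch follows, assuming binary labels with 1 meaning buggy. The function names are illustrative, and the only number taken from the paper is the reported best F1 of 0.7407.

```python
REPORTED_BEST_F1 = 0.7407  # UniXcoder-base, as reported in the paper


def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the positive (buggy) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def f1_gap(y_true, y_pred):
    """Difference between F1 on new data and the reported best F1."""
    _, _, f1 = precision_recall_f1(y_true, y_pred)
    return f1 - REPORTED_BEST_F1
```

A strongly negative gap on independently collected bugs would suggest the reported score depends on the original dataset's construction.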

Figures

Figures reproduced from arXiv: 2507.21954 by Binbin Huang, Hui Liu, Peng Liang, Ran Mo, Yimeng Li, Yutao Ma, Zengyang Li.

**Figure 1.** Data collection process. Adjacent text from the paper: "We saved the information for repositories meeting these criteria and manually reviewed each repository’s description to filter out non-software repositories, such as those used for educational resources or personal static websites. Ultimately, we collected 1,696 repositories that were verified and used for subsequent analysis. Step 2: Filter Bug-related Issues. Given a repository fr…"

**Figure 2.** Details of our dataset. Adjacent text from the paper: "… libraries - such as .so files on Linux and .dll files on Windows - rather than directly including C/C++ source code, we limit our analysis to instances where Python and Java invoke C/C++ libraries, as well as cases where Java invokes Python, in order to streamline tool implementation and enhance efficiency. All data samples were collected from open-source projects on GitHub, covering…"

**Figure 3.** The line count distribution of the CLB dataset, measured in lines.

**Figure 4.** The token sequence length distribution of tokenization results produced by different models’ tokenizers.

**Figure 5.** Performance of CodeLMs in detecting CLBs with different dataset sizes.

**Figure 6.** Performance of CodeLMs in detecting CLBs with different token sequence lengths.

Original abstract

Multilingual programming, which involves using multiple programming languages (PLs) in a single project, is increasingly common due to its benefits. However, it introduces cross-language bugs (CLBs), which arise from interactions between different PLs and are difficult to detect by single-language bug detection tools. This paper investigates the potential of pre-trained code language models (CodeLMs) in CLB detection. We developed CLCFinder, a cross-language code identification tool, and constructed a CLB dataset involving three PL combinations (Python-C/C++, Java-C/C++, and Python-Java) with nine interaction types. We fine-tuned 13 CodeLMs on this dataset and evaluated their performance, analyzing the effects of dataset size, token sequence length, and code comments. Results show that all 13 CodeLMs exhibited varying degrees of performance improvement after fine-tuning, with UniXcoder-base achieving the best F1 score (0.7407). Notably, within our experimental setup, small CodeLMs tended to performe better than large ones. CodeLMs fine-tuned on single-language bug datasets performed poorly on CLB detection, demonstrating the distinction between CLBs and single-language bugs. Additionally, increasing the fine-tuning dataset size significantly improved performance, while longer token sequences did not necessarily improve the model performance. The impact of code comments varied across models. Some fine-tuned CodeLMs' performance was improved, while others showed degraded performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows fine-tuning on a new CLB dataset lifts detection across 13 CodeLMs and that single-language tuning fails to transfer, but the dataset's construction from interaction types is the part that needs scrutiny.

The main point is that fine-tuning CodeLMs on their cross-language bug dataset produces consistent gains, topping out at 0.74 F1 with UniXcoder-base, and that models tuned only on single-language bugs do poorly on CLB detection. Small models also came out ahead of larger ones in their setup. They built a dataset covering Python-C/C++, Java-C/C++, and Python-Java with nine interaction types, then ran the fine-tuning and checked effects from dataset size, token length, and comments. That gives a clear empirical signal that CLBs behave differently enough to warrant separate treatment.

The work is straightforward and covers a practical gap as multilingual projects become more common. The experiments are broad enough to show the distinction and the benefit of more data.

The soft spot sits with the dataset. The abstract describes it as constructed around defined interaction types, but without details on whether the bugs were pulled from real failing polyglot code with developer fixes or generated from templates, it is hard to judge how well the results generalize. If the examples lean artificial, the performance lift and the small-model advantage could be tied to the specific splits rather than real cross-language mismatches like type coercion or API boundaries. No statistical tests or error analysis are mentioned, so the trends stay preliminary.

This is for software engineering researchers who work on bug detection tools for mixed-language codebases. A reader who needs a starting dataset or baseline numbers for CLB tasks would get something concrete from it. The paper is coherent on its own terms and engages the literature enough to merit review. I would send it to peer review with requests for more on dataset sourcing and validation.

Referee Report

2 major / 3 minor

Summary. The paper investigates fine-tuning pre-trained CodeLMs for detecting cross-language bugs (CLBs) that arise from interactions between different programming languages in multilingual projects. It introduces CLCFinder and a custom dataset spanning three PL pairs (Python-C/C++, Java-C/C++, Python-Java) with nine interaction types, then fine-tunes and evaluates 13 CodeLMs. Key results include performance gains for all models after fine-tuning (best F1 0.7407 for UniXcoder-base), better results for smaller models than larger ones in the setup, poor transfer from single-language bug fine-tuning, and analyses of dataset size, token length, and comment effects.

Significance. If the empirical findings hold, the work provides evidence that CodeLMs can be adapted specifically for CLB detection and that CLBs differ meaningfully from single-language bugs. The use of 13 models plus controlled variations in data size and sequence length offers reproducible insights into practical factors affecting performance. These contributions could inform tool development for increasingly common polyglot codebases.

major comments (2)

[§3 and §4] §3 (Dataset Construction) and §4 (Experiments): The central claim that fine-tuning yields consistent CLB detection gains (and that single-language fine-tuning fails to transfer) depends on the nine interaction types forming a faithful proxy for real-world cross-language mismatches. The manuscript provides no validation against mined polyglot bugs with developer-reported fixes or analysis of whether the templates/mutations introduce detectable artifacts rather than genuine type-coercion or API-boundary errors.
[§4] §4 (Results and Analysis): The observation that small CodeLMs outperform large ones and the reported F1 scores lack statistical significance tests, confidence intervals, or error analysis across the 13 models. Without these, it is unclear whether the gains and size trend are robust or sensitive to the specific train/test splits and hyperparameter choices.

minor comments (3)

[Abstract] Abstract: 'performe better' is a typo.
[§5] §5 (Discussion): The varying impact of code comments is reported but not illustrated with concrete examples of how comments interact with the nine interaction types.
[Related Work] Missing references to prior work on cross-language static analysis or polyglot bug detection tools would help situate the novelty of CLCFinder.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We respond to each major comment below, indicating planned changes to the manuscript where appropriate.

Referee: [§3 and §4] §3 (Dataset Construction) and §4 (Experiments): The central claim that fine-tuning yields consistent CLB detection gains (and that single-language fine-tuning fails to transfer) depends on the nine interaction types forming a faithful proxy for real-world cross-language mismatches. The manuscript provides no validation against mined polyglot bugs with developer-reported fixes or analysis of whether the templates/mutations introduce detectable artifacts rather than genuine type-coercion or API-boundary errors.

Authors: We agree that direct validation against mined real-world polyglot bugs would provide stronger evidence for the proxy quality of our dataset. The nine interaction types were selected based on documented cross-language mismatch patterns in the multilingual software engineering literature, and the mutation templates were crafted to target type coercion and API boundary issues. We did not, however, mine or compare against developer-reported fixes from polyglot repositories. In the revision we will insert a new limitations paragraph in §3 that explicitly discusses the synthetic construction, possible template-induced artifacts, and the distinction from naturally occurring bugs, while outlining future work on mined CLB validation. This addition clarifies the scope of our claims without changing the reported experimental results. revision: partial
Referee: [§4] §4 (Results and Analysis): The observation that small CodeLMs outperform large ones and the reported F1 scores lack statistical significance tests, confidence intervals, or error analysis across the 13 models. Without these, it is unclear whether the gains and size trend are robust or sensitive to the specific train/test splits and hyperparameter choices.

Authors: We concur that the absence of statistical tests and confidence intervals limits the strength of the size-trend and performance-gain claims. The current version reports only point estimates of F1. In the revised manuscript we will add bootstrap confidence intervals for all 13 models, apply McNemar’s test to assess whether fine-tuning improvements and the small-vs-large model differences are statistically significant, and include a brief error-analysis subsection that categorizes misclassifications for the best-performing model (UniXcoder-base). These additions will be placed in §4 and will use the same train/test splits already described. revision: yes
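
The two statistical additions named in this response can be sketched with the standard library alone. This is a minimal illustration, not the authors' analysis code: it assumes a percentile bootstrap over test examples and the exact (binomial) form of McNemar's test on paired predictions. All function names and defaults (`n_resamples`, `alpha`, `seed`) are hypothetical.

```python
import math
import random


def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class; 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def bootstrap_f1_ci(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for F1 over test examples."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        scores.append(f1_score([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar p-value for two models on the same test set."""
    triples = list(zip(y_true, pred_a, pred_b))
    b = sum(1 for t, a, o in triples if a == t and o != t)  # only A correct
    c = sum(1 for t, a, o in triples if a != t and o == t)  # only B correct
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree on correctness
    tail = sum(math.comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

McNemar's test uses only the examples on which the two models disagree about correctness, so it suits comparing a model before and after fine-tuning, or a small model against a large one, on the same split.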

Circularity Check

0 steps flagged

No circularity: empirical fine-tuning evaluation on held-out data

The paper is a standard empirical ML study: it constructs a CLB dataset from three PL pairs and nine interaction types, fine-tunes 13 CodeLMs, and reports F1 scores on held-out splits. No equations, derivations, or fitted parameters are redefined as predictions. No self-citations serve as load-bearing premises for uniqueness or ansatzes. All reported improvements (e.g., UniXcoder-base F1 0.7407) are direct measurements against external test data rather than quantities forced by the training procedure itself. The central claims remain falsifiable by re-running the fine-tuning on independently collected polyglot bug data.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central results rest on the representativeness of the custom dataset and the assumption that fine-tuning transfers effectively to this new bug category; no new physical entities or mathematical axioms are introduced.

free parameters (1)

fine-tuning hyperparameters
Learning rate, batch size, epochs, and token length choices are selected to achieve the reported F1 improvements across the 13 models.

axioms (1)

domain assumption: The nine interaction types in the constructed dataset capture the essential cross-language bugs that occur in real multilingual projects.
The paper builds and evaluates on this dataset, and the abstract mentions no external validation against production codebases.

pith-pipeline@v0.9.0 · 5798 in / 1432 out tokens · 36166 ms · 2026-05-19T02:38:53.626554+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

We developed CLCFinder... constructed a CLB dataset involving three PL combinations... nine interaction types... fine-tuned 13 CodeLMs... UniXcoder-base achieving the best F1 score (0.7407)
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

small CodeLMs tended to perform better than large ones... CodeLMs fine-tuned on single-language bug datasets performed poorly on CLB detection

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
