arxiv: 2507.21954 · v2 · submitted 2025-07-29 · 💻 cs.SE · cs.AI

Fine-Tuning Code Language Models to Detect Cross-Language Bugs

Pith reviewed 2026-05-19 02:38 UTC · model grok-4.3

keywords: cross-language bugs, code language models, fine-tuning, bug detection, multilingual programming, Python, Java, C/C++

The pith

Fine-tuning code language models on a dataset of cross-language bugs enables better detection of errors from interactions between different programming languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether pre-trained code language models can be adapted to find bugs that appear only when code written in multiple languages runs together in one project. It builds a new dataset covering three language pairs and nine interaction types, then fine-tunes thirteen existing models on examples labeled as buggy or clean. Every model improves after this step, and models that were instead fine-tuned only on bugs from one language do poorly on the mixed-language cases. Dataset size helps results while longer code sequences do not always help, and the effect of comments varies by model. The effort addresses a gap because multilingual projects are now common yet most bug detectors still treat each language in isolation.

Core claim

We constructed a CLB dataset covering Python-C/C++, Java-C/C++, and Python-Java combinations along with nine interaction types, then fine-tuned 13 CodeLMs to classify cross-language code as containing bugs or not. All 13 models showed performance gains after fine-tuning, with UniXcoder-base reaching the highest F1 score of 0.7407. Models fine-tuned on single-language bug data performed poorly on CLB detection, indicating that cross-language bugs differ from single-language ones. Larger fine-tuning datasets improved results, longer token sequences did not necessarily help, and code comments produced mixed effects across models. Smaller CodeLMs tended to perform better than larger ones in the

What carries the argument

The custom CLB dataset built from three programming-language pairs and nine interaction types, which is used to fine-tune CodeLMs for binary classification of cross-language code snippets as buggy or non-buggy.
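
For readers unfamiliar with the metric, the F1 score reported for this buggy/non-buggy task can be sketched as follows. This is a minimal illustration, assuming F1 is computed with the buggy class as the positive label (this page does not say which averaging the paper uses). The function name and the toy labels are hypothetical and are not taken from the paper.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy labels invented for illustration: 1 = buggy, 0 = clean.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
precision, recall, f1 = f1_score_binary(y_true, y_pred)
print(f"precision={precision:.4f} recall={recall:.4f} F1={f1:.4f}")
```

Because F1 ignores true negatives, a model that labels most clean snippets correctly can still score low if it misses buggy ones, which is why the metric suits a bug-detection setting.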

If this is right

  • Larger fine-tuning datasets produce significantly higher detection performance.
  • Longer token sequence lengths do not necessarily raise model performance.
  • Code comments can raise or lower performance depending on the particular CodeLM.
  • Models fine-tuned only on single-language bugs remain ineffective for cross-language cases.
  • Smaller CodeLMs can reach higher performance than larger ones under the tested conditions.
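
One way to see why a longer token limit may not help is to measure how much input a given limit actually cuts off. The sketch below is an assumption-laden illustration, not the paper's procedure: it uses whitespace splitting as a stand-in for a real model tokenizer, and the function name, the sample snippets, and the library name `libdemo.so` are all hypothetical.

```python
def truncation_stats(snippets, max_len, tokenize=str.split):
    """Return (share of snippets cut off, share of tokens dropped) at max_len.

    `tokenize` is a stand-in; a real CodeLM tokenizer would yield different counts.
    """
    lengths = [len(tokenize(s)) for s in snippets]
    total = sum(lengths)
    if not lengths or total == 0:
        return 0.0, 0.0
    cut = sum(1 for n in lengths if n > max_len)
    dropped = sum(max(0, n - max_len) for n in lengths)
    return cut / len(lengths), dropped / total


# Hypothetical Python-to-C snippets, invented for illustration.
snippets = [
    "import ctypes",
    "lib = ctypes.CDLL('libdemo.so')",
    "lib.add.argtypes = [ctypes.c_int, ctypes.c_int] ; lib.add.restype = ctypes.c_int",
]
cut_share, dropped_share = truncation_stats(snippets, max_len=4)
print(f"snippets truncated: {cut_share:.2%}, tokens dropped: {dropped_share:.2%}")
```

If most samples already fit under the limit, raising it adds padding rather than information, which is one plausible reading of the bullet on token sequence length.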

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Teams maintaining multilingual projects could add fine-tuned CodeLMs to existing test suites to catch interaction bugs before release.
  • The same fine-tuning recipe could be applied to language pairs not included in the three combinations studied here.
  • General code-analysis tools may need separate training paths for inter-language issues rather than reusing single-language models.

Load-bearing premise

The custom CLB dataset with its three language combinations and nine interaction types accurately represents real-world cross-language bugs and the fine-tuning gains will generalize to other models and data.

What would settle it

Testing the fine-tuned models on an independent collection of cross-language bugs drawn from real open-source multilingual projects and measuring whether the F1 scores remain near 0.74 or drop sharply.

Figures

Figures reproduced from arXiv: 2507.21954 by Binbin Huang, Hui Liu, Peng Liang, Ran Mo, Yimeng Li, Yutao Ma, Zengyang Li.

Figure 1: Data collection process.
Accompanying text: We saved the information for repositories meeting these criteria and manually reviewed each repository’s description to filter out non-software repositories, such as those used for educational resources or personal static websites. Ultimately, we collected 1,696 repositories that were verified and used for subsequent analysis. Step 2: Filter Bug-related Issues. Given a repository fr…

Figure 2: Details of our dataset.
Accompanying text: … libraries - such as .so files on Linux and .dll files on Windows - rather than directly including C/C++ source code, we limit our analysis to instances where Python and Java invoke C/C++ libraries, as well as cases where Java invokes Python, in order to streamline tool implementation and enhance efficiency. All data samples were collected from open-source projects on GitHub, covering…

Figure 3: The line count distribution of the CLB dataset with the unit of measurement being lines.

Figure 4: The token sequence length distribution of tokenization results produced by different models’ tokenizers.

Figure 5: Performance of CodeLMs in detecting CLBs with different dataset sizes.

Figure 6: Performance of CodeLMs in detecting CLBs with different token sequence lengths.

Original abstract

Multilingual programming, which involves using multiple programming languages (PLs) in a single project, is increasingly common due to its benefits. However, it introduces cross-language bugs (CLBs), which arise from interactions between different PLs and are difficult to detect by single-language bug detection tools. This paper investigates the potential of pre-trained code language models (CodeLMs) in CLB detection. We developed CLCFinder, a cross-language code identification tool, and constructed a CLB dataset involving three PL combinations (Python-C/C++, Java-C/C++, and Python-Java) with nine interaction types. We fine-tuned 13 CodeLMs on this dataset and evaluated their performance, analyzing the effects of dataset size, token sequence length, and code comments. Results show that all 13 CodeLMs exhibited varying degrees of performance improvement after fine-tuning, with UniXcoder-base achieving the best F1 score (0.7407). Notably, within our experimental setup, small CodeLMs tended to performe better than large ones. CodeLMs fine-tuned on single-language bug datasets performed poorly on CLB detection, demonstrating the distinction between CLBs and single-language bugs. Additionally, increasing the fine-tuning dataset size significantly improved performance, while longer token sequences did not necessarily improve the model performance. The impact of code comments varied across models. Some fine-tuned CodeLMs' performance was improved, while others showed degraded performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper investigates fine-tuning pre-trained CodeLMs for detecting cross-language bugs (CLBs) that arise from interactions between different programming languages in multilingual projects. It introduces CLCFinder and a custom dataset spanning three PL pairs (Python-C/C++, Java-C/C++, Python-Java) with nine interaction types, then fine-tunes and evaluates 13 CodeLMs. Key results include performance gains for all models after fine-tuning (best F1 0.7407 for UniXcoder-base), better results for smaller models than larger ones in the setup, poor transfer from single-language bug fine-tuning, and analyses of dataset size, token length, and comment effects.

Significance. If the empirical findings hold, the work provides evidence that CodeLMs can be adapted specifically for CLB detection and that CLBs differ meaningfully from single-language bugs. The use of 13 models plus controlled variations in data size and sequence length offers reproducible insights into practical factors affecting performance. These contributions could inform tool development for increasingly common polyglot codebases.

major comments (2)
  1. [§3 and §4] §3 (Dataset Construction) and §4 (Experiments): The central claim that fine-tuning yields consistent CLB detection gains (and that single-language fine-tuning fails to transfer) depends on the nine interaction types forming a faithful proxy for real-world cross-language mismatches. The manuscript provides no validation against mined polyglot bugs with developer-reported fixes or analysis of whether the templates/mutations introduce detectable artifacts rather than genuine type-coercion or API-boundary errors.
  2. [§4] §4 (Results and Analysis): The observation that small CodeLMs outperform large ones and the reported F1 scores lack statistical significance tests, confidence intervals, or error analysis across the 13 models. Without these, it is unclear whether the gains and size trend are robust or sensitive to the specific train/test splits and hyperparameter choices.
minor comments (3)
  1. [Abstract] Abstract: 'performe better' is a typo.
  2. [§5] §5 (Discussion): The varying impact of code comments is reported but not illustrated with concrete examples of how comments interact with the nine interaction types.
  3. [Related Work] Missing references to prior work on cross-language static analysis or polyglot bug detection tools would help situate the novelty of CLCFinder.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We respond to each major comment below, indicating planned changes to the manuscript where appropriate.

Point-by-point responses
  1. Referee: [§3 and §4] §3 (Dataset Construction) and §4 (Experiments): The central claim that fine-tuning yields consistent CLB detection gains (and that single-language fine-tuning fails to transfer) depends on the nine interaction types forming a faithful proxy for real-world cross-language mismatches. The manuscript provides no validation against mined polyglot bugs with developer-reported fixes or analysis of whether the templates/mutations introduce detectable artifacts rather than genuine type-coercion or API-boundary errors.

    Authors: We agree that direct validation against mined real-world polyglot bugs would provide stronger evidence for the proxy quality of our dataset. The nine interaction types were selected based on documented cross-language mismatch patterns in the multilingual software engineering literature, and the mutation templates were crafted to target type coercion and API boundary issues. We did not, however, mine or compare against developer-reported fixes from polyglot repositories. In the revision we will insert a new limitations paragraph in §3 that explicitly discusses the synthetic construction, possible template-induced artifacts, and the distinction from naturally occurring bugs, while outlining future work on mined CLB validation. This addition clarifies the scope of our claims without changing the reported experimental results. revision: partial

  2. Referee: [§4] §4 (Results and Analysis): The observation that small CodeLMs outperform large ones and the reported F1 scores lack statistical significance tests, confidence intervals, or error analysis across the 13 models. Without these, it is unclear whether the gains and size trend are robust or sensitive to the specific train/test splits and hyperparameter choices.

    Authors: We concur that the absence of statistical tests and confidence intervals limits the strength of the size-trend and performance-gain claims. The current version reports only point estimates of F1. In the revised manuscript we will add bootstrap confidence intervals for all 13 models, apply McNemar’s test to assess whether fine-tuning improvements and the small-vs-large model differences are statistically significant, and include a brief error-analysis subsection that categorizes misclassifications for the best-performing model (UniXcoder-base). These additions will be placed in §4 and will use the same train/test splits already described. revision: yes
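
The two statistical checks named in this response can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the authors' planned analysis: it uses a percentile bootstrap over test items for the F1 interval and the exact (binomial) form of McNemar's test on paired predictions. All function names, the resample count, and the toy predictions are hypothetical. Note that McNemar's test compares per-item correctness of two models on the same test set, so it speaks to accuracy differences rather than to F1 directly.

```python
import math
import random


def f1(y_true, y_pred):
    """Binary F1 with label 1 (buggy) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_ci(y_true, y_pred, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for F1, resampling test items with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on the discordant pairs of two models."""
    b = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a == t and o != t)
    c = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a != t and o == t)
    n = b + c
    if n == 0:
        return b, c, 1.0
    # Under the null hypothesis, each discordant pair favors either model with p = 0.5.
    tail = sum(math.comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data invented for illustration: model A = fine-tuned, model B = baseline.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
pred_a = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
pred_b = [1, 0, 0, 0, 0, 1, 0, 1, 0, 0]
ci_low, ci_high = bootstrap_f1_ci(y_true, pred_a)
b, c, p_value = mcnemar_exact(y_true, pred_a, pred_b)
print(f"F1(A)={f1(y_true, pred_a):.4f}, 95% CI=({ci_low:.4f}, {ci_high:.4f})")
print(f"discordant pairs: b={b}, c={c}, exact p={p_value:.4f}")
```

With a test set as small as this toy one the interval is very wide, which illustrates the referee's concern: a point estimate of F1 alone says little about whether a ranking of 13 models is stable.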

Circularity Check

0 steps flagged

No circularity: empirical fine-tuning evaluation on held-out data

Full rationale

The paper is a standard empirical ML study: it constructs a CLB dataset from three PL pairs and nine interaction types, fine-tunes 13 CodeLMs, and reports F1 scores on held-out splits. No equations, derivations, or fitted parameters are redefined as predictions. No self-citations serve as load-bearing premises for uniqueness or ansatzes. All reported improvements (e.g., UniXcoder-base F1 0.7407) are direct measurements against external test data rather than quantities forced by the training procedure itself. The central claims remain falsifiable by re-running the fine-tuning on independently collected polyglot bug data.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central results rest on the representativeness of the custom dataset and the assumption that fine-tuning transfers effectively to this new bug category; no new physical entities or mathematical axioms are introduced.

free parameters (1)
  • fine-tuning hyperparameters
    Learning rate, batch size, epochs, and token length choices are selected to achieve the reported F1 improvements across the 13 models.
axioms (1)
  • domain assumption: The nine interaction types in the constructed dataset capture the essential cross-language bugs that occur in real multilingual projects.
    The paper builds and evaluates on this dataset, and the abstract mentions no external validation against production codebases.



