Specification-Driven Code Translation Powered by Large Language Models: How Far Are We?

Soumit Kanti Saha, Fazle Rabbi, Song Wang, Jinqiu Yang


classification 💻 cs.SE

keywords code, language, translation, languages, LLMs, however, intermediate, investigate


Large Language Models (LLMs) are increasingly being applied across various domains, including code-related tasks such as code translation. Previous studies have explored using LLMs for translating code between different programming languages. Since LLMs are more effective with natural language, using natural language as an intermediate representation in code translation tasks is an intuitively appealing approach. However, whether this benefit is general or highly context-dependent remains unclear. In this work, we investigate using NL-specification as an intermediate representation for code translation. We evaluate our method using three datasets, five popular programming languages, and 29 language pair permutations. Our results show that using NL-specification alone does not lead to performance improvements. However, when combined with source code, it provides gains in certain language pairs (notably with Python and C++ as source languages), while offering no consistent improvement overall. Besides analyzing the performance of code translation, we also investigate the quality of the translated code and provide insights into the issues present in the translated code.
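The three translation settings the abstract compares (direct translation, NL-specification alone, and NL-specification combined with source code) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the prompt wording, the function names `generate_spec` and `translate`, the `mode` labels, and the `LLM` callable type are hypothetical and are not taken from the paper.

```python
from typing import Callable

# Hypothetical interface: any function that maps a prompt string to a completion.
LLM = Callable[[str], str]


def generate_spec(llm: LLM, source_code: str, source_lang: str) -> str:
    """Ask the model for a natural-language specification of the source program."""
    prompt = (
        f"Describe in plain English what the following {source_lang} program does, "
        f"including its inputs, outputs, and edge cases.\n\n{source_code}"
    )
    return llm(prompt)


def translate(
    llm: LLM,
    source_code: str,
    source_lang: str,
    target_lang: str,
    mode: str = "direct",
) -> str:
    """Translate code under one of three illustrative settings.

    mode="direct":          source code only (baseline)
    mode="spec_only":       NL specification only
    mode="spec_and_source": NL specification plus the original source code
    """
    if mode == "direct":
        prompt = (
            f"Translate the following {source_lang} code to {target_lang}."
            f"\n\n{source_code}"
        )
    elif mode in ("spec_only", "spec_and_source"):
        # The specification is the intermediate representation.
        spec = generate_spec(llm, source_code, source_lang)
        prompt = (
            f"Write a {target_lang} program that satisfies this specification."
            f"\n\nSpecification:\n{spec}"
        )
        if mode == "spec_and_source":
            prompt += f"\n\nReference {source_lang} implementation:\n{source_code}"
    else:
        raise ValueError(f"unknown mode: {mode}")
    return llm(prompt)
```

In the `spec_only` setting the target-side prompt never sees the original code, so any detail the specification omits is lost, which is one plausible reading of why the abstract reports no improvement from the specification alone.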



Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

HEJ-Robust: A Robustness Benchmark for LLM-Based Automated Program Repair
cs.SE 2026-05 unverdicted novelty 7.0

HEJ-Robust benchmark shows LLM-based program repair models drop over 50% in accuracy when buggy code is rewritten with equivalent syntax.

HEJ-Robust: A Robustness Benchmark for LLM-Based Automated Program Repair
cs.SE 2026-05 accept novelty 7.0

LLM-based Java program repair models lose over 50% of their bug-fixing success rate when presented with equivalent but syntactically varied buggy code.

Beyond Translation Accuracy: Addressing False Failures in LLM-Based Code Translation
cs.SE 2026-05 unverdicted novelty 7.0

Many reported failures in LLM-based code translation are false negatives due to evaluation pipeline issues such as improper compilation flags, missing library links, and unconfigured runtime environments rather than i...

Social Bias in LLM-Generated Code: Benchmark and Mitigation
cs.SE 2026-05 unverdicted novelty 7.0

LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 83.97%.

Beyond Translation Accuracy: Addressing False Failures in LLM-Based Code Translation
cs.SE 2026-05 unverdicted novelty 5.0

A large-scale study finds that many LLM code translation failures are false negatives due to improper evaluation configurations rather than incorrect translations.