
arxiv: 2605.15607 · v1 · pith:3VITAVFOnew · submitted 2026-05-15 · 💻 cs.CL · cs.LG

Syntax Without Semantics: Teaching Large Language Models to Code in an Unseen Language

Pith reviewed 2026-05-20 19:08 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords large language models · code generation · unseen languages · syntax learning · semantic transfer · implementation gap · fine-tuning

The pith

Fine-tuning teaches LLMs the syntax of an unseen programming language but fails to transfer the ability to produce correct code in it.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PyLang, a minimal imperative language designed to be absent from all pretraining data, and tests frontier models on 352 problems both zero-shot and after fine-tuning. Fine-tuning rapidly imparts the new syntax, yet models consistently underperform on PyLang compared with identical problems in Python, with gaps reaching 19 percent even after multi-task learning, preference tuning, code infilling, or latent-space objectives. An LLM judge finds that models select the same algorithm in 80 percent of cases but cannot realize it as executable PyLang, while CKA similarity analysis shows nearly identical internal representations across languages that diverge only at the output stage. The authors term this separation the implementation fidelity gap and conclude that current methods leave algorithmic understanding language-agnostic while realization remains language-specific.

Core claim

Fine-tuning on PyLang quickly teaches its syntax yet leaves models unable to map their language-agnostic algorithmic understanding into working implementations, producing an implementation fidelity gap in which internal representations converge across languages (CKA > 0.97) while output performance diverges, with Python outperforming by up to 19 percent across all tested interventions.
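
To make the CKA figure concrete, here is a minimal sketch of linear CKA between two activation matrices, with one row per aligned example and one column per hidden dimension. It is an illustration under assumptions: the summary does not state which CKA variant, layers, or pooling the paper used, and the function names are hypothetical. It is written in pure Python and is only practical for small matrices.

```python
import math


def center_columns(matrix):
    """Subtract each column's mean so every feature is zero-centered."""
    n = len(matrix)
    dims = len(matrix[0])
    means = [sum(row[j] for row in matrix) / n for j in range(dims)]
    return [[row[j] - means[j] for j in range(dims)] for row in matrix]


def gram(matrix):
    """Return the n x n linear Gram matrix K = X X^T."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in matrix] for r1 in matrix]


def frobenius_inner(a, b):
    """Sum of elementwise products of two equally shaped matrices."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def linear_cka(x, y):
    """Linear CKA between activations x and y (same number of rows).

    Assumption: rows of x and y describe the same examples in the same order,
    e.g. the same problem encoded once in Python and once in PyLang.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of rows")
    kx = gram(center_columns(x))
    ky = gram(center_columns(y))
    numerator = frobenius_inner(kx, ky)
    denominator = math.sqrt(frobenius_inner(kx, kx) * frobenius_inner(ky, ky))
    return numerator / denominator
```

Linear CKA is invariant to rotation and uniform scaling of the feature space, so a value near 1 says the two sets of representations have the same geometry over examples. It does not say that a decoder reading them will produce equally correct output, which is the distinction the core claim rests on.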

What carries the argument

The implementation fidelity gap, the separation between language-agnostic algorithmic selection and language-specific code realization that persists despite high internal representation similarity.

Load-bearing premise

That PyLang truly never appeared in any pretraining corpus and that the 352 problems keep identical difficulty and logical structure when rewritten from Python into PyLang.

What would settle it

An experiment in which a fine-tuned model reaches equal pass rates on matched PyLang and Python problems, or direct evidence that PyLang fragments existed in the original training data.
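
A sketch of how the matched comparison could be tabulated, assuming per-problem pass/fail results are available for both languages in the same problem order. The function and field names are hypothetical. The summary also does not say whether the reported 19% is an absolute or a relative difference; this sketch reports the absolute difference in pass rate.

```python
def paired_pass_summary(python_passed, pylang_passed):
    """Summarize pass rates on matched problems.

    Both arguments are sequences of booleans, where index i refers to the
    same underlying problem in each language.
    """
    if len(python_passed) != len(pylang_passed):
        raise ValueError("results must cover the same problems")
    n = len(python_passed)
    if n == 0:
        raise ValueError("no problems to compare")

    python_rate = sum(python_passed) / n
    pylang_rate = sum(pylang_passed) / n

    # Discordant pairs show where the gap comes from: problems solved in
    # one language but not in the other.
    only_python = sum(1 for a, b in zip(python_passed, pylang_passed) if a and not b)
    only_pylang = sum(1 for a, b in zip(python_passed, pylang_passed) if b and not a)

    return {
        "python_rate": python_rate,
        "pylang_rate": pylang_rate,
        "absolute_gap": python_rate - pylang_rate,
        "only_python": only_python,
        "only_pylang": only_pylang,
    }
```

Equal pass rates with few discordant pairs would count against the claimed gap. Equal pass rates with many discordant pairs in both directions would instead suggest the two problem sets are not testing the same thing.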

Original abstract

Large language models (LLMs) achieve high pass rates on code generation benchmarks, yet whether they can transfer this ability to languages absent from pretraining remains poorly understood. We introduce PyLang, a minimal imperative language absent from all pretraining corpora, and evaluate frontier models zero-shot and fine-tuned Qwen3 (4B, 8B, 32B) on 352 problems. We find that fine-tuning quickly teaches syntax but fails to transfer semantic competence: Python outperforms PyLang by up to 19% across all configurations, and no intervention (multi-task learning, preference tuning, code infilling, or latent-space objectives) closes the gap. An LLM judge reveals that frontier models select an identical algorithm to Python 80% of the time, yet cannot translate it into a working PyLang implementation., and CKA analysis confirms that fine-tuned models converge to nearly identical internal representations across languages (CKA > 0.97) while diverging at the output stage. We term this the implementation fidelity gap: models possess language-agnostic algorithmic understanding but cannot express it in an unfamiliar language. Our findings highlight the need for training methods that decouple reasoning from language-specific realization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces PyLang, a minimal imperative language absent from pretraining corpora, and evaluates zero-shot and fine-tuned Qwen3 models (4B/8B/32B) on 352 coding problems. It claims that fine-tuning rapidly teaches syntax but fails to transfer semantic competence, with Python outperforming PyLang by up to 19% across configurations; interventions including multi-task learning, preference tuning, code infilling, and latent-space objectives do not close the gap. Supporting measurements include an LLM judge reporting 80% identical algorithm selection and CKA similarity >0.97 between languages, leading to the proposed 'implementation fidelity gap' where models possess language-agnostic algorithmic understanding but cannot realize it in an unfamiliar syntax.

Significance. If substantiated, the result would be significant for understanding limitations in LLM code generation for novel or low-resource languages, showing that current fine-tuning and alignment methods do not suffice to bridge syntax acquisition to semantic realization. The empirical design with multiple interventions and internal representation analysis via CKA provides a solid foundation for the central claim and highlights the need for training paradigms that better decouple reasoning from language-specific output.

major comments (3)
  1. [§3.2] §3.2 (problem set construction): The translation process from the 352 Python problems to PyLang is not accompanied by reported controls for equivalent difficulty or structure, such as counts of control-flow constructs, output-length statistics, or human-rated difficulty scores. This is load-bearing for the central claim because the 19% gap and failed interventions could arise from translation artifacts rather than an implementation fidelity gap.
  2. [§4.3] §4.3 (intervention experiments): The descriptions of multi-task learning, preference tuning, code infilling, and latent-space objectives lack hyperparameter details, training curves, or ablation results showing why each intervention was insufficient. Without these, it remains unclear whether the persistent gap is fundamental or could be mitigated by more extensive tuning within the manuscript's scope.
  3. [§5.1] §5.1 (LLM judge and CKA): The 80% algorithm-match rate from the LLM judge is presented without the judge prompt, inter-annotator agreement, or human validation; likewise the CKA > 0.97 result does not specify the layers or representation pairs compared. These omissions weaken the support for language-agnostic algorithmic understanding.
minor comments (2)
  1. [Abstract] Abstract: The sentence containing the CKA result has stray punctuation ('implementation., and CKA'), a period followed by a comma, that should be corrected for readability.
  2. [§2] §2 (Related Work): Additional citations to prior studies on code generation for constructed or low-resource languages would strengthen the positioning of PyLang.
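
The structural control requested in major comment 1 can be sketched for the Python side with the standard `ast` module. This is an illustration, not the authors' procedure: the function name is hypothetical, and the PyLang side would need an equivalent counter built on PyLang's own parser, which the summary does not describe.

```python
import ast
from collections import Counter

# Statement-level constructs to count. Note that Python's AST represents
# `elif` as an `If` node nested in the parent's `orelse`, so each `elif`
# adds one to the "if" count.
CONTROL_FLOW_NODES = {
    ast.For: "for",
    ast.While: "while",
    ast.If: "if",
}


def count_control_flow(source):
    """Count loops and conditionals in a Python reference solution."""
    counts = Counter({label: 0 for label in CONTROL_FLOW_NODES.values()})
    for node in ast.walk(ast.parse(source)):
        for node_type, label in CONTROL_FLOW_NODES.items():
            if isinstance(node, node_type):
                counts[label] += 1
    return dict(counts)


sample = """
total = 0
for x in range(10):
    if x % 2 == 0:
        total += x
    elif x > 7:
        total -= 1
while total > 0:
    total -= 3
"""
print(count_control_flow(sample))
```

Comparing these counts per problem against the corresponding PyLang counts would show whether translation changed the structure of the reference solutions.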

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. The comments have identified valuable opportunities to improve transparency and robustness, particularly around experimental controls and details. We address each major comment below and outline the revisions we will make.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (problem set construction): The translation process from the 352 Python problems to PyLang is not accompanied by reported controls for equivalent difficulty or structure, such as counts of control-flow constructs, output-length statistics, or human-rated difficulty scores. This is load-bearing for the central claim because the 19% gap and failed interventions could arise from translation artifacts rather than an implementation fidelity gap.

    Authors: We agree that explicit controls would strengthen the equivalence argument. The problems are direct translations of the original set, preserving algorithmic structure by design. In the revised manuscript, we will add counts of control-flow constructs (loops, conditionals, etc.) and output-length statistics for both languages. Human-rated difficulty scores were not collected in the original study due to resource limits; we will instead add a discussion clarifying that translation fidelity was verified through manual inspection of a sample. We maintain that the persistent gap across interventions and model scales supports the implementation fidelity gap rather than translation artifacts. revision: partial

  2. Referee: [§4.3] §4.3 (intervention experiments): The descriptions of multi-task learning, preference tuning, code infilling, and latent-space objectives lack hyperparameter details, training curves, or ablation results showing why each intervention was insufficient. Without these, it remains unclear whether the persistent gap is fundamental or could be mitigated by more extensive tuning within the manuscript's scope.

    Authors: We will expand §4.3 and add a dedicated appendix section with full hyperparameter configurations for each intervention (learning rates, batch sizes, epochs, etc.). We will also include representative training curves and additional ablation results demonstrating the range of tuning explored. These additions will show that the gap persisted despite systematic variation, supporting our claim that current methods do not suffice to bridge the syntax-semantics divide within practical compute budgets. revision: yes

  3. Referee: [§5.1] §5.1 (LLM judge and CKA): The 80% algorithm-match rate from the LLM judge is presented without the judge prompt, inter-annotator agreement, or human validation; likewise the CKA > 0.97 result does not specify the layers or representation pairs compared. These omissions weaken the support for language-agnostic algorithmic understanding.

    Authors: We will include the complete LLM judge prompt in the appendix for reproducibility. We will specify that CKA was computed on final-layer hidden states for aligned token positions across languages. Although a single automated judge was used (precluding traditional inter-annotator agreement), we will add human validation results on a 50-problem subset to corroborate the 80% algorithm-match rate. These details will be incorporated to better substantiate the language-agnostic algorithmic understanding claim. revision: yes
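
One way the promised human validation could be reported is chance-corrected agreement between the LLM judge and a human annotator on the validated subset. The sketch below computes Cohen's kappa. It is an assumption that the authors would use this statistic, and the function name and label encoding (True for "same algorithm") are hypothetical.

```python
import math


def cohens_kappa(judge_labels, human_labels):
    """Cohen's kappa between two raters over the same items."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("both raters must label the same items")
    n = len(judge_labels)
    if n == 0:
        raise ValueError("no items to compare")

    # Observed agreement: fraction of items with identical labels.
    observed = sum(1 for a, b in zip(judge_labels, human_labels) if a == b) / n

    # Expected agreement if the raters labelled independently with their
    # own marginal label frequencies.
    categories = set(judge_labels) | set(human_labels)
    expected = sum(
        (judge_labels.count(c) / n) * (human_labels.count(c) / n)
        for c in categories
    )

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return math.nan
    return (observed - expected) / (1.0 - expected)
```

Raw agreement alone can look high when one label dominates, as it would if about 80% of items are "same algorithm", so a chance-corrected figure is the more informative check on the judge.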

Circularity Check

0 steps flagged

No circularity: empirical measurements of pass rates and representations are independent of any fitted derivation.

Full rationale

The paper is an empirical study that introduces PyLang as a novel language, runs zero-shot and fine-tuned evaluations on 352 problems, and reports direct metrics such as pass rates (Python outperforming PyLang by up to 19%), LLM judge agreement (80% identical algorithms), and CKA similarity (>0.97). No equations, first-principles derivations, or predictions are defined in terms of quantities fitted to the same data; the central claim about the implementation fidelity gap follows from these independent experimental observations rather than reducing to self-definitional constructs or self-citation chains. The work is therefore self-contained against external benchmarks of model performance.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim depends on PyLang being completely unseen and on the problems being comparable across languages. These are domain assumptions rather than derived results. No free parameters or invented physical entities are introduced; the new term 'implementation fidelity gap' is a descriptive label for observed behavior.

axioms (2)
  • [domain assumption] PyLang is absent from all pretraining corpora
    Explicitly stated in the abstract as the basis for testing zero-shot and fine-tuned transfer.
  • [domain assumption] The 352 problems are directly comparable in difficulty and structure between Python and PyLang
    Required for attributing performance differences to semantic rather than task-design factors.
invented entities (1)
  • implementation fidelity gap (no independent evidence)
    purpose: Descriptive term for the observed mismatch between algorithmic choice and correct implementation in the unseen language
    Coined in the paper to summarize the judge and CKA findings; no independent falsifiable prediction is provided for the entity itself.



