arxiv: 2605.21537 · v1 · pith:CVNGGTNCnew · submitted 2026-05-20 · 💻 cs.SE

Articulate but Wrong: Self-Review Failures in LLM-Based Code Modernization

Pith reviewed 2026-05-22 01:39 UTC · model grok-4.3

classification 💻 cs.SE
keywords LLM code modernization · self-review failure · semantic drift · Python 2 to 3 migration · behavioral oracle · legacy code · model self-evaluation

The pith

LLMs that modernize legacy code often endorse their own changes that alter observable behavior.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether an LLM that rewrites legacy code can reliably detect when its own rewrite changes what the program actually does. It performs nearly two thousand modernization runs across eleven models on a controlled set of Python 2 snippets, measures real behavior changes with an oracle, and then asks each model to review its own output for preservation. A reader should care because developers increasingly use these models for migration work and may treat the model's self-check as a safeguard, yet the study shows that safeguard routinely fails.

Core claim

LLMs change observable behavior in 39.7 percent of modernization attempts on semantic-trap snippets. When the same model is asked to judge whether its output preserves behavior, it silently endorses 31.7 percent of the cases where behavior actually drifted, with per-model miss rates ranging from zero to one hundred percent.
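The headline percentages can be cross-checked against the counts quoted elsewhere on this page. The sketch below is only an arithmetic sanity check: the pairing of the 262 drift cases with the 660 semantic-trap attempts is an inference from the reported figures, not something the paper states in this form, and the helper name `pct` is illustrative.

```python
# Counts as quoted in the abstract and referee report on this page.
MODELS = 11
CALLS_PER_MODEL = 180      # per-model sample size mentioned by the referee
TRAP_ATTEMPTS = 660        # "n=660 each" in the abstract
DRIFT_CASES = 262          # denominator of the reported 83/262
SILENTLY_ENDORSED = 83     # drift cases the producing model endorsed

def pct(numerator, denominator):
    """Percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)

total_calls = MODELS * CALLS_PER_MODEL
# ASSUMPTION: the 262 drift cases are drawn from the 660 trap attempts.
drift_rate = pct(DRIFT_CASES, TRAP_ATTEMPTS)
self_miss_rate = pct(SILENTLY_ENDORSED, DRIFT_CASES)
gap_vs_control = round(39.7 - 7.0, 1)   # percentage points over benign controls

print(total_calls, drift_rate, self_miss_rate, gap_vs_control)
```

Under that assumption the counts agree with the reported 1,980 calls, the 39.7 percent drift rate, the 31.7 percent silent-endorsement rate, and the 32.7-point gap over controls.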

What carries the argument

The self-review prompt in which the producing model is asked to declare whether its own modernization output leaves observable behavior unchanged, scored against a type-strict behavioral oracle.
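To make the scoring concrete, here is a minimal sketch of how a type-strict comparison and a self-review verdict could be combined. It is not the paper's released oracle: the function names, the probe inputs, and the use of `//` as a stand-in for Python 2 integer division are all assumptions made for illustration.

```python
def type_strict_equal(a, b):
    """True only if a and b match in both type and value, recursively."""
    if type(a) is not type(b):
        return False
    if isinstance(a, (list, tuple)):
        return len(a) == len(b) and all(
            type_strict_equal(x, y) for x, y in zip(a, b)
        )
    if isinstance(a, dict):
        return a.keys() == b.keys() and all(
            type_strict_equal(a[k], b[k]) for k in a
        )
    return a == b

def legacy_average(values):
    # Stand-in for Python 2 behavior: `/` on two ints floors the result.
    return sum(values) // len(values)

def modernized_average(values):
    # A plausible rewrite that silently switches to true division.
    return sum(values) / len(values)

def classify_self_review(drifted, model_says_preserved):
    """Score one self-review verdict against the oracle's drift verdict."""
    if drifted:
        return "silent endorsement" if model_says_preserved else "drift caught"
    return "correct pass" if model_says_preserved else "false alarm"

# Hypothetical probe inputs; [2, 4] yields 3 vs 3.0, equal in value only.
probe_inputs = [[1, 2], [2, 4], [7, 8, 9]]
drifted = any(
    not type_strict_equal(legacy_average(v), modernized_average(v))
    for v in probe_inputs
)
print(classify_self_review(drifted, model_says_preserved=True))
```

The type check is what makes the comparison strict: `3 == 3.0` holds in Python, so a value-only comparison would let the `[2, 4]` case through.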

If this is right

  • Self-review by the producing model cannot be treated as a reliable check for semantic drift.
  • Drift occurs at high rates even on models that can articulate the relevant Python 2 versus 3 distinction.
  • Failure rates do not decrease steadily with model capability or price.
  • A small set of numeric-semantics snippets triggers drift across nearly all models and prompt phrasings.
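For readers who want a feel for what a numeric-semantics trap looks like, the sketch below contrasts two well-known Python 2 behaviors with their Python 3 counterparts. The helpers `py2_div` and `py2_round` are hypothetical emulations written for this illustration, not code from the paper's corpus, and `py2_round` ignores floating-point edge cases near .5 boundaries.

```python
import math

def py2_div(a, b):
    """Emulate Python 2 `/`: floor division for two ints, true division otherwise."""
    if isinstance(a, int) and isinstance(b, int):
        return a // b
    return a / b

def py2_round(x):
    """Emulate Python 2 round(): half away from zero, always returning a float."""
    return math.copysign(math.floor(abs(x) + 0.5), x)

# Python 3 semantics for the same expressions.
py3_quotient = 7 / 2       # true division
py3_rounded = round(2.5)   # round-half-to-even, returns an int

print(py2_div(7, 2), py3_quotient)    # legacy vs. naive port
print(py2_round(2.5), py3_rounded)    # legacy vs. naive port
```

A rewrite that leaves `/` or `round()` untouched runs without error under Python 3 but returns different values or types, which is the kind of silent change the drift measurements are about.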

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar self-review blind spots may appear when LLMs are asked to verify other outputs that require precise behavioral or factual fidelity.
  • Teams could add an independent oracle or a second model to review modernization output before deployment.
  • The task may require explicit training on legacy-to-modern behavioral equivalence rather than scale alone.

Load-bearing premise

The behavioral oracle catches every change that would matter in real production use and the sixty snippets represent the legacy code teams actually modernize.

What would settle it

Running the identical modernization-plus-self-review protocol on several hundred additional real-world legacy Python files and finding that the 31.7 percent self-endorsement rate of drifted cases changes substantially.

Figures

Figures reproduced from arXiv: 2605.21537 by Aditya Lolla, Gokul Chandra Purnachandra Reddy, Harsha Sanku.

Figure 1: The experimental setup. Each legacy snippet is modernized by each model under each
Figure 2: Drift rate by trap category. Benign-control snippets require no real modernization and
Figure 3: Drift rate by specific trap class. Numeric-semantics preservation is the dominant failure
Figure 4: Drift rate per snippet (rows, sorted by mean drift across models) and per model (columns).
Figure 5: Self-review outcomes on the model’s own semantic drift, broken down by trap class. Red
Figure 6: Per-model semantic drift rate (x-axis) against the rate at which the same model silently
Original abstract

Large language model (LLM) agents are increasingly used to migrate legacy code to modern stacks. We ask a deceptively simple question: when an LLM modernizes legacy code, can the same model be relied upon to recognize when its own output silently changes observable behavior? We run 1,980 real modernization calls across 11 production LLMs from 7 distinct families on a balanced 60-snippet legacy-Python-2 corpus, evaluate every output with a type-strict behavioral oracle, and then ask each model to judge whether its own output preserves behavior. We report four findings. (1) Semantic-preservation drift is prevalent and sharply separable from a cleanly-controlled baseline: semantic-trap snippets drift in 39.7% of attempts versus 7.0% on benign-control code that requires no real modernization (+32.7 percentage points; n=660 each). (2) Drift concentrates on specific snippets that fail across models: pairwise model agreement on which snippets are hard is high (mean Pearson r=0.52), and a small core of numeric-semantics snippets fails for nearly every model and every prompt phrasing. (3) Self-review by the producing model is not a reliable safety net: across all semantic drift cases, 31.7% are silently endorsed by the same model that produced them (83/262), and the per-model self-miss rate is strongly bimodal -- ranging from 0% on five models to 100% on one widely deployed model -- with several models explicitly articulating the very Py2/Py3 semantic distinction that broke their output, then declaring behavior preserved. (4) Drift rate is non-monotone in model capability and price: per-model rates range 5.6%-46.7% and do not track model capability cleanly, indicating the failure is task-structural rather than driven by model scale. All code, prompts, the 60-snippet corpus, the behavioral oracle, the output extractor, and the raw model outputs are released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper investigates whether LLMs used for modernizing legacy Python 2 code to Python 3 can reliably self-review their outputs for semantic drift. Across 1,980 modernization calls on 11 models and a 60-snippet corpus, evaluated via a type-strict behavioral oracle, it reports that drift occurs in 39.7% of semantic-trap cases (vs. 7.0% on controls), that 31.7% (83/262) of drifts are silently endorsed by the producing model, that per-model self-miss rates are strongly bimodal (0% to 100%), and that drift rates do not track model capability monotonically.

Significance. If the central measurements hold, the work provides a clear empirical demonstration that self-review is an unreliable safeguard for LLM-driven code modernization. The public release of the full corpus, prompts, oracle implementation, and raw outputs is a substantial strength that supports direct replication and follow-on studies. The bimodal self-miss pattern and the dissociation from model scale are findings with immediate practical implications for software engineering toolchains that rely on LLM agents.

major comments (1)
  1. [§4] §4 (Behavioral Oracle): The type-strict oracle is load-bearing for both the 39.7% drift rate and the 31.7% self-endorsement statistic. Because it only flags type violations, it can miss observable behavioral changes such as altered exception messages, mutable default argument evaluation order, string-formatting edge cases, or I/O side effects. The manuscript notes numeric-semantics failures but does not report a cross-check against differential execution traces or human judgment on a sample of outputs; without such validation the numerator of the self-miss rate remains uncertain.
minor comments (2)
  1. [Abstract] Abstract and §3: The per-model sample sizes (e.g., 180 calls per model) and the exact definition of the 660 control cases are stated clearly, but the prompt templates used for self-review are only summarized; including the full self-review prompt in an appendix would improve reproducibility.
  2. [§5.3] §5.3: The Pearson correlation (r=0.52) for pairwise model agreement on hard snippets is reported without a p-value or confidence interval; adding these would strengthen the claim that drift concentrates on specific snippets.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's significance, reproducibility, and practical implications. We respond to the major comment on the behavioral oracle below.

  1. Referee: [§4] §4 (Behavioral Oracle): The type-strict oracle is load-bearing for both the 39.7% drift rate and the 31.7% self-endorsement statistic. Because it only flags type violations, it can miss observable behavioral changes such as altered exception messages, mutable default argument evaluation order, string-formatting edge cases, or I/O side effects. The manuscript notes numeric-semantics failures but does not report a cross-check against differential execution traces or human judgment on a sample of outputs; without such validation the numerator of the self-miss rate remains uncertain.

    Authors: We agree that a type-strict oracle necessarily prioritizes detectable type mismatches and may overlook certain behavioral differences such as exception-message variations, mutable-default evaluation order, or I/O side effects. Our design choice was deliberate: the dominant Python 2/3 semantic traps (integer division, range objects, dict ordering, etc.) manifest as type or numeric-semantics changes that the oracle is engineered to catch reliably and reproducibly. The manuscript already foregrounds numeric-semantics failures as the core failure mode. The complete oracle, corpus, and raw outputs are released, enabling external verification. We did not originally include a differential-trace or human-judgment cross-check. In revision we will add (a) an explicit limitations paragraph on the oracle's scope and (b) a post-hoc validation on a random sample of 50 outputs comparing oracle verdicts against both execution traces and blinded human review. This will directly address the uncertainty in the self-miss numerator while preserving the automated, scalable nature of the primary measurement. revision: partial
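The proposed post-hoc validation could be organized along the following lines: draw a reproducible random sample of output identifiers, collect a second verdict for each, and measure agreement with the oracle. This is a sketch under stated assumptions, not the authors' procedure. Sampling from all 1,980 outputs, the fixed seed, and the use of Cohen's kappa as the agreement statistic are choices made for illustration, and the verdict lists are invented.

```python
import random

def draw_validation_sample(output_ids, k=50, seed=0):
    """Reproducible sample of k distinct output ids, without replacement."""
    rng = random.Random(seed)
    return sorted(rng.sample(output_ids, k))

def cohen_kappa(oracle, human):
    """Chance-corrected agreement between two lists of boolean drift verdicts."""
    n = len(oracle)
    observed = sum(o == h for o, h in zip(oracle, human)) / n
    p_oracle = sum(oracle) / n
    p_human = sum(human) / n
    expected = p_oracle * p_human + (1 - p_oracle) * (1 - p_human)
    if expected == 1:
        return 1.0  # both raters gave one identical constant verdict
    return (observed - expected) / (1 - expected)

# ASSUMPTION: outputs are indexed 0..1979, one id per modernization call.
sample_ids = draw_validation_sample(list(range(1980)))

# Invented verdicts for five sampled outputs (True means "behavior drifted").
oracle_verdicts = [True, True, False, False, True]
human_verdicts = [True, False, False, False, True]
print(len(sample_ids), cohen_kappa(oracle_verdicts, human_verdicts))
```

Cases where the human reviewer reports drift and the oracle does not are the ones that bear on the referee's concern, since they would indicate drift missing from the self-miss denominator.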

Circularity Check

0 steps flagged

Direct empirical measurements with no derivation chain or self-referential reduction

The paper reports results from running 1,980 modernization calls across 11 LLMs on a 60-snippet corpus, applying a type-strict behavioral oracle to detect drifts, and then querying models for self-review. Quantities such as the 31.7% self-endorsement rate (83/262) and per-model miss rates are direct counts from these experimental outputs. No equations, fitted parameters, ansatzes, or uniqueness theorems are invoked; the central claims rest on observable experimental data rather than any reduction to prior self-citations or input definitions. The study is self-contained against its own corpus and oracle.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the behavioral oracle is a faithful proxy for production behavior and that the 60 snippets capture the relevant failure modes; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: A type-strict behavioral oracle run on the original and modernized snippets will detect all behavior changes that matter for correctness.
    Invoked when the paper treats oracle verdicts as ground truth for semantic drift.

pith-pipeline@v0.9.0 · 5913 in / 1314 out tokens · 39470 ms · 2026-05-22T01:39:21.725897+00:00 · methodology
