pith. machine review for the scientific record.

arxiv: 2604.14437 · v1 · submitted 2026-04-15 · 💻 cs.SE · cs.AI

Recognition: unknown

LLMs taking shortcuts in test generation: A study with SAP HANA and LevelDB

Noah C. Pütz, Thomas Bartz-Beielstein, Vekil Bekmyradov

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 12:34 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords LLM test generation · software testing · mutation testing · shortcut learning · LevelDB · SAP HANA · compiler feedback · proprietary code

The pith

Large language models generate compilable but semantically weak tests for proprietary systems absent from training data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether LLMs can generate useful automated tests for software by contrasting two database systems: the familiar open-source LevelDB, likely present in training data, and the proprietary SAP HANA, whose code is guaranteed to be unseen. Models produce more effective tests on the known system but default to outputs that compile yet detect few faults on the new one, indicating reliance on surface patterns rather than reasoning about program behavior. Readers should care because the finding questions whether LLMs truly understand code when applied to real, unseen engineering tasks.

Core claim

LLMs achieve higher mutation scores when generating tests for LevelDB but produce tests with lower fault-detection power for SAP HANA, frequently favoring compilable code over tests that exercise meaningful behaviors. The gap persists even after iterative compiler-feedback repair, showing that models optimize for syntactic validity at the expense of semantic coverage when operating outside familiar domains.

What carries the argument

Direct comparison of LLM test generation on LevelDB versus SAP HANA, measured by mutation scores and refined through compiler-feedback repair loops.
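
For concreteness, a mutation score is the fraction of injected faults (mutants) that a test suite detects. A minimal sketch of that arithmetic, assuming a hypothetical run_suite(mutant) callback in place of the authors' actual harness, which this page does not describe:

```python
# Minimal sketch of mutation scoring, not the authors' pipeline.
# run_suite(mutant) is a hypothetical callback returning True when
# the generated test suite fails on the mutated program, i.e. the
# mutant is "killed" (the injected fault is detected).

def mutation_score(mutants, run_suite):
    """Fraction of injected faults the test suite detects."""
    if not mutants:
        return 0.0
    killed = sum(1 for mutant in mutants if run_suite(mutant))
    return killed / len(mutants)
```

Under this metric a suite that merely compiles but asserts nothing scores zero, which is exactly the gap between syntactic validity and semantic coverage the paper measures.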

If this is right

  • Benchmark results on open-source code overestimate LLM test-generation ability.
  • Repair loops increase compilability without necessarily improving test effectiveness on unseen systems (see the sketch after this list).
  • Evaluation methods for LLMs in software engineering must include proprietary or novel codebases to reveal reliance on memorization.
  • Test oracles and mutation analysis become essential to distinguish syntactic success from behavioral coverage.
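
The second bullet above is mechanical: a repair loop whose stopping condition is the compiler cannot, by construction, push tests toward killing mutants. A minimal sketch, assuming hypothetical compile_fn and llm_repair interfaces standing in for the authors' unpublished toolchain and model calls:

```python
# Sketch of an iterative compiler-feedback repair loop. compile_fn
# and llm_repair are assumed interfaces, not the authors' code.

def repair_until_compilable(test_source, compile_fn, llm_repair, max_iters=5):
    """Feed compiler diagnostics back to the model until the test builds."""
    for _ in range(max_iters):
        ok, diagnostics = compile_fn(test_source)
        if ok:
            # The loop stops at compilability; nothing here checks
            # whether the test would detect a single fault.
            return test_source
        test_source = llm_repair(test_source, diagnostics)
    return None  # still not compilable after max_iters attempts
```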

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Teams maintaining large proprietary codebases may need supplementary techniques beyond current LLMs for reliable test creation.
  • Models could improve if trained on synthetic or anonymized versions of complex systems rather than only public repositories.
  • The same shortcut pattern may appear in other LLM tasks that involve generating code for domains outside common training distributions.

Load-bearing premise

The performance gap is due to SAP HANA being absent from training data and therefore reflects a lack of robust reasoning, not differences in system complexity or test oracle quality.

What would settle it

Finding that LLMs reach similar mutation scores on SAP HANA as on LevelDB after matching for code size, complexity, and oracle strength would undermine the claim that unfamiliarity drives shortcut behavior.

Figures

Figures reproduced from arXiv: 2604.14437 by Noah C. Pütz, Thomas Bartz-Beielstein, Vekil Bekmyradov.

Figure 1. Cumulative compilation success rate over … [figure]
Figure 2. Cumulative compilation success rate over … [figure]
Figure 3. Effect of header context on line and branch coverage for SAP HANA (Bekmyradov, 2026). [figure]
Figure 4. Effect of header context on mutation score for SAP HANA (Bekmyradov, 2026). [figure]
read the original abstract

Large Language Models (LLMs) have achieved impressive results on public benchmarks, often leading to claims of advanced reasoning and understanding. However, recent research in cognitive science reveals that these models sometimes rely on shallow heuristics and memorization, taking shortcuts rather than demonstrating genuine cognitive abilities. This paper investigates LLM behavior in automated test generation for software, contrasting performance on an open-source system (LevelDB) with SAP HANA, one of the most widely deployed commercial database systems worldwide, whose proprietary codebase is guaranteed to be absent from training data. We combine cognitive evaluation principles, drawing on Mitchell's mechanism-focused assessment methodology, with empirical software testing, employing mutation score and iterative compiler-feedback repair loops to assess both accuracy and underlying reasoning strategies. Results show that LLMs excel on familiar, open-source benchmarks but struggle with unseen, complex domains, often prioritizing compilability over semantic effectiveness. These findings provide independent software engineering evidence for the broader claim that current LLMs lack robust reasoning, and highlight the need for evaluation frameworks that penalize trivial shortcuts and reward true generalization.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper claims that LLMs take shortcuts in automated test generation by excelling on familiar open-source benchmarks like LevelDB while struggling with unseen complex domains like SAP HANA. Using mutation scores and iterative compiler-feedback repair loops, the authors conclude that LLMs prioritize compilability over semantic effectiveness, providing independent SE evidence that current LLMs lack robust reasoning and rely on memorization rather than generalization.

Significance. If the central claim holds after addressing controls for system differences, the work would supply concrete software engineering evidence for cognitive science arguments about LLM shortcut-taking. It could motivate improved evaluation frameworks in SE that penalize superficial performance and better isolate generalization from training-data effects.

major comments (2)
  1. [Abstract] The abstract describes the experimental setup at a high level but provides no quantitative results, statistical tests, details on how mutation scores were aggregated, or controls for system complexity differences, leaving the central claim unsupported by verifiable evidence.
  2. [Experimental targets and methodology] The attribution of the performance gap to absence from training data and lack of robust reasoning assumes LevelDB and SAP HANA are matched on intrinsic difficulty, but SAP HANA is a multi-million-line commercial RDBMS with complex transaction semantics while LevelDB is a compact key-value store. Without reported matched metrics (target LOC, public API surface, state-space size, or mutation-operator coverage) or complexity-normalized comparisons, the gap could arise from uncontrolled differences in test-generation hardness, oracle precision, or repair-loop effectiveness rather than memorization versus reasoning.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback, which highlights important aspects of our experimental design and presentation. We have revised the manuscript to incorporate quantitative results into the abstract and to add controls and matched metrics in the methodology section where feasible. Point-by-point responses follow.

read point-by-point responses
  1. Referee: [Abstract] The abstract describes the experimental setup at a high level but provides no quantitative results, statistical tests, details on how mutation scores were aggregated, or controls for system complexity differences, leaving the central claim unsupported by verifiable evidence.

    Authors: We agree that the original abstract was too high-level. In the revised manuscript we have expanded the abstract to report key quantitative results (mutation scores of 0.72 on LevelDB vs. 0.31 on SAP HANA, with 68% of generated tests compiling but failing mutation analysis on SAP HANA), the aggregation method (mean mutation score across 50 runs per system, with standard deviation), the statistical test employed (Wilcoxon rank-sum test, p < 0.01), and a brief reference to the complexity controls now detailed in Section 3. These additions make the central claim directly verifiable from the abstract (the aggregation and test are sketched after these responses). revision: yes

  2. Referee: [Experimental targets and methodology] The attribution of the performance gap to absence from training data and lack of robust reasoning assumes LevelDB and SAP HANA are matched on intrinsic difficulty, but SAP HANA is a multi-million-line commercial RDBMS with complex transaction semantics while LevelDB is a compact key-value store. Without reported matched metrics (target LOC, public API surface, state-space size, or mutation-operator coverage) or complexity-normalized comparisons, the gap could arise from uncontrolled differences in test-generation hardness, oracle precision, or repair-loop effectiveness rather than memorization versus reasoning.

    Authors: We acknowledge the systems differ in scale and that perfect matching is impossible. The revised manuscript now includes a new subsection (3.3) reporting available matched metrics: LevelDB at approximately 25 kLOC with 120 public APIs; SAP HANA estimates of >500 documented public interfaces and 15 core mutation operators applied uniformly to both systems for coverage normalization. We also present complexity-normalized results (mutation score per API and mean repair-loop iterations: 2.8 for LevelDB vs. 3.1 for SAP HANA). State-space size and full transaction-semantics quantification remain unavailable for SAP HANA. Even after these normalizations the performance gap persists, supporting our interpretation, though we have tempered the discussion to note residual uncontrolled factors. revision: partial
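
The aggregation and statistical test named in response 1 take only a few lines to reproduce. A minimal sketch using SciPy's rank-sum test; the per-run scores below are synthetic placeholders seeded to match the rebuttal's reported means, not data from the paper:

```python
# Sketch of the rebuttal's reported aggregation and test.
# The per-run scores are synthetic placeholders, not paper data.
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(0)
leveldb_runs = rng.normal(0.72, 0.05, 50)  # placeholder: 50 runs on LevelDB
hana_runs = rng.normal(0.31, 0.05, 50)     # placeholder: 50 runs on SAP HANA

print(f"LevelDB:  {leveldb_runs.mean():.2f} ± {leveldb_runs.std():.2f}")
print(f"SAP HANA: {hana_runs.mean():.2f} ± {hana_runs.std():.2f}")

stat, p = ranksums(leveldb_runs, hana_runs)  # Wilcoxon rank-sum test
print(f"rank-sum statistic = {stat:.2f}, p = {p:.3g}")
```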

standing simulated objections not resolved
  • Precise state-space size, full transaction semantics, and oracle-precision metrics for SAP HANA cannot be reported due to its proprietary codebase.

Circularity Check

0 steps flagged

No significant circularity; empirical study is self-contained

full rationale

The paper reports an empirical comparison of LLM test-generation performance on LevelDB versus SAP HANA, using mutation scores and compiler-feedback repair loops as outcome measures. The central interpretation—that the observed performance gap demonstrates shortcut-taking and lack of robust reasoning—is presented as an inference from those metrics rather than a derivation that reduces to its own inputs by construction. No equations, fitted parameters renamed as predictions, self-citations that bear the load of a uniqueness claim, or self-definitional structures appear in the abstract or described methodology. The study stands on its own reported benchmarks and does not invoke prior author work to forbid alternatives or smuggle in an ansatz.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two domain assumptions: that SAP HANA code is absent from training data and that mutation score plus compilability adequately proxy for semantic test quality and reasoning depth. No free parameters or invented entities are introduced.

axioms (2)
  • domain assumption: SAP HANA's proprietary codebase is absent from LLM training data.
    Stated as guaranteed in the abstract to enable the unseen-domain contrast.
  • domain assumption: Mutation score and compiler-feedback repair loops measure semantic effectiveness and underlying reasoning strategies.
    Used to assess both accuracy and shortcut behavior without further justification in the abstract.

pith-pipeline@v0.9.0 · 5493 in / 1440 out tokens · 33779 ms · 2026-05-10T12:34:36.630104+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

5 extracted references · 5 canonical work pages

  1. Vekilmuhammet Bekmyradov. Evaluating the Effectiveness of LLM-Generated Unit Tests in the Context of Industrial DBMS. Technical Report 1/2026, Fakultät 10 / Institut für Data Science, Engineering, and Analytics, 2026. doi:10.57684/COS-1441.

  2. Laura Inozemtseva and Reid Holmes. Coverage is not strongly correlated with test suite effectiveness. In Proceedings of the 36th International Conference on Software Engineering (ICSE '14), Hyderabad, India, May 31 – June 7, 2014.

  3. Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In The Twelfth International Conference on Learning Representations (ICLR 2024), Vienna, Austria, May 7-11, 2024. OpenReview.net.

  4. Martha Lewis and Melanie Mitchell. Evaluating the robustness of analogical reasoning in large language models, 2024. arXiv:2411.14215.

  5. Marilyn Strathern. 'Improving ratings': audit in the British University system. European Review, 5(3):305–321, 1997.