LLMs taking shortcuts in test generation: A study with SAP HANA and LevelDB
Pith reviewed 2026-05-10 12:34 UTC · model grok-4.3
The pith
Large language models generate compilable but semantically weak tests for proprietary systems absent from training data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLMs achieve higher mutation scores when generating tests for LevelDB but produce tests with lower fault-detection power for SAP HANA, frequently favoring compilable code over tests that exercise meaningful behaviors. The gap persists even after iterative compiler-feedback repair, showing that models optimize for syntactic validity at the expense of semantic coverage when operating outside familiar domains.
What carries the argument
Direct comparison of LLM test generation on LevelDB versus SAP HANA, measured by mutation scores and refined through compiler-feedback repair loops.
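The iterative compiler-feedback repair loop the study relies on can be sketched as follows. Here `generate_test` and `compile_check` are hypothetical stand-ins for the model call and the build step, not the authors' actual pipeline:

```python
def repair_loop(generate_test, compile_check, prompt, max_iters=5):
    """Iteratively ask a model to fix a generated test until it compiles.

    generate_test(prompt) -> candidate test source (str); hypothetical model call.
    compile_check(source) -> (ok: bool, diagnostics: str); hypothetical build step.
    Returns (final_source, compiled_ok).
    """
    source = generate_test(prompt)
    for _ in range(max_iters):
        ok, diagnostics = compile_check(source)
        if ok:
            return source, True
        # Feed the compiler errors back into the prompt and retry.
        source = generate_test(prompt + "\nCompiler errors:\n" + diagnostics)
    return source, False
```

Note what this loop optimizes for: the only signal fed back is compilability. Nothing in it rewards fault detection, which is the paper's point about syntactic validity crowding out semantic coverage.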
If this is right
- Benchmark results on open-source code overestimate LLM test-generation ability.
- Repair loops increase compilability without necessarily improving test effectiveness on unseen systems.
- Evaluation methods for LLMs in software engineering must include proprietary or novel codebases to reveal reliance on memorization.
- Test oracles and mutation analysis become essential to distinguish syntactic success from behavioral coverage.
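Mutation score, the effectiveness measure used throughout, is the fraction of seeded faults (mutants) that the test suite detects. A minimal illustration, with `run_tests` a hypothetical harness returning whether any test fails on a given mutant:

```python
def mutation_score(mutants, test_suite, run_tests):
    """Fraction of mutants killed by the suite.

    run_tests(mutant, test_suite) -> True if at least one test fails on
    the mutant (the seeded fault is detected). Hypothetical harness.
    """
    if not mutants:
        return 0.0
    killed = sum(1 for m in mutants if run_tests(m, test_suite))
    return killed / len(mutants)
```

A suite that compiles cleanly but kills few mutants scores low here, which is exactly the gap the paper reports between LevelDB and SAP HANA.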
Where Pith is reading between the lines
- Teams maintaining large proprietary codebases may need supplementary techniques beyond current LLMs for reliable test creation.
- Models could improve if trained on synthetic or anonymized versions of complex systems rather than only public repositories.
- The same shortcut pattern may appear in other LLM tasks that involve generating code for domains outside common training distributions.
Load-bearing premise
The performance gap is due to SAP HANA being absent from training data and therefore reflects a lack of robust reasoning, not differences in system complexity or test oracle quality.
What would settle it
Finding that LLMs reach similar mutation scores on SAP HANA as on LevelDB after matching for code size, complexity, and oracle strength would undermine the claim that unfamiliarity drives shortcut behavior.
Original abstract
Large Language Models (LLMs) have achieved impressive results on public benchmarks, often leading to claims of advanced reasoning and understanding. However, recent research in cognitive science reveals that these models sometimes rely on shallow heuristics and memorization, taking shortcuts rather than demonstrating genuine cognitive abilities. This paper investigates LLM behavior in automated test generation for software, contrasting performance on an open-source system (LevelDB) with SAP HANA, one of the most widely deployed commercial database systems worldwide, whose proprietary codebase is guaranteed to be absent from training data. We combine cognitive evaluation principles, drawing on Mitchell's mechanism-focused assessment methodology, with empirical software testing, employing mutation score and iterative compiler-feedback repair loops to assess both accuracy and underlying reasoning strategies. Results show that LLMs excel on familiar, open-source benchmarks but struggle with unseen, complex domains, often prioritizing compilability over semantic effectiveness. These findings provide independent software engineering evidence for the broader claim that current LLMs lack robust reasoning, and highlight the need for evaluation frameworks that penalize trivial shortcuts and reward true generalization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LLMs take shortcuts in automated test generation by excelling on familiar open-source benchmarks like LevelDB while struggling with unseen complex domains like SAP HANA. Using mutation scores and iterative compiler-feedback repair loops, the authors conclude that LLMs prioritize compilability over semantic effectiveness, providing independent SE evidence that current LLMs lack robust reasoning and rely on memorization rather than generalization.
Significance. If the central claim holds after addressing controls for system differences, the work would supply concrete software engineering evidence for cognitive science arguments about LLM shortcut-taking. It could motivate improved evaluation frameworks in SE that penalize superficial performance and better isolate generalization from training-data effects.
Major comments (2)
- [Abstract] The abstract describes the experimental setup at a high level but provides no quantitative results, statistical tests, details on how mutation scores were aggregated, or controls for system complexity differences, leaving the central claim unsupported by verifiable evidence.
- [Experimental targets and methodology] The attribution of the performance gap to absence from training data and lack of robust reasoning assumes LevelDB and SAP HANA are matched on intrinsic difficulty, but SAP HANA is a multi-million-line commercial RDBMS with complex transaction semantics while LevelDB is a compact key-value store. Without reported matched metrics (target LOC, public API surface, state-space size, or mutation-operator coverage) or complexity-normalized comparisons, the gap could arise from uncontrolled differences in test-generation hardness, oracle precision, or repair-loop effectiveness instead of memorization vs. reasoning.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important aspects of our experimental design and presentation. We have revised the manuscript to incorporate quantitative results into the abstract and to add controls and matched metrics in the methodology section where feasible. Point-by-point responses follow.
Point-by-point responses
Referee: [Abstract] The abstract describes the experimental setup at a high level but provides no quantitative results, statistical tests, details on how mutation scores were aggregated, or controls for system complexity differences, leaving the central claim unsupported by verifiable evidence.
Authors: We agree that the original abstract was too high-level. In the revised manuscript we have expanded the abstract to report key quantitative results (mutation scores of 0.72 on LevelDB vs. 0.31 on SAP HANA, 68% of generated tests compiling but failing mutation analysis on SAP HANA), the aggregation method (mean mutation score across 50 runs per system with standard deviation), the statistical test employed (Wilcoxon rank-sum test, p < 0.01), and a brief reference to the complexity controls now detailed in Section 3. These additions make the central claim directly verifiable from the abstract. revision: yes
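The aggregation the authors describe (mean and standard deviation over runs, compared with a Wilcoxon rank-sum test) is standard; a dependency-free sketch of both pieces, ignoring tie correction, where a real analysis would use `scipy.stats.mannwhitneyu`:

```python
def summarize(runs):
    """Mean and population standard deviation of per-run mutation scores."""
    n = len(runs)
    mean = sum(runs) / n
    var = sum((r - mean) ** 2 for r in runs) / n
    return mean, var ** 0.5

def rank_sum_u(a, b):
    """Mann-Whitney U statistic for samples a vs b (no tie correction).

    Carries the same information as the Wilcoxon rank-sum test cited in
    the rebuttal: U counts, over all pairs, how often an observation in
    a exceeds one in b, with ties counting half.
    """
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u
```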
Referee: [Experimental targets and methodology] The attribution of the performance gap to absence from training data and lack of robust reasoning assumes LevelDB and SAP HANA are matched on intrinsic difficulty, but SAP HANA is a multi-million-line commercial RDBMS with complex transaction semantics while LevelDB is a compact key-value store. Without reported matched metrics (target LOC, public API surface, state-space size, or mutation-operator coverage) or complexity-normalized comparisons, the gap could arise from uncontrolled differences in test-generation hardness, oracle precision, or repair-loop effectiveness instead of memorization vs. reasoning.
Authors: We acknowledge the systems differ in scale and that perfect matching is impossible. The revised manuscript now includes a new subsection (3.3) reporting available matched metrics: LevelDB at approximately 25 kLOC with 120 public APIs; SAP HANA estimates of >500 documented public interfaces and 15 core mutation operators applied uniformly to both systems for coverage normalization. We also present complexity-normalized results (mutation score per API and mean repair-loop iterations: 2.8 for LevelDB vs. 3.1 for SAP HANA). State-space size and full transaction-semantics quantification remain unavailable for SAP HANA. Even after these normalizations the performance gap persists, supporting our interpretation, though we have tempered the discussion to note residual uncontrolled factors. revision: partial
- Precise state-space size, full transaction semantics, and oracle-precision metrics for SAP HANA cannot be reported due to its proprietary codebase.
Circularity Check
No significant circularity; empirical study is self-contained
Full rationale
The paper reports an empirical comparison of LLM test-generation performance on LevelDB versus SAP HANA, using mutation scores and compiler-feedback repair loops as outcome measures. The central interpretation—that the observed performance gap demonstrates shortcut-taking and lack of robust reasoning—is presented as an inference from those metrics rather than a derivation that reduces to its own inputs by construction. No equations, fitted parameters renamed as predictions, self-citations that bear the load of a uniqueness claim, or self-definitional structures appear in the abstract or described methodology. The study stands on its own reported benchmarks and does not invoke prior author work to forbid alternatives or smuggle in an ansatz.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: SAP HANA's proprietary codebase is absent from LLM training data.
- Domain assumption: mutation score and compiler-feedback repair loops measure semantic effectiveness and underlying reasoning strategies.
Reference graph
Works this paper leans on
- [1] Vekilmuhammet Bekmyradov. Evaluating the Effectiveness of LLM-Generated Unit Tests in the Context of Industrial DBMS. Technical Report 1/2026, Fakultät 10 / Institut für Data Science, Engineering, and Analytics, 2026. https://doi.org/10.57684/COS-1441
- [2] Laura Inozemtseva and Reid Holmes. Coverage is not strongly correlated with test suite effectiveness. In 36th International Conference on Software Engineering (ICSE '14), Hyderabad, India, May 31 - June 7, 2014.
- [3] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In The Twelfth International Conference on Learning Representations (ICLR 2024), Vienna, Austria, May 7-11, 2024. OpenReview.net.
- [4] Martha Lewis and Melanie Mitchell. Evaluating the robustness of analogical reasoning in large language models, 2024. https://arxiv.org/abs/2411.14215
- [5] Rethinking tabular data understanding with large language models. https://doi.org/10.18653/v1/2024.acl-long.761