LLM-Based Robustness Testing of Microservice Applications: An Empirical Study
Pith reviewed 2026-05-15 04:37 UTC · model grok-4.3
The pith
Prompt strategy explains more variation in test diversity than model size when using LLMs for microservice robustness testing.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Prompt strategy is the dominant factor determining the diversity and coverage of failure modes when LLMs generate robustness tests from API specifications. GuidedFewShot, which supplies a mutation taxonomy as domain knowledge together with concrete few-shot examples, reached the highest single-run coverage (five of nine modes on the first system and eight of fourteen on the second). Varying three prompts on one model achieved full coverage on one system and outperformed any fixed-prompt multi-model ensemble. A purely structured prompt eliminated diversity entirely. The key lesson is that taxonomy rules alone are not enough: LLMs need explicit examples to distinguish mutations such as key-absent (the field is removed) from value-empty (the field is present but blank).
What carries the argument
GuidedFewShot prompt strategy that embeds prior robustness-testing mutation taxonomy plus concrete examples to guide LLM test generation.
If this is right
- A single model varied across three prompt strategies can reach complete failure-mode coverage on some systems.
- Multi-model ensembles under one fixed prompt cover fewer modes than prompt variation on one model.
- GuidedFewShot maintains low similarity across models while delivering top coverage.
- Taxonomy rules by themselves fail to help LLMs distinguish key-absent from value-empty mutations.
- Results replicate on both a monolingual Java system and a polyglot system.
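The key-absent versus value-empty distinction is concrete enough to illustrate. A minimal sketch of the two mutations applied to a JSON request body (the endpoint and field names are hypothetical, not taken from the studied systems):

```python
import json

# Baseline request body for a hypothetical /orders endpoint.
base = {"customerId": "c-42", "quantity": 3}

# Key-absent mutation: the field is removed entirely, so the server
# sees no "quantity" key at all.
key_absent = {k: v for k, v in base.items() if k != "quantity"}

# Value-empty mutation: the key is still present, but its value is an
# empty string, which may pass presence checks yet fail parsing.
value_empty = {**base, "quantity": ""}

print(json.dumps(key_absent))   # {"customerId": "c-42"}
print(json.dumps(value_empty))  # {"customerId": "c-42", "quantity": ""}
```

A server that only checks `"quantity" in body` treats the two payloads differently, which is why the paper argues that taxonomy rules without examples leave the distinction ambiguous to an LLM.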
Where Pith is reading between the lines
- Investing effort in prompt design may deliver larger gains than scaling model size for other LLM-based testing tasks.
- The approach could be extended to additional testing goals such as performance or security by swapping the mutation taxonomy.
- Using one well-prompted smaller model might lower cost compared with calling multiple large models for the same test budget.
- Combining the best prompt with the largest model could be tested as a next step to check for further coverage gains.
Load-bearing premise
The nine and fourteen listed failure modes capture the full set of robustness problems in the two chosen systems, and the 38 valid runs with their 663 generated tests are enough to show that the observed differences come from the prompts rather than from unmeasured factors.
What would settle it
A new experiment on the same two systems in which a fixed prompt on a larger model produces equal or higher failure-mode diversity than GuidedFewShot across multiple runs would falsify the dominance of prompt strategy.
Original abstract
Malformed, missing, or boundary-value inputs in microservice APIs can cascade across dependent services, threatening reliability. Robustness testing systematically exercises such inputs to expose server-side failures, but generating diverse, effective tests remains challenging. Large Language Models can generate such tests from API specifications; however, it is unknown whether different models and prompt strategies produce diverse failure sets or converge on the same failures. We report a controlled experiment applying 7 prompt strategies to 3 open-source LLMs (14B-70B parameters) targeting 2 architecturally distinct microservice systems: one Java monolingual (6 services, 9 failure modes) and one polyglot (27 services, 14 failure modes), yielding 38 valid runs and 663 generated tests. We find that prompt strategy explains more variation in diversity than model size: a Structured prompt collapses diversity entirely, while a single model varied across three prompt strategies achieves complete failure-mode coverage on one system, outperforming any multi-model ensemble under a fixed prompt. We introduce two strategies, Guided and GuidedFewShot, that embed a mutation taxonomy from prior robustness testing research as domain context. GuidedFewShot achieves the highest single-run coverage on both systems (5 of 9 and 8 of 14 failure modes) while maintaining low cross-model similarity. A key lesson is that taxonomy rules alone are insufficient: LLMs cannot distinguish key-absent from value-empty mutations without concrete examples. Findings replicate across both systems.
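The abstract's Guided/GuidedFewShot distinction amounts to adding concrete examples on top of a taxonomy. A minimal sketch of how such a prompt might be assembled (the taxonomy entries, example mutations, and function are illustrative assumptions, not the paper's actual prompt text):

```python
TAXONOMY = [
    "key-absent: remove a required field from the request body",
    "value-empty: keep the field but set its value to an empty string",
    "type-mismatch: replace a numeric value with a string",
]

# Few-shot examples are what separates GuidedFewShot from Guided.
FEW_SHOT = [
    ("key-absent", '{"quantity": 3}', "{}"),
    ("value-empty", '{"quantity": 3}', '{"quantity": ""}'),
]

def build_prompt(api_spec: str, few_shot: bool = True) -> str:
    """Assemble a robustness-test prompt from a taxonomy and optional examples."""
    parts = ["Generate robustness tests for this API.", "Mutation taxonomy:"]
    parts += [f"- {rule}" for rule in TAXONOMY]
    if few_shot:
        parts.append("Examples:")
        parts += [f"{name}: {before} -> {after}" for name, before, after in FEW_SHOT]
    parts.append(f"API specification:\n{api_spec}")
    return "\n".join(parts)
```

Under these assumptions, `build_prompt(spec, few_shot=False)` corresponds to the Guided variant and the default to GuidedFewShot; the study's finding is that the two example lines, not the three taxonomy rules, are what let models separate key-absent from value-empty.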
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports a controlled experiment comparing 7 prompt strategies across 3 LLMs (14B–70B parameters) for generating robustness tests from API specifications in two microservice systems: a Java monolingual system (6 services, 9 failure modes) and a polyglot system (27 services, 14 failure modes). The 38 valid runs, comprising 663 generated tests, show that prompt strategy accounts for more variation in test diversity and failure-mode coverage than model size. A Structured prompt collapses diversity, while GuidedFewShot (which embeds a mutation taxonomy) achieves the highest single-run coverage (5/9 and 8/14 modes); one model varied across three prompts reaches complete coverage on one system, outperforming multi-model ensembles under fixed prompts. The study concludes that concrete examples are required for LLMs to apply taxonomy rules correctly and that findings replicate across both systems.
Significance. If the experimental controls hold, the work provides concrete, actionable evidence that prompt engineering incorporating domain taxonomies can outperform model scaling for automated robustness testing. The replication across architecturally distinct systems and the identification of specific prompt weaknesses (e.g., Structured collapsing diversity) strengthen its practical value for software engineering practitioners selecting LLM configurations for testing.
major comments (3)
- [Methodology] Methodology section (failure-mode identification): The derivation of the fixed sets of 9 and 14 failure modes is not described. It is unclear whether these lists were established independently from the systems' API specifications and prior robustness literature before any LLM runs, or whether they were influenced by initial generations. This is load-bearing for all coverage claims, including complete coverage by one model across three prompts.
- [Results] Results section (variation analysis): The central claim that prompt strategy explains more variation in diversity than model size is stated without supporting statistical decomposition (e.g., variance partitioning, permutation test, or mixed-effects model) that accounts for run-level noise. With only 38 valid runs, observed differences could be driven by a small number of high-leverage executions rather than strategy per se.
- [Experimental Procedure] Experimental procedure (test validity): Criteria for classifying a run as valid (38 valid runs were retained, yielding 663 tests), including the definition of server-side failure versus invalid test, are not specified, nor is any inter-rater reliability, blinding, or consistency check reported. This directly affects the reliability of the coverage and diversity metrics.
minor comments (2)
- [Abstract] Abstract: Add one sentence on how validity was assessed and whether any statistical comparison of prompt versus model effects was performed.
- [Discussion] Discussion: The observation that LLMs cannot distinguish key-absent from value-empty mutations without examples would be strengthened by quoting one concrete generated test that illustrates the confusion.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below and indicate the revisions we will incorporate to improve clarity and rigor.
Point-by-point responses
Referee: [Methodology] Methodology section (failure-mode identification): The derivation of the fixed sets of 9 and 14 failure modes is not described. It is unclear whether these lists were established independently from the systems' API specifications and prior robustness literature before any LLM runs, or whether they were influenced by initial generations. This is load-bearing for all coverage claims, including complete coverage by one model across three prompts.
Authors: The failure-mode sets were derived independently before any LLM experiments began. We combined established robustness testing literature (boundary-value analysis, invalid-input mutations from prior API testing studies) with a manual inspection of each system's API specifications to identify relevant modes such as missing keys, type mismatches, and empty values. No LLM outputs were consulted during this process. We will add a dedicated subsection in the Methodology section that explicitly lists the literature sources, the mapping procedure, and the resulting 9- and 14-mode sets for each system. This addition will make the a-priori nature of the targets transparent and support all coverage claims. revision: yes
Referee: [Results] Results section (variation analysis): The central claim that prompt strategy explains more variation in diversity than model size is stated without supporting statistical decomposition (e.g., variance partitioning, permutation test, or mixed-effects model) that accounts for run-level noise. With only 38 valid runs, observed differences could be driven by a small number of high-leverage executions rather than strategy per se.
Authors: We agree that a formal statistical decomposition would strengthen the claim. With only 38 valid runs we performed direct metric comparisons rather than complex modeling. In the revision we will add a permutation test comparing diversity distributions across prompt strategies versus model sizes, together with a brief discussion of the small-sample limitation. The patterns remain consistent across both independent systems, but we accept that the added test will better quantify the relative contributions and address concerns about high-leverage runs. revision: partial
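The permutation test the authors commit to can be sketched in a few lines. A minimal version, assuming per-run diversity scores labeled by prompt strategy (or, in a second pass, by model); the scores and labels below are synthetic, for illustration only:

```python
import random

def between_group_spread(scores, labels):
    """Range of group means: how much a factor's levels differ."""
    groups = {}
    for score, label in zip(scores, labels):
        groups.setdefault(label, []).append(score)
    means = [sum(v) / len(v) for v in groups.values()]
    return max(means) - min(means)

def permutation_pvalue(scores, labels, n_perm=2000, seed=0):
    """Share of random label shuffles with spread >= the observed spread."""
    rng = random.Random(seed)
    observed = between_group_spread(scores, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if between_group_spread(scores, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction

# Synthetic example: three runs per prompt strategy.
scores = [0.9, 0.8, 0.85, 0.1, 0.2, 0.15]
labels = ["prompt-A"] * 3 + ["prompt-B"] * 3
print(permutation_pvalue(scores, labels))
```

Running it once with prompt-strategy labels and once with model labels over the same 38 diversity scores would quantify which factor's grouping is harder to explain by chance, directly addressing the referee's high-leverage-run concern.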
Referee: [Experimental Procedure] Experimental procedure (test validity): Criteria for classifying a run as valid (38 valid runs were retained, yielding 663 tests), including the definition of server-side failure versus invalid test, are not specified, nor is any inter-rater reliability, blinding, or consistency check reported. This directly affects the reliability of the coverage and diversity metrics.
Authors: We will expand the Experimental Procedure section to define validity explicitly: a generated test is retained as valid only if (1) it executes without syntax or runtime errors in the request itself and (2) it elicits a server-side failure response (non-2xx status that is not a standard client error). Tests that fail to parse or produce only client-side errors are discarded. Classification followed a written checklist; one author performed the initial labeling with a second author reviewing a random 20% sample for consistency. We will report this procedure and the resulting agreement rate, while acknowledging the absence of full blinding as a limitation. revision: yes
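The validity rule described in this response can be made precise. A sketch over plain HTTP status codes, reading "server-side failure" literally as the rebuttal's "non-2xx status that is not a standard client error" (the function name and labels are ours, not the authors'):

```python
def classify_test(request_ok: bool, status_code: int) -> str:
    """Classify a generated robustness test by the rebuttal's two criteria.

    request_ok: the generated request parsed and executed without syntax
    or runtime errors on the client side.
    """
    if not request_ok:
        return "discard:malformed-request"   # criterion (1) failed
    if 200 <= status_code < 300:
        return "discard:no-failure"          # service handled the input
    if 400 <= status_code < 500:
        return "discard:client-error"        # standard client-side rejection
    return "valid:server-side-failure"       # typically 5xx; counts toward coverage
```

Note that a literal reading admits 3xx responses as "valid"; in practice the interesting retained cases are 5xx, which is presumably what the written checklist disambiguates.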
Circularity Check
No circularity: purely empirical comparison with independent experimental runs
Full rationale
The paper is a controlled empirical study that applies 7 prompt strategies to 3 LLMs on 2 external microservice systems, measures coverage against pre-identified failure modes (9 and 14), and reports diversity and validity from 663 generated tests. No derivations, equations, fitted parameters, or self-citation chains reduce any result to its inputs by construction; all outcomes derive from independent runs on external systems and failure-mode lists. The taxonomy embedding is cited as domain context from prior research and does not create a self-definitional loop. This is a standard non-circular empirical design.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The two architecturally distinct microservice systems (Java monolingual with 6 services and polyglot with 27 services) and their listed failure modes (9 and 14) are representative for evaluating LLM robustness test generation.
- domain assumption The 38 valid runs and generated tests accurately reflect the diversity and coverage properties of the prompt strategies without significant unmeasured bias in execution or validation.
Reference graph
Works this paper leans on
- [1] M. Vieira, N. Laranjeiro, and H. Madeira, "Assessing robustness of web-services infrastructures," in Proc. IEEE/IFIP Int. Conf. Dependable Systems and Networks (DSN), 2007, pp. 131–136.
- [2] N. P. Kropp, P. J. Koopman, and D. P. Siewiorek, "Automated robustness testing of off-the-shelf software components," in Proc. IEEE Int. Symp. Fault-Tolerant Computing (FTCS), 1998, pp. 230–239.
- [3] M. Kim, Q. Xin, S. Sinha, and A. Orso, "Leveraging large language models to improve REST API testing," in Proc. IEEE/ACM Int. Conf. Software Engineering: New Ideas and Emerging Results (ICSE-NIER), 2024.
- [4] M. Vieira, B. Shah, P. A. Shah, and V. Khadloya, "TestForge: A benchmarking framework for LLM-based test case generation," in Proc. IEEE Int. Conf. Software Analysis, Evolution and Reengineering (SANER), 2026, to appear.
- [5] G. Fraser and A. Arcuri, "EvoSuite: Automatic test suite generation for object-oriented software," in Proc. ACM SIGSOFT Symp. Foundations of Software Engineering (FSE), 2011, pp. 416–419.
- [6] N. Laranjeiro, S. Canelas, and M. Vieira, "wsrbench: An on-line tool for robustness benchmarking," in Proc. IEEE Int. Conf. Services Computing (SCC), 2008, pp. 187–194.
- [7] N. Laranjeiro, J. Agnelo, and J. Bernardino, "A black box tool for robustness testing of REST services," IEEE Access, vol. 9, pp. 24738–24754, 2021.
- [8] V. Atlidakis, P. Godefroid, and M. Polishchuk, "RESTler: Stateful REST API fuzzing," in Proc. IEEE/ACM Int. Conf. Software Engineering (ICSE), 2019, pp. 748–758.
- [9] A. Arcuri, "RESTful API automated test case generation with EvoMaster," ACM Trans. Software Engineering and Methodology, vol. 28, no. 1, pp. 1–37, 2019.
- [10] M. Zhang and A. Arcuri, "Open problems in fuzzing RESTful APIs: A comparison of tools," ACM Trans. Software Engineering and Methodology, vol. 32, no. 6, pp. 1–45, 2023.
- [11] C. Pacheco and M. D. Ernst, "Randoop: Feedback-directed random testing for Java," in Companion to OOPSLA, 2007, pp. 815–816.
- [12] T. Brown, B. Mann, N. Ryder, M. Subbiah et al., "Language models are few-shot learners," in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020.
- [13] J. Wei, X. Wang, D. Schuurmans, M. Bosma et al., "Chain-of-thought prompting elicits reasoning in large language models," in Advances in Neural Information Processing Systems (NeurIPS), vol. 35, 2022.
- [14] A. Madaan, N. Tandon, P. Gupta et al., "Self-refine: Iterative refinement with self-feedback," in Advances in Neural Information Processing Systems (NeurIPS), vol. 36, 2023.
- [15] J. von Kistowski, S. Eismann, N. Schmitt, A. Bauer, J. Grohmann, and S. Kounev, "TeaStore: A micro-service reference application for benchmarking, modeling and resource management research," in Proc. IEEE Int. Symp. Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), 2018, pp. 223–236.
- [16] OpenTelemetry, "OpenTelemetry Demo (Astronomy Shop)," https://github.com/open-telemetry/opentelemetry-demo, 2024, accessed 2025.
- [17] Y. Song, G. Wang, S. Li, and B. Y. Lin, "The good, the bad, and the greedy: Evaluation of LLMs should not ignore non-determinism," in Proc. Conf. Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL), 2025, pp. 4195–4206.