pith. machine review for the scientific record.

arxiv: 2605.14202 · v1 · submitted 2026-05-13 · 💻 cs.SE · cs.AI

Recognition: no theorem link

LLM-Based Robustness Testing of Microservice Applications: An Empirical Study

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 04:37 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords LLM · robustness testing · microservices · prompt engineering · failure modes · test generation · empirical study · API testing

The pith

Prompt strategy explains more variation in test diversity than model size when using LLMs for microservice robustness testing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether LLMs can generate diverse robustness tests for microservice APIs by running seven prompt strategies on three models against two real systems. The results show that changing the prompt structure produces bigger differences in the variety of failures found than switching to a larger model. One new prompt called GuidedFewShot, which adds a mutation taxonomy plus examples, covered the most failure modes in single runs on both systems. A rigid structured prompt produced almost no variety in the tests generated. The pattern held on a small Java system with nine failure modes and a larger polyglot system with fourteen.

Core claim

Prompt strategy is the dominant factor determining the diversity and coverage of failure modes when LLMs generate robustness tests from API specifications. GuidedFewShot, which supplies a mutation taxonomy as domain knowledge together with concrete few-shot examples, reached the highest single-run coverage (five of nine modes on the first system and eight of fourteen on the second). Varying three prompts on one model achieved full coverage on one system and outperformed any fixed-prompt multi-model ensemble. A purely structured prompt eliminated diversity entirely. The key lesson is that taxonomy rules alone are not enough; LLMs need explicit examples to distinguish mutations such as a missing key (key-absent) from an empty value (value-empty).
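To make that last distinction concrete, here is a minimal illustrative sketch; the payload and field names are hypothetical, not drawn from the paper.

```python
# Hypothetical JSON payload for an order-creation request.
baseline = {"product_id": 42, "quantity": 1, "coupon": "SPRING"}

# Key-absent mutation: the field is removed from the payload entirely.
key_absent = {k: v for k, v in baseline.items() if k != "coupon"}
# -> {"product_id": 42, "quantity": 1}

# Value-empty mutation: the key stays but carries an empty value.
value_empty = {**baseline, "coupon": ""}
# -> {"product_id": 42, "quantity": 1, "coupon": ""}

# A service may validate these paths differently (missing-field error vs.
# empty-string handling), so a test suite that conflates the two mutations
# can only ever probe one of the corresponding failure modes.
```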

What carries the argument

The GuidedFewShot prompt strategy, which embeds a mutation taxonomy from prior robustness-testing research together with concrete examples to guide LLM test generation.
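As a rough sketch of how such a prompt might be assembled programmatically; the taxonomy entries, example text, and instructions below are placeholders, not the paper's actual prompt.

```python
def build_guided_fewshot_prompt(api_spec: str,
                                taxonomy: list[str],
                                examples: list[str]) -> str:
    """Assemble a GuidedFewShot-style prompt: mutation taxonomy + worked examples.

    taxonomy: mutation categories to cover (e.g. key-absent, value-empty,
              type mismatch, boundary value).
    examples: concrete request mutations showing how each category is applied.
    Both arguments are illustrative stand-ins for whatever the paper embeds.
    """
    taxonomy_block = "\n".join(f"- {rule}" for rule in taxonomy)
    examples_block = "\n\n".join(examples)
    return (
        "You are generating robustness tests for a microservice API.\n\n"
        f"API specification:\n{api_spec}\n\n"
        f"Mutation taxonomy to cover:\n{taxonomy_block}\n\n"
        f"Worked examples:\n{examples_block}\n\n"
        "Generate tests that exercise every taxonomy category, treating a "
        "missing key and an empty value as distinct mutations."
    )
```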

If this is right

  • A single model varied across three prompt strategies can reach complete failure-mode coverage on some systems (see the coverage sketch after this list).
  • Multi-model ensembles under one fixed prompt cover fewer modes than prompt variation on one model.
  • GuidedFewShot maintains low similarity across models while delivering top coverage.
  • Taxonomy rules by themselves fail to help LLMs distinguish key-absent from value-empty mutations.
  • Results replicate on both a monolingual Java system and a polyglot system.
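A toy sketch of the coverage comparison behind the first two points, assuming each run is reduced to the set of failure modes it exposed; the model names and all mode sets below are invented for illustration, only the prompt names come from the paper.

```python
# Invented per-run results: (model, prompt) -> set of failure modes exposed.
runs = {
    ("model-a", "ZeroShot"):      {"M1", "M2"},
    ("model-a", "Guided"):        {"M2", "M3", "M4"},
    ("model-a", "GuidedFewShot"): {"M1", "M4", "M5"},
    ("model-b", "ZeroShot"):      {"M1", "M2"},
    ("model-c", "ZeroShot"):      {"M2", "M3"},
}

# One model varied across prompt strategies: union of its runs' modes.
prompt_varied = set().union(*(m for (model, _), m in runs.items() if model == "model-a"))

# Multi-model ensemble under one fixed prompt: union across models.
fixed_prompt = set().union(*(m for (_, prompt), m in runs.items() if prompt == "ZeroShot"))

print(sorted(prompt_varied))  # ['M1', 'M2', 'M3', 'M4', 'M5']  -> broader coverage
print(sorted(fixed_prompt))   # ['M1', 'M2', 'M3']
```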

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Investing effort in prompt design may deliver larger gains than scaling model size for other LLM-based testing tasks.
  • The approach could be extended to additional testing goals such as performance or security by swapping the mutation taxonomy.
  • Using one well-prompted smaller model might lower cost compared with calling multiple large models for the same test budget.
  • Combining the best prompt with the largest model could be tested as a next step to check for further coverage gains.

Load-bearing premise

The nine and fourteen listed failure modes capture the full set of robustness problems in the two chosen systems, and the 38 valid runs with their 663 generated tests are enough to show that differences come from the prompts rather than other unmeasured factors.

What would settle it

A new experiment on the same two systems in which a fixed prompt on a larger model produces equal or higher failure-mode diversity than GuidedFewShot across multiple runs would falsify the dominance of prompt strategy.

Figures

Figures reproduced from arXiv: 2605.14202 by Hrushitha Goud Tigulla, Marco Vieira.

Figure 1
Figure 1. Experimental pipeline. Excluded runs: qwen3:14b under ZeroShot on both SUTs (the model switched to a non-English language mid-generation, producing unusable output), qwen3:14b under Self-Refine on OTel (malformed output that could not be parsed), and llama3.1:70b under Structured on OTel (repeated system crashes during generation). Excluded runs were not retried, consistent with the single-execution design.
read the original abstract

Malformed, missing, or boundary-value inputs in microservice APIs can cascade across dependent services, threatening reliability. Robustness testing systematically exercises such inputs to expose server-side failures, but generating diverse, effective tests remains challenging. Large Language Models can generate such tests from API specifications; however, it is unknown whether different models and prompt strategies produce diverse failure sets or converge on the same failures. We report a controlled experiment applying 7 prompt strategies to 3 open-source LLMs (14B-70B parameters) targeting 2 architecturally distinct microservice systems: one Java monolingual (6 services, 9 failure modes) and one polyglot (27 services, 14 failure modes), yielding 38 valid runs and 663 generated tests. We find that prompt strategy explains more variation in diversity than model size: a Structured prompt collapses diversity entirely, while a single model varied across three prompt strategies achieves complete failure-mode coverage on one system, outperforming any multi-model ensemble under a fixed prompt. We introduce two strategies, Guided and GuidedFewShot, that embed a mutation taxonomy from prior robustness testing research as domain context. GuidedFewShot achieves the highest single-run coverage on both systems (5 of 9 and 8 of 14 failure modes) while maintaining low cross-model similarity. A key lesson is that taxonomy rules alone are insufficient: LLMs cannot distinguish key-absent from value-empty mutations without concrete examples. Findings replicate across both systems.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript reports a controlled experiment comparing 7 prompt strategies across 3 LLMs (14B–70B parameters) for generating robustness tests from API specifications in two microservice systems: a Java monolingual system (6 services, 9 failure modes) and a polyglot system (27 services, 14 failure modes). The design yields 38 valid runs and 663 generated tests; their analysis shows that prompt strategy accounts for more variation in test diversity and failure-mode coverage than model size. A Structured prompt collapses diversity, while GuidedFewShot (which embeds a mutation taxonomy) achieves the highest single-run coverage (5/9 and 8/14 modes); one model varied across three prompts reaches complete coverage on one system, outperforming multi-model ensembles under fixed prompts. The study concludes that concrete examples are required for LLMs to apply taxonomy rules correctly and that findings replicate across both systems.

Significance. If the experimental controls hold, the work provides concrete, actionable evidence that prompt engineering incorporating domain taxonomies can outperform model scaling for automated robustness testing. The replication across architecturally distinct systems and the identification of specific prompt weaknesses (e.g., Structured collapsing diversity) strengthen its practical value for software engineering practitioners selecting LLM configurations for testing.

major comments (3)
  1. [Methodology] Methodology section (failure-mode identification): The derivation of the fixed sets of 9 and 14 failure modes is not described. It is unclear whether these lists were established independently from the systems' API specifications and prior robustness literature before any LLM runs, or whether they were influenced by initial generations. This is load-bearing for all coverage claims, including complete coverage by one model across three prompts.
  2. [Results] Results section (variation analysis): The central claim that prompt strategy explains more variation in diversity than model size is stated without supporting statistical decomposition (e.g., variance partitioning, permutation test, or mixed-effects model) that accounts for run-level noise. With only 38 valid runs, observed differences could be driven by a small number of high-leverage executions rather than strategy per se.
  3. [Experimental Procedure] Experimental procedure (test validity): Criteria for classifying a run as valid (38 of the executed runs are retained) and for judging whether each of the 663 generated tests exposes a genuine failure (e.g., definition of server-side failure versus invalid test) are not specified, nor is any inter-rater reliability, blinding, or consistency check reported. This directly affects the reliability of the coverage and diversity metrics.
minor comments (2)
  1. [Abstract] Abstract: Add one sentence on how validity was assessed and whether any statistical comparison of prompt versus model effects was performed.
  2. [Discussion] Discussion: The observation that LLMs cannot distinguish key-absent from value-empty mutations without examples would be strengthened by quoting one concrete generated test that illustrates the confusion.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and indicate the revisions we will incorporate to improve clarity and rigor.

read point-by-point responses
  1. Referee: [Methodology] Methodology section (failure-mode identification): The derivation of the fixed sets of 9 and 14 failure modes is not described. It is unclear whether these lists were established independently from the systems' API specifications and prior robustness literature before any LLM runs, or whether they were influenced by initial generations. This is load-bearing for all coverage claims, including complete coverage by one model across three prompts.

    Authors: The failure-mode sets were derived independently before any LLM experiments began. We combined established robustness testing literature (boundary-value analysis, invalid-input mutations from prior API testing studies) with a manual inspection of each system's API specifications to identify relevant modes such as missing keys, type mismatches, and empty values. No LLM outputs were consulted during this process. We will add a dedicated subsection in the Methodology section that explicitly lists the literature sources, the mapping procedure, and the resulting 9- and 14-mode sets for each system. This addition will make the a-priori nature of the targets transparent and support all coverage claims. revision: yes

  2. Referee: [Results] Results section (variation analysis): The central claim that prompt strategy explains more variation in diversity than model size is stated without supporting statistical decomposition (e.g., variance partitioning, permutation test, or mixed-effects model) that accounts for run-level noise. With only 38 valid runs, observed differences could be driven by a small number of high-leverage executions rather than strategy per se.

    Authors: We agree that a formal statistical decomposition would strengthen the claim. With only 38 valid runs we performed direct metric comparisons rather than complex modeling. In the revision we will add a permutation test comparing diversity distributions across prompt strategies versus model sizes, together with a brief discussion of the small-sample limitation. The patterns remain consistent across both independent systems, but we accept that the added test will better quantify the relative contributions and address concerns about high-leverage runs. revision: partial
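A sketch of the kind of permutation test described here, assuming each of the 38 valid runs is summarized by one diversity score; the data, grouping, and test statistic below are placeholders, not the paper's analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

def permutation_gap(scores, labels, n_perm=10_000):
    """p-value for the spread (max - min) of group-mean diversity scores.

    labels assigns each run to a level of one factor (prompt strategy or
    model); a small p-value means the observed between-group spread is
    unlikely under random reassignment of runs to levels.
    """
    def spread(lbls):
        means = [scores[lbls == g].mean() for g in np.unique(lbls)]
        return max(means) - min(means)

    observed = spread(labels)
    exceed = sum(spread(rng.permutation(labels)) >= observed for _ in range(n_perm))
    return (exceed + 1) / (n_perm + 1)

# Placeholder run-level data: 38 diversity scores, 7 prompt levels, 3 model levels.
diversity = rng.normal(size=38)
p_prompt = permutation_gap(diversity, rng.integers(0, 7, size=38))
p_model = permutation_gap(diversity, rng.integers(0, 3, size=38))
```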

  3. Referee: [Experimental Procedure] Experimental procedure (test validity): Criteria for classifying a run as valid (38 of the executed runs are retained) and for judging whether each of the 663 generated tests exposes a genuine failure (e.g., definition of server-side failure versus invalid test) are not specified, nor is any inter-rater reliability, blinding, or consistency check reported. This directly affects the reliability of the coverage and diversity metrics.

    Authors: We will expand the Experimental Procedure section to define validity explicitly: a generated test is retained as valid only if (1) it executes without syntax or runtime errors in the request itself and (2) it elicits a server-side failure response (non-2xx status that is not a standard client error). Tests that fail to parse or produce only client-side errors are discarded. Classification followed a written checklist; one author performed the initial labeling with a second author reviewing a random 20% sample for consistency. We will report this procedure and the resulting agreement rate, while acknowledging the absence of full blinding as a limitation. revision: yes
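A minimal sketch of the retention rule the authors describe; the mapping of "server-side failure" to 5xx status codes is our assumption, not the paper's exact definition.

```python
def is_valid_robustness_test(request_ok: bool, status_code: int) -> bool:
    """Retention rule as sketched in the rebuttal (our reading, not verbatim).

    request_ok: the generated test ran with no syntax or runtime error in the
    request itself. A retained test must then elicit a server-side failure,
    interpreted here as a 5xx status (non-2xx and not a standard 4xx client
    error); this mapping to status-code ranges is an assumption.
    """
    if not request_ok:
        return False                  # malformed or crashing test: discard
    if 200 <= status_code < 300:
        return False                  # accepted request: no failure exposed
    if 400 <= status_code < 500:
        return False                  # standard client error: not server-side
    return 500 <= status_code < 600   # server-side failure: retain
```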

Circularity Check

0 steps flagged

No circularity: purely empirical comparison with independent experimental runs

full rationale

The paper is a controlled empirical study that applies 7 prompt strategies to 3 LLMs on 2 external microservice systems, measures coverage against pre-identified failure modes (9 and 14), and reports diversity and validity from 663 generated tests. No derivations, equations, fitted parameters, or self-citation chains reduce any result to its inputs by construction; all outcomes derive from independent runs on external systems and failure-mode lists. The taxonomy embedding is cited as domain context from prior research and does not create a self-definitional loop. This is a standard non-circular empirical design.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on assumptions about the representativeness of the two chosen systems and their failure modes, plus the validity of LLM outputs as robustness tests; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption The two architecturally distinct microservice systems (Java monolingual with 6 services and polyglot with 27 services) and their listed failure modes (9 and 14) are representative for evaluating LLM robustness test generation.
    Invoked to support generalization of findings on prompt effectiveness and replication across systems.
  • domain assumption The 38 valid runs and generated tests accurately reflect the diversity and coverage properties of the prompt strategies without significant unmeasured bias in execution or validation.
    Basis for claims about variation explained by prompt strategy and superiority of GuidedFewShot.

pith-pipeline@v0.9.0 · 5559 in / 1585 out tokens · 67172 ms · 2026-05-15T04:37:53.738763+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages

  1. [1] M. Vieira, N. Laranjeiro, and H. Madeira, “Assessing robustness of web-services infrastructures,” in Proc. IEEE/IFIP Int. Conf. Dependable Systems and Networks (DSN). IEEE, 2007, pp. 131–136.

  2. [2] N. P. Kropp, P. J. Koopman, and D. P. Siewiorek, “Automated robustness testing of off-the-shelf software components,” in Proc. IEEE Int. Symp. Fault-Tolerant Computing (FTCS). IEEE, 1998, pp. 230–239.

  3. [3] M. Kim, Q. Xin, S. Sinha, and A. Orso, “Leveraging large language models to improve REST API testing,” in Proc. IEEE/ACM Int. Conf. Software Engineering: New Ideas and Emerging Results (ICSE-NIER). IEEE, 2024.

  4. [4] M. Vieira, B. Shah, P. A. Shah, and V. Khadloya, “TestForge: A benchmarking framework for LLM-based test case generation,” in Proc. IEEE Int. Conf. Software Analysis, Evolution and Reengineering (SANER), 2026, to appear.

  5. [5] G. Fraser and A. Arcuri, “EvoSuite: Automatic test suite generation for object-oriented software,” in Proc. ACM SIGSOFT Symp. Foundations of Software Engineering (FSE). ACM, 2011, pp. 416–419.

  6. [6] N. Laranjeiro, S. Canelas, and M. Vieira, “wsrbench: An on-line tool for robustness benchmarking,” in Proc. IEEE Int. Conf. Services Computing (SCC). IEEE, 2008, pp. 187–194.

  7. [7] N. Laranjeiro, J. Agnelo, and J. Bernardino, “A black box tool for robustness testing of REST services,” IEEE Access, vol. 9, pp. 24738–24754, 2021.

  8. [8] V. Atlidakis, P. Godefroid, and M. Polishchuk, “RESTler: Stateful REST API fuzzing,” in Proc. IEEE/ACM Int. Conf. Software Engineering (ICSE). IEEE, 2019, pp. 748–758.

  9. [9] A. Arcuri, “RESTful API automated test case generation with EvoMaster,” ACM Trans. Software Engineering and Methodology, vol. 28, no. 1, pp. 1–37, 2019.

  10. [10] M. Zhang and A. Arcuri, “Open problems in fuzzing RESTful APIs: A comparison of tools,” ACM Trans. Software Engineering and Methodology, vol. 32, no. 6, pp. 1–45, 2023.

  11. [11] C. Pacheco and M. D. Ernst, “Randoop: Feedback-directed random testing for Java,” in Companion to OOPSLA. ACM, 2007, pp. 815–816.

  12. [12] T. Brown, B. Mann, N. Ryder, M. Subbiah et al., “Language models are few-shot learners,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020.

  13. [13] J. Wei, X. Wang, D. Schuurmans, M. Bosma et al., “Chain-of-thought prompting elicits reasoning in large language models,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 35, 2022.

  14. [14] A. Madaan, N. Tandon, P. Gupta et al., “Self-refine: Iterative refinement with self-feedback,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 36, 2023.

  15. [15] J. von Kistowski, S. Eismann, N. Schmitt, A. Bauer, J. Grohmann, and S. Kounev, “TeaStore: A micro-service reference application for benchmarking, modeling and resource management research,” in Proc. IEEE Int. Symp. Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS). IEEE, 2018, pp. 223–236.

  16. [16] OpenTelemetry, “OpenTelemetry demo (astronomy shop),” https://github.com/open-telemetry/opentelemetry-demo, 2024, accessed: 2025.

  17. [17] Y. Song, G. Wang, S. Li, and B. Y. Lin, “The good, the bad, and the greedy: Evaluation of LLMs should not ignore non-determinism,” in Proc. Conf. Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL), 2025, pp. 4195–4206.