Recognition: unknown
RESTestBench: A Benchmark for Evaluating the Effectiveness of LLM-Generated REST API Test Cases from NL Requirements
Pith reviewed 2026-05-07 15:53 UTC · model grok-4.3
The pith
RESTestBench shows LLM-generated REST API tests lose effectiveness when the generator interacts with faulty code during refinement, especially under vague requirements.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using RESTestBench, we show that test effectiveness drops considerably when the generator interacts with faulty or mutated code, especially for vague requirements, sometimes negating the benefit of refinement and indicating that incorporating actual SUT behaviour is unnecessary when requirement detail is high.
What carries the argument
The requirements-based mutation testing metric, which scores how effectively a test case detects mutations that violate a given natural language requirement.
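The paper's exact scoring formula is not reproduced on this page. A plausible formalization, assuming each requirement r is paired with a mutant set M_r whose members each violate r (an assumption about the benchmark, not a confirmed detail):

```latex
% Plausible formalization, not the paper's verbatim definition.
% M_r       : mutants crafted so that each violates requirement r
% kill(t,m) : 1 if test t fails on mutant m while passing on the
%             unmutated SUT, 0 otherwise
\mathrm{Eff}(t, r) \;=\; \frac{1}{|M_r|} \sum_{m \in M_r} \mathrm{kill}(t, m)
```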
If this is right
- Precise natural language requirements yield higher test effectiveness than vague ones across the tested LLMs.
- Refinement via interaction with the running service improves effectiveness only when requirements are vague and code is correct.
- When requirements are detailed, direct generation without SUT interaction performs as well as or better than refinement.
- Test effectiveness is sensitive to the presence of faults in the code observed during generation.
Where Pith is reading between the lines
- Teams could focus effort on writing precise requirements rather than building complex refinement loops for LLM test generators.
- Applying the same benchmark and metric to non-REST APIs or different mutation operators would test whether the pattern generalizes.
- The results suggest exploring whether static analysis or other non-execution context could substitute for runtime interaction in high-detail cases.
Load-bearing premise
The manually verified natural language requirements accurately and completely capture the intended functional behavior of the REST services, and the mutation metric correctly measures requirement-specific fault detection.
What would settle it
An experiment in which refinement on mutated code yields fault-detection rates for vague requirements equal to or higher than those of non-refined generation would falsify the reported drop in effectiveness.
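A minimal sketch of that settling experiment, assuming hypothetical callables for test generation and mutant execution; none of these names are RESTestBench's published interface:

```python
from typing import Callable, Sequence

# Sketch of the falsification check; generate_tests and kill_rate are
# hypothetical callables, not RESTestBench's published interface.
def claim_falsified(
    vague_requirements: Sequence[str],
    mutants: Sequence[object],
    generate_tests: Callable[[str, bool], list],           # (requirement, refine) -> tests
    kill_rate: Callable[[list, Sequence[object]], float],  # fraction of mutants killed
) -> bool:
    """True iff refinement against mutated code matches or beats
    non-refined generation on every vague requirement."""
    for req in vague_requirements:
        baseline = kill_rate(generate_tests(req, False), mutants)  # no SUT interaction
        refined = kill_rate(generate_tests(req, True), mutants)    # refined on a mutated SUT
        if refined < baseline:
            return False  # the reported drop reproduces; the claim stands here
    return True
```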
Original abstract
Existing REST API testing tools are typically evaluated using code coverage and crash-based fault metrics. However, recent LLM-based approaches increasingly generate tests from NL requirements to validate functional behaviour, making traditional metrics weak proxies for whether generated tests validate intended behaviour. To address this gap, we present RESTestBench, a benchmark comprising three REST services paired with manually verified NL requirements in both precise and vague variants, enabling controlled and reproducible evaluation of requirement-based test generation. RESTestBench further introduces a requirements-based mutation testing metric that measures the fault-detection effectiveness of a generated test case with respect to a specific requirement, extending the property-based approach of Bartocci et al. Using RESTestBench, we evaluate two approaches across multiple state-of-the-art LLMs: (i) non-refinement-based generation, and (ii) refinement-based generation guided by interaction with the running SUT. In the refinement experiments, RESTestBench assesses how exposure to the actual implementation, valid or mutated, affects test effectiveness. Our results show that test effectiveness drops considerably when the generator interacts with faulty or mutated code, especially for vague requirements, sometimes negating the benefit of refinement and indicating that incorporating actual SUT behaviour is unnecessary when requirement detail is high.
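Read operationally, the abstract's two conditions differ only in whether generation observes the running SUT. The loop below is an editorial reconstruction under stated assumptions (the llm_generate and run_suite callables and the round budget are invented); it is not the paper's algorithm:

```python
from typing import Callable, Optional

# Editorial reconstruction of conditions (i) and (ii); llm_generate and
# run_suite are invented stand-ins, not the paper's implementation.
def generate_suite(
    requirement: str,
    llm_generate: Callable[..., list],                   # NL requirement (+ feedback) -> tests
    run_suite: Optional[Callable[[list], dict]] = None,  # executes tests against the SUT
    max_rounds: int = 3,
) -> list:
    tests = llm_generate(requirement)        # (i) non-refinement baseline
    if run_suite is None:
        return tests
    for _ in range(max_rounds):              # (ii) refinement against the SUT
        feedback = run_suite(tests)          # observed behaviour: valid or mutated
        if not feedback.get("failures"):
            break
        # Feeding back responses from a *mutated* SUT risks aligning the
        # tests with faulty behaviour, which is the drop the paper reports.
        tests = llm_generate(requirement, feedback=feedback)
    return tests
```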
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces RESTestBench, a benchmark with three REST services and manually verified NL requirements in precise and vague variants, plus a requirements-based mutation testing metric extending Bartocci et al.'s property-based approach. It evaluates LLM-based test generation (non-refinement vs. refinement with valid/mutated SUT interaction) and claims that effectiveness drops substantially on mutated code—especially for vague requirements—sometimes negating refinement benefits, while high-detail requirements render SUT interaction unnecessary.
Significance. If the metric and requirements hold, the work provides a needed functional-behavior-focused alternative to coverage/crash metrics for assessing LLM-generated REST tests from NL specs. The controlled precise/vague variants and valid/mutated SUT conditions are a clear strength for isolating when refinement helps or harms, with potential to guide practical tool design in API testing.
major comments (3)
- [Abstract and §4] Abstract and §4 (Evaluation): The central claims about effectiveness drops and negation of refinement benefits rest on empirical results, yet no details are supplied on the number of LLMs tested, total test cases generated per condition, statistical tests used, or exact scoring procedure for the requirements-based mutation metric (e.g., how mutants are chosen to target specific requirement violations). This leaves the reported differences unverifiable, even though they are load-bearing for the conclusions.
- [§3.2] §3.2 (Requirements-Based Mutation Testing Metric): The metric is presented as extending Bartocci et al. to quantify fault detection w.r.t. a specific NL requirement, but supplies no operational definition, no worked example of property extraction from vague versus precise text, and no description of mutant generation or scoring to ensure mutants correspond to requirement violations rather than unrelated faults. Without this, differences attributed to requirement detail or SUT mutation cannot be isolated from metric artifacts.
- [§3.1] §3.1 (NL Requirements): The benchmark relies on 'manually verified' requirements as ground truth for functional intent, yet reports no verification procedure, inter-rater checks, or evidence that the requirements are independent of the implementation. This directly affects the weakest assumption underlying the metric and the precise/vague comparison.
minor comments (3)
- [Abstract] The abstract would be clearer if it stated the exact number of services, LLMs, and test cases upfront rather than leaving them implicit.
- [§3.2] Notation for the mutation metric (e.g., how effectiveness is aggregated across requirements) should be defined explicitly in a dedicated subsection or equation; one candidate aggregation is sketched after this list.
- [§2] Related-work discussion of Bartocci et al. could include a brief comparison table of the original property-based approach versus the new requirements-based extension.
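One candidate aggregation, macro-averaging the per-requirement score Eff(t, r) sketched earlier over the requirement set R and the tests T_r generated for each requirement (a suggestion, not the paper's definition):

```latex
% Candidate aggregation only; the paper's own definition is not given here.
% R   : set of NL requirements
% T_r : test cases generated for requirement r
\mathrm{Eff}(T) \;=\; \frac{1}{|R|} \sum_{r \in R} \frac{1}{|T_r|} \sum_{t \in T_r} \mathrm{Eff}(t, r)
```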
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We appreciate the identification of areas where additional clarity and detail will strengthen the presentation of RESTestBench and its evaluation. We address each major comment below and commit to revisions that make the work more verifiable without altering the core contributions.
Point-by-point responses
- Referee: [Abstract and §4] Abstract and §4 (Evaluation): The central claims about effectiveness drops and negation of refinement benefits rest on empirical results, yet no details are supplied on the number of LLMs tested, total test cases generated per condition, statistical tests used, or exact scoring procedure for the requirements-based mutation metric (e.g., how mutants are chosen to target specific requirement violations). This leaves the reported differences unverifiable, even though they are load-bearing for the conclusions.
Authors: We agree that these experimental details are essential for verifiability and were insufficiently specified in the current draft. In the revised manuscript we will expand §4 (and update the abstract) to report: the precise LLMs and versions evaluated, the total number of test cases generated under each condition (non-refinement, refinement with valid SUT, refinement with mutated SUT), the statistical tests performed (including p-values and effect sizes), and a complete description of the mutation-metric scoring procedure, including the mutant-selection criteria used to target requirement violations. These additions will allow readers to reproduce and assess the reported effectiveness drops. revision: yes
- Referee: [§3.2] §3.2 (Requirements-Based Mutation Testing Metric): The metric is presented as extending Bartocci et al. to quantify fault detection w.r.t. a specific NL requirement, but supplies no operational definition, no worked example of property extraction from vague versus precise text, and no description of mutant generation or scoring to ensure mutants correspond to requirement violations rather than unrelated faults. Without this, differences attributed to requirement detail or SUT mutation cannot be isolated from metric artifacts.
Authors: We acknowledge that the current description of the metric remains too high-level. We will revise §3.2 to include an operational definition of the requirements-based mutation score, a concrete worked example that extracts properties from both a vague and a precise requirement, the mutant-generation process (including the operators chosen and how they are aligned with specific requirement violations), and the exact scoring rules. These additions will clarify how the metric isolates the effects of requirement detail and SUT mutation from unrelated faults. revision: yes
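For a sense of what such a worked example could look like, the sketch below contrasts the oracle strength extracted from a precise versus a vague requirement; the endpoint, requirement texts, and status codes are invented for illustration and are not drawn from the benchmark:

```python
# Invented illustration of property extraction from precise vs. vague
# requirements; the endpoint, payloads, and codes are not from RESTestBench.
import requests

BASE = "http://localhost:8000"  # assumed local deployment of a REST SUT

# Precise: "POST /users with an already-registered email returns 409
# and does not create a second account."
def test_duplicate_email_precise():
    user = {"email": "a@example.com", "password": "secret123"}
    requests.post(f"{BASE}/users", json=user)         # first registration
    resp = requests.post(f"{BASE}/users", json=user)  # duplicate attempt
    assert resp.status_code == 409                    # exact, mutant-killing oracle

# Vague: "The API should handle invalid sign-ups gracefully."
# Only a weak property survives extraction: no server-side crash.
def test_invalid_signup_vague():
    resp = requests.post(f"{BASE}/users", json={"email": "not-an-email"})
    assert resp.status_code < 500                     # weak oracle; many mutants survive
```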
- Referee: [§3.1] §3.1 (NL Requirements): The benchmark relies on 'manually verified' requirements as ground truth for functional intent, yet reports no verification procedure, inter-rater checks, or evidence that the requirements are independent of the implementation. This directly affects the weakest assumption underlying the metric and the precise/vague comparison.
Authors: We agree that the verification process for the NL requirements must be documented. In the revised §3.1 we will describe the manual verification steps performed, any consistency checks applied (including inter-rater procedures if multiple reviewers were involved), and the measures taken to ensure requirements were derived from specifications and documentation independently of the implementation. If formal inter-rater metrics were not computed in the original process, we will either add them or explicitly note this as a limitation while still providing the verification protocol used. revision: yes
Circularity Check
No circularity: empirical benchmark and metric evaluation is self-contained
Full rationale
The paper introduces RESTestBench as a new benchmark with manually verified NL requirements (precise/vague variants) and defines a requirements-based mutation testing metric as an explicit extension of the external property-based approach from Bartocci et al. Central claims about test effectiveness under refinement (with/without mutated SUT) are obtained directly from controlled experiments on three REST services using state-of-the-art LLMs; no equations, parameters, or results are fitted to the target outcomes and then relabeled as predictions. The single external citation is not self-referential, does not bear the uniqueness or definitional load, and is not required to close any derivation loop. The work is therefore an independent empirical contribution whose validity rests on the experimental design rather than on any reduction to its own inputs.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Manually verified NL requirements accurately represent the intended functional behavior of the REST services.
- Domain assumption: The requirements-based mutation testing metric correctly measures fault-detection effectiveness for a specific requirement.
invented entities (2)
- RESTestBench benchmark: no independent evidence
- requirements-based mutation testing metric: no independent evidence
Reference graph
Works this paper leans on
- [1] [n. d.]. realworld-apps/realworld: "The Mother of All Demo Apps" — Exemplary fullstack Medium.com clone powered by React, Angular, Node, Django, and many more. https://github.com/realworld-apps/realworld
- [2] FastAPI. 2026. fastapi/full-stack-fastapi-template.
- [3] Juan C. Alonso, Sergio Segura, and Antonio Ruiz-Cortés. 2023. AGORA: Automated Generation of Test Oracles for REST APIs. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis. ACM, Seattle, WA, USA, 1018–1030. doi:10.1145/3597926.3598114
- [4] Paul Ammann and Jeff Offutt. 2017. Introduction to Software Testing (2nd ed.). Cambridge University Press, Cambridge.
- [5] Henry Andrews. 2026. OpenAPI Specification.
- [6] Andrea Arcuri. 2021. Automated Black- and White-Box Testing of RESTful APIs With EvoMaster. IEEE Software 38, 3 (May 2021), 72–78. doi:10.1109/MS.2020.3013820
- [7] Chetan Arora, Tomas Herda, and Verena Homm. 2024. Generating Test Scenarios from NL Requirements Using Retrieval-Augmented LLMs: An Industrial Study. In 2024 IEEE 32nd International Requirements Engineering Conference (RE). 240–251. arXiv:2404.12772 [cs] doi:10.1109/RE59067.2024.00031
- [8]
- [9] Cristian Augusto, Antonia Bertolino, Guglielmo De Angelis, Francesca Lonetti, and Jesús Morán. 2025. Large Language Models for Software Testing: A Research Roadmap. doi:10.48550/ARXIV.2509.25043
- [10] Earl T. Barr, Mark Harman, Phil McMinn, Muzammil Shahbaz, and Shin Yoo. 2015. The Oracle Problem in Software Testing: A Survey. IEEE Transactions on Software Engineering 41, 5 (May 2015), 507–525. doi:10.1109/TSE.2014.2372785
- [12]
- [13]
- [14] Alistair Cockburn. 2012. Writing Effective Use Cases (24th printing ed.). Addison-Wesley, Boston.
- [15] Henry Coles, Thomas Laurent, Christopher Henard, Mike Papadakis, and Anthony Ventresque. 2016. PIT: A Practical Mutation Testing Tool for Java (Demo). In Proceedings of the 25th International Symposium on Software Testing and Analysis. ACM, Saarbrücken, Germany, 449–452. doi:10.1145/2931037.2948707
- [16]
- [17] Serge Demeyer, Ali Parsai, Sten Vercammen, Brent Van Bladel, and Mehrdad Abdi. Formal Verification of Developer Tests: A Research Agenda Inspired by Mutation Testing. In Leveraging Applications of Formal Methods, Verification and Validation: Engineering Principles, Tiziana Margaria and Bernhard Steffen (Eds.). Vol. 12477. Springer International Publishing, Cham, 9–24. doi:10.1007/978-3-030-61470-6_2
- [19] David Fowler. 2026. davidfowl/TodoApp.
- [21] Amid Golmohammadi, Man Zhang, and Andrea Arcuri. 2024. Testing RESTful APIs: A Survey. ACM Transactions on Software Engineering and Methodology 33, 1 (Jan. 2024), 1–41. doi:10.1145/3617175
- [22] Alex Groce, Josie Holmes, Darko Marinov, August Shi, and Lingming Zhang. An Extensible, Regular-Expression-Based Tool for Multi-Language Mutant Generation. In Proceedings of the 40th International Conference on Software Engineering: Companion Proceedings. ACM, Gothenburg, Sweden, 25–28. doi:10.1145/3183440.3183485
- [24] Soneya Binta Hossain and Matthew B. Dwyer. 2022. A Brief Survey on Oracle-based Test Adequacy Metrics. doi:10.48550/ARXIV.2212.06118
- [25]
- [26] Laura Inozemtseva and Reid Holmes. 2014. Coverage Is Not Strongly Correlated with Test Suite Effectiveness. In Proceedings of the 36th International Conference on Software Engineering. ACM, Hyderabad, India, 435–445. doi:10.1145/2568225.2568271
- [27] Ivar Jacobson. 2005. Object-Oriented Software Engineering: A Use Case Driven Approach. Addison-Wesley, Reading, Mass.
- [28] Lukas Jakob. 2026. lujakob/nestjs-realworld-example-app.
- [29] Leon Kogler, Maximilian Ehrhart, Benedikt Dornauer, and Eduard Paul Enoiu.
- [30]
- [31] Michael Konstantinou, Renzo Degiovanni, and Mike Papadakis. 2024. Do LLMs Generate Test Oracles That Capture the Actual or the Expected Program Behaviour? doi:10.48550/ARXIV.2410.21136
- [32]
- [33] Alberto Martin-Lopez, Sergio Segura, and Antonio Ruiz-Cortés. 2021. RESTest: Automated Black-Box Testing of RESTful Web APIs. In Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis. ACM, Virtual, Denmark, 682–685. doi:10.1145/3460319.3469082
- [34] Mohammed Mudassir and Mohammed Mushtaq. 2024. The Role of APIs in Modern Software Development. World Journal of Advanced Engineering Technology and Sciences 13, 1 (Oct. 2024), 1045–1047. doi:10.30574/wjaets.2024.13.1.0515
- [35]
- [36] Mike Papadakis, Marinos Kintis, Jie Zhang, Yue Jia, Yves Le Traon, and Mark Harman. 2019. Mutation Testing Advances: An Analysis and Survey. In Advances in Computers. Vol. 112. Elsevier, 275–378. doi:10.1016/bs.adcom.2018.03.015
- [37] Mike Papadakis, Donghwan Shin, Shin Yoo, and Doo-Hwan Bae. 2018. Are Mutation Scores Correlated with Real Fault Detection? A Large Scale Empirical Study on the Relationship between Mutants and Real Faults. In Proceedings of the 40th International Conference on Software Engineering. ACM, Gothenburg, Sweden, 537–548. doi:10.1145/3180155.3180183
- [38] André Pereira, Bruno Lima, and João Pascoal Faria. 2024. APITestGenie: Automated API Test Generation through Generative AI. doi:10.48550/ARXIV.2409.03838
- [39] Postman. [n. d.]. State of the API Report. Technical Report.
- [40] H. G. Rice. 1953. Classes of Recursively Enumerable Sets and Their Decision Problems. Trans. Amer. Math. Soc. 74, 2 (1953), 358–366. doi:10.1090/S0002-9947-1953-0053041-6
- [41] Ana B. Sánchez, José A. Parejo, Sergio Segura, Amador Durán, and Mike Papadakis. 2024. Mutation Testing in Practice: Insights From Open-Source Software Developers. IEEE Transactions on Software Engineering 50, 5 (May 2024), 1130–. doi:10.1109/TSE.2024.3377378
- [43] A. Van Lamsweerde. 2000. Goal-Oriented Requirements Engineering: A Guided Tour. In Proceedings Fifth IEEE International Symposium on Requirements Engineering. IEEE Comput. Soc, Toronto, Ont., Canada, 249–262. doi:10.1109/ISRE.2001.948567
- [44] Dries Vanoverberghe, Jonathan De Halleux, Nikolai Tillmann, and Frank Piessens. 2012. State Coverage: Software Validation Metrics beyond Code Coverage. In SOFSEM 2012: Theory and Practice of Computer Science, Mária Bieliková, Gerhard Friedrich, Georg Gottlob, Stefan Katzenbeisser, and György Turán (Eds.). Vol. 7147. Springer Berlin Heidelberg, Berlin, Heidelberg, 542–553. doi:10.1007/978-3-642-27660-6_44
- [46] Karl E. Wiegers and Joy Beatty. 2013. Software Requirements (3rd ed., fully updated and expanded). Microsoft Press, Redmond, Wash.
- [47]
- [48] Ke Zhang, Chenxi Zhang, Chong Wang, Chi Zhang, YaChen Wu, Zhenchang Xing, Yang Liu, Qingshan Li, and Xin Peng. 2025. LogiAgent: Automated Logical Testing for REST Systems with LLM-Based Multi-Agents. doi:10.48550/ARXIV.2503.15079
- [49] Didar Zowghi and Vincenzo Gervasi. 2004. Erratum to "On the Interplay between Consistency, Completeness, and Correctness in Requirements Evolution". Information and Software Technology 46, 11 (Sept. 2004), 763–779. doi:10.1016/j.infsof.2004.03.003