Recognition: unknown
RESTestBench: A Benchmark for Evaluating the Effectiveness of LLM-Generated REST API Test Cases from NL Requirements
Pith reviewed 2026-05-07 15:53 UTC · model grok-4.3
The pith
RESTestBench shows LLM-generated REST API tests lose effectiveness when the generator interacts with faulty code during refinement, especially under vague requirements.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using RESTestBench, we show that test effectiveness drops considerably when the generator interacts with faulty or mutated code, especially for vague requirements, sometimes negating the benefit of refinement and indicating that incorporating actual SUT behaviour is unnecessary when requirement detail is high.
What carries the argument
The requirements-based mutation testing metric, which scores how effectively a test case detects mutations that violate a given natural language requirement.
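The paper's exact scoring formula is not reproduced on this page. A plausible formalization, assuming each requirement r is paired with a mutant set M_r whose members each violate r (an assumption about the benchmark, not a confirmed detail):

```latex
% Plausible formalization, not the paper's verbatim definition.
% M_r       : mutants crafted so that each violates requirement r
% kill(t,m) : 1 if test t fails on mutant m while passing on the
%             unmutated SUT, 0 otherwise
\mathrm{Eff}(t, r) \;=\; \frac{1}{|M_r|} \sum_{m \in M_r} \mathrm{kill}(t, m)
```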
If this is right
- Precise natural language requirements yield higher test effectiveness than vague ones across the tested LLMs.
- Refinement via interaction with the running service improves effectiveness only when requirements are vague and code is correct.
- When requirements are detailed, direct generation without SUT interaction performs as well as or better than refinement.
- Test effectiveness is sensitive to the presence of faults in the code observed during generation.
Where Pith is reading between the lines
- Teams could focus effort on writing precise requirements rather than building complex refinement loops for LLM test generators.
- Applying the same benchmark and metric to non-REST APIs or different mutation operators would test whether the pattern generalizes.
- The results suggest exploring whether static analysis or other non-execution context could substitute for runtime interaction in high-detail cases.
Load-bearing premise
The manually verified natural language requirements accurately and completely capture the intended functional behavior of the REST services, and the mutation metric correctly measures requirement-specific fault detection.
What would settle it
An experiment in which refinement on mutated code yields fault-detection rates for vague requirements equal to or higher than those of non-refined generation would falsify the reported drop in effectiveness.
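A minimal sketch of that settling experiment, assuming hypothetical callables for test generation and mutant execution; none of these names are RESTestBench's published interface:

```python
from typing import Callable, Sequence

# Sketch of the falsification check; generate_tests and kill_rate are
# hypothetical callables, not RESTestBench's published interface.
def claim_falsified(
    vague_requirements: Sequence[str],
    mutants: Sequence[object],
    generate_tests: Callable[[str, bool], list],           # (requirement, refine) -> tests
    kill_rate: Callable[[list, Sequence[object]], float],  # fraction of mutants killed
) -> bool:
    """True iff refinement against mutated code matches or beats
    non-refined generation on every vague requirement."""
    for req in vague_requirements:
        baseline = kill_rate(generate_tests(req, False), mutants)  # no SUT interaction
        refined = kill_rate(generate_tests(req, True), mutants)    # refined on a mutated SUT
        if refined < baseline:
            return False  # the reported drop reproduces; the claim stands here
    return True
```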
Original abstract
Existing REST API testing tools are typically evaluated using code coverage and crash-based fault metrics. However, recent LLM-based approaches increasingly generate tests from NL requirements to validate functional behaviour, making traditional metrics weak proxies for whether generated tests validate intended behaviour. To address this gap, we present RESTestBench, a benchmark comprising three REST services paired with manually verified NL requirements in both precise and vague variants, enabling controlled and reproducible evaluation of requirement-based test generation. RESTestBench further introduces a requirements-based mutation testing metric that measures the fault-detection effectiveness of a generated test case with respect to a specific requirement, extending the property-based approach of Bartocci et al. Using RESTestBench, we evaluate two approaches across multiple state-of-the-art LLMs: (i) non-refinement-based generation, and (ii) refinement-based generation guided by interaction with the running SUT. In the refinement experiments, RESTestBench assesses how exposure to the actual implementation, valid or mutated, affects test effectiveness. Our results show that test effectiveness drops considerably when the generator interacts with faulty or mutated code, especially for vague requirements, sometimes negating the benefit of refinement and indicating that incorporating actual SUT behaviour is unnecessary when requirement detail is high.
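Read operationally, the abstract's two conditions differ only in whether generation observes the running SUT. The loop below is an editorial reconstruction under stated assumptions (the llm_generate and run_suite callables and the round budget are invented); it is not the paper's algorithm:

```python
from typing import Callable, Optional

# Editorial reconstruction of conditions (i) and (ii); llm_generate and
# run_suite are invented stand-ins, not the paper's implementation.
def generate_suite(
    requirement: str,
    llm_generate: Callable[..., list],                   # NL requirement (+ feedback) -> tests
    run_suite: Optional[Callable[[list], dict]] = None,  # executes tests against the SUT
    max_rounds: int = 3,
) -> list:
    tests = llm_generate(requirement)        # (i) non-refinement baseline
    if run_suite is None:
        return tests
    for _ in range(max_rounds):              # (ii) refinement against the SUT
        feedback = run_suite(tests)          # observed behaviour: valid or mutated
        if not feedback.get("failures"):
            break
        # Feeding back responses from a *mutated* SUT risks aligning the
        # tests with faulty behaviour, which is the drop the paper reports.
        tests = llm_generate(requirement, feedback=feedback)
    return tests
```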
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces RESTestBench, a benchmark with three REST services and manually verified NL requirements in precise and vague variants, plus a requirements-based mutation testing metric extending Bartocci et al.'s property-based approach. It evaluates LLM-based test generation (non-refinement vs. refinement with valid/mutated SUT interaction) and claims that effectiveness drops substantially on mutated code—especially for vague requirements—sometimes negating refinement benefits, while high-detail requirements render SUT interaction unnecessary.
Significance. If the metric and requirements hold, the work provides a needed functional-behavior-focused alternative to coverage/crash metrics for assessing LLM-generated REST tests from NL specs. The controlled precise/vague variants and valid/mutated SUT conditions are a clear strength for isolating when refinement helps or harms, with potential to guide practical tool design in API testing.
major comments (3)
- [Abstract and §4] Abstract and §4 (Evaluation): The central claims about effectiveness drops and negation of refinement benefits rest on empirical results, yet no details are supplied on the number of LLMs tested, total test cases generated per condition, statistical tests used, or exact scoring procedure for the requirements-based mutation metric (e.g., how mutants are chosen to target specific requirement violations). This leaves the reported differences unverifiable, even though they are load-bearing for the conclusions.
- [§3.2] §3.2 (Requirements-Based Mutation Testing Metric): The metric is presented as extending Bartocci et al. to quantify fault detection w.r.t. a specific NL requirement, but supplies no operational definition, no worked example of property extraction from vague versus precise text, and no description of mutant generation or scoring to ensure mutants correspond to requirement violations rather than unrelated faults. Without this, differences attributed to requirement detail or SUT mutation cannot be isolated from metric artifacts.
- [§3.1] §3.1 (NL Requirements): The benchmark relies on 'manually verified' requirements as ground truth for functional intent, yet reports no verification procedure, inter-rater checks, or evidence that the requirements are independent of the implementation. This directly affects the weakest assumption underlying the metric and the precise/vague comparison.
minor comments (3)
- [Abstract] The abstract would be clearer if it stated the exact number of services, LLMs, and test cases upfront rather than leaving them implicit.
- [§3.2] Notation for the mutation metric (e.g., how effectiveness is aggregated across requirements) should be defined explicitly in a dedicated subsection or equation; one candidate aggregation is sketched after this list.
- [§2] Related-work discussion of Bartocci et al. could include a brief comparison table of the original property-based approach versus the new requirements-based extension.
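One candidate aggregation, macro-averaging the per-requirement score Eff(t, r) sketched earlier over the requirement set R and the tests T_r generated for each requirement (a suggestion, not the paper's definition):

```latex
% Candidate aggregation only; the paper's own definition is not given here.
% R   : set of NL requirements
% T_r : test cases generated for requirement r
\mathrm{Eff}(T) \;=\; \frac{1}{|R|} \sum_{r \in R} \frac{1}{|T_r|} \sum_{t \in T_r} \mathrm{Eff}(t, r)
```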
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We appreciate the identification of areas where additional clarity and detail will strengthen the presentation of RESTestBench and its evaluation. We address each major comment below and commit to revisions that make the work more verifiable without altering the core contributions.
Point-by-point responses
- Referee: [Abstract and §4] Abstract and §4 (Evaluation): The central claims about effectiveness drops and negation of refinement benefits rest on empirical results, yet no details are supplied on the number of LLMs tested, total test cases generated per condition, statistical tests used, or exact scoring procedure for the requirements-based mutation metric (e.g., how mutants are chosen to target specific requirement violations). This leaves the reported differences unverifiable, even though they are load-bearing for the conclusions.
Authors: We agree that these experimental details are essential for verifiability and were insufficiently specified in the current draft. In the revised manuscript we will expand §4 (and update the abstract) to report: the precise LLMs and versions evaluated, the total number of test cases generated under each condition (non-refinement, refinement with valid SUT, refinement with mutated SUT), the statistical tests performed (including p-values and effect sizes), and a complete description of the mutation-metric scoring procedure, including the mutant-selection criteria used to target requirement violations. These additions will allow readers to reproduce and assess the reported effectiveness drops. revision: yes
- Referee: [§3.2] §3.2 (Requirements-Based Mutation Testing Metric): The metric is presented as extending Bartocci et al. to quantify fault detection w.r.t. a specific NL requirement, but supplies no operational definition, no worked example of property extraction from vague versus precise text, and no description of mutant generation or scoring to ensure mutants correspond to requirement violations rather than unrelated faults. Without this, differences attributed to requirement detail or SUT mutation cannot be isolated from metric artifacts.
Authors: We acknowledge that the current description of the metric remains too high-level. We will revise §3.2 to include an operational definition of the requirements-based mutation score, a concrete worked example that extracts properties from both a vague and a precise requirement, the mutant-generation process (including the operators chosen and how they are aligned with specific requirement violations), and the exact scoring rules. These additions will clarify how the metric isolates the effects of requirement detail and SUT mutation from unrelated faults. revision: yes
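For a sense of what such a worked example could look like, the sketch below contrasts the oracle strength extracted from a precise versus a vague requirement; the endpoint, requirement texts, and status codes are invented for illustration and are not drawn from the benchmark:

```python
# Invented illustration of property extraction from precise vs. vague
# requirements; the endpoint, payloads, and codes are not from RESTestBench.
import requests

BASE = "http://localhost:8000"  # assumed local deployment of a REST SUT

# Precise: "POST /users with an already-registered email returns 409
# and does not create a second account."
def test_duplicate_email_precise():
    user = {"email": "a@example.com", "password": "secret123"}
    requests.post(f"{BASE}/users", json=user)         # first registration
    resp = requests.post(f"{BASE}/users", json=user)  # duplicate attempt
    assert resp.status_code == 409                    # exact, mutant-killing oracle

# Vague: "The API should handle invalid sign-ups gracefully."
# Only a weak property survives extraction: no server-side crash.
def test_invalid_signup_vague():
    resp = requests.post(f"{BASE}/users", json={"email": "not-an-email"})
    assert resp.status_code < 500                     # weak oracle; many mutants survive
```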
- Referee: [§3.1] §3.1 (NL Requirements): The benchmark relies on 'manually verified' requirements as ground truth for functional intent, yet reports no verification procedure, inter-rater checks, or evidence that the requirements are independent of the implementation. This directly affects the weakest assumption underlying the metric and the precise/vague comparison.
Authors: We agree that the verification process for the NL requirements must be documented. In the revised §3.1 we will describe the manual verification steps performed, any consistency checks applied (including inter-rater procedures if multiple reviewers were involved), and the measures taken to ensure requirements were derived from specifications and documentation independently of the implementation. If formal inter-rater metrics were not computed in the original process, we will either add them or explicitly note this as a limitation while still providing the verification protocol used. revision: yes
Circularity Check
No circularity: empirical benchmark and metric evaluation is self-contained
Full rationale
The paper introduces RESTestBench as a new benchmark with manually verified NL requirements (precise/vague variants) and defines a requirements-based mutation testing metric as an explicit extension of the external property-based approach from Bartocci et al. Central claims about test effectiveness under refinement (with/without mutated SUT) are obtained directly from controlled experiments on three REST services using state-of-the-art LLMs; no equations, parameters, or results are fitted to the target outcomes and then relabeled as predictions. The single external citation is not self-referential, does not bear the uniqueness or definitional load, and is not required to close any derivation loop. The work is therefore an independent empirical contribution whose validity rests on the experimental design rather than on any reduction to its own inputs.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Manually verified NL requirements accurately represent the intended functional behavior of the REST services.
- Domain assumption: The requirements-based mutation testing metric correctly measures fault-detection effectiveness for a specific requirement.
invented entities (2)
- RESTestBench benchmark: no independent evidence
- requirements-based mutation testing metric: no independent evidence
Reference graph
Works this paper leans on
- [1] [n. d.]. realworld-apps/realworld: "The Mother of All Demo Apps" — Exemplary fullstack Medium.com clone powered by React, Angular, Node, Django, and many more. https://github.com/realworld-apps/realworld
- [2] FastAPI. 2026. fastapi/full-stack-fastapi-template.
- [3] Juan C. Alonso, Sergio Segura, and Antonio Ruiz-Cortés. 2023. AGORA: Automated Generation of Test Oracles for REST APIs. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis. ACM, Seattle, WA, USA, 1018–1030. doi:10.1145/3597926.3598114
- [4] Paul Ammann and Jeff Offutt. 2017. Introduction to Software Testing (2nd ed.). Cambridge University Press, Cambridge.
- [5] Henry Andrews. 2026. OpenAPI Specification.
- [6] Andrea Arcuri. 2021. Automated Black- and White-Box Testing of RESTful APIs With EvoMaster. IEEE Software 38, 3 (May 2021), 72–78. doi:10.1109/MS.2020.3013820
- [7] Chetan Arora, Tomas Herda, and Verena Homm. 2024. Generating Test Scenarios from NL Requirements Using Retrieval-Augmented LLMs: An Industrial Study. In 2024 IEEE 32nd International Requirements Engineering Conference (RE). 240–251. arXiv:2404.12772 [cs] doi:10.1109/RE59067.2024.00031
- [8]
- [9] Cristian Augusto, Antonia Bertolino, Guglielmo De Angelis, Francesca Lonetti, and Jesús Morán. 2025. Large Language Models for Software Testing: A Research Roadmap. doi:10.48550/ARXIV.2509.25043
- [10] Earl T. Barr, Mark Harman, Phil McMinn, Muzammil Shahbaz, and Shin Yoo. 2015. The Oracle Problem in Software Testing: A Survey. IEEE Transactions on Software Engineering 41, 5 (May 2015), 507–525. doi:10.1109/TSE.2014.2372785
- [12]
- [13]
- [14] Alistair Cockburn. 2012. Writing Effective Use Cases (24th printing ed.). Addison-Wesley, Boston.
- [15] Henry Coles, Thomas Laurent, Christopher Henard, Mike Papadakis, and Anthony Ventresque. 2016. PIT: A Practical Mutation Testing Tool for Java (Demo). In Proceedings of the 25th International Symposium on Software Testing and Analysis. ACM, Saarbrücken, Germany, 449–452. doi:10.1145/2931037.2948707
- [16]
- [17] Serge Demeyer, Ali Parsai, Sten Vercammen, Brent Van Bladel, and Mehrdad Abdi. Formal Verification of Developer Tests: A Research Agenda Inspired by Mutation Testing. In Leveraging Applications of Formal Methods, Verification and Validation: Engineering Principles, Tiziana Margaria and Bernhard Steffen (Eds.). Vol. 12477. Springer International Publishing, Cham, 9–24. doi:10.1007/978-3-030-61470-6_2
- [19] David Fowler. 2026. davidfowl/TodoApp.
- [21] Amid Golmohammadi, Man Zhang, and Andrea Arcuri. 2024. Testing RESTful APIs: A Survey. ACM Transactions on Software Engineering and Methodology 33, 1 (Jan. 2024), 1–41. doi:10.1145/3617175
- [22] Alex Groce, Josie Holmes, Darko Marinov, August Shi, and Lingming Zhang. An Extensible, Regular-Expression-Based Tool for Multi-Language Mutant Generation. In Proceedings of the 40th International Conference on Software Engineering: Companion Proceedings. ACM, Gothenburg, Sweden, 25–28. doi:10.1145/3183440.3183485
- [24] Soneya Binta Hossain and Matthew B. Dwyer. 2022. A Brief Survey on Oracle-based Test Adequacy Metrics. doi:10.48550/ARXIV.2212.06118
- [25]
- [26] Laura Inozemtseva and Reid Holmes. 2014. Coverage Is Not Strongly Correlated with Test Suite Effectiveness. In Proceedings of the 36th International Conference on Software Engineering. ACM, Hyderabad, India, 435–445. doi:10.1145/2568225.2568271
- [27] Ivar Jacobson. 2005. Object-Oriented Software Engineering: A Use Case Driven Approach. Addison-Wesley, Reading, Mass.
- [28] Lukas Jakob. 2026. lujakob/nestjs-realworld-example-app.
- [29] Leon Kogler, Maximilian Ehrhart, Benedikt Dornauer, and Eduard Paul Enoiu.
- [30]
- [31] Michael Konstantinou, Renzo Degiovanni, and Mike Papadakis. 2024. Do LLMs Generate Test Oracles That Capture the Actual or the Expected Program Behaviour? doi:10.48550/ARXIV.2410.21136
- [32]
- [33] Alberto Martin-Lopez, Sergio Segura, and Antonio Ruiz-Cortés. 2021. RESTest: Automated Black-Box Testing of RESTful Web APIs. In Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis. ACM, Virtual, Denmark, 682–685. doi:10.1145/3460319.3469082
- [34] Mohammed Mudassir and Mohammed Mushtaq. 2024. The Role of APIs in Modern Software Development. World Journal of Advanced Engineering Technology and Sciences 13, 1 (Oct. 2024), 1045–1047. doi:10.30574/wjaets.2024.13.1.0515
- [35]
- [36] Mike Papadakis, Marinos Kintis, Jie Zhang, Yue Jia, Yves Le Traon, and Mark Harman. 2019. Mutation Testing Advances: An Analysis and Survey. In Advances in Computers. Vol. 112. Elsevier, 275–378. doi:10.1016/bs.adcom.2018.03.015
- [37] Mike Papadakis, Donghwan Shin, Shin Yoo, and Doo-Hwan Bae. 2018. Are Mutation Scores Correlated with Real Fault Detection? A Large Scale Empirical Study on the Relationship between Mutants and Real Faults. In Proceedings of the 40th International Conference on Software Engineering. ACM, Gothenburg, Sweden, 537–548. doi:10.1145/3180155.3180183
- [38] André Pereira, Bruno Lima, and João Pascoal Faria. 2024. APITestGenie: Automated API Test Generation through Generative AI. doi:10.48550/ARXIV.2409.03838
- [39] Postman. [n. d.]. State of the API Report. Technical Report.
- [40] H. G. Rice. 1953. Classes of Recursively Enumerable Sets and Their Decision Problems. Trans. Amer. Math. Soc. 74, 2 (1953), 358–366. doi:10.1090/S0002-9947-1953-0053041-6
- [41] Ana B. Sánchez, José A. Parejo, Sergio Segura, Amador Durán, and Mike Papadakis. 2024. Mutation Testing in Practice: Insights From Open-Source Software Developers. IEEE Transactions on Software Engineering 50, 5 (May 2024), 1130–. doi:10.1109/TSE.2024.3377378
- [43] A. Van Lamsweerde. 2000. Goal-Oriented Requirements Engineering: A Guided Tour. In Proceedings Fifth IEEE International Symposium on Requirements Engineering. IEEE Comput. Soc, Toronto, Ont., Canada, 249–262. doi:10.1109/ISRE.2001.948567
- [44] Dries Vanoverberghe, Jonathan De Halleux, Nikolai Tillmann, and Frank Piessens. 2012. State Coverage: Software Validation Metrics beyond Code Coverage. In SOFSEM 2012: Theory and Practice of Computer Science, Mária Bieliková, Gerhard Friedrich, Georg Gottlob, Stefan Katzenbeisser, and György Turán (Eds.). Vol. 7147. Springer Berlin Heidelberg, Berlin, Heidelberg, 542–553. doi:10.1007/978-3-642-27660-6_44
- [46] Karl E. Wiegers and Joy Beatty. 2013. Software Requirements (3rd ed., fully updated and expanded). Microsoft Press, Redmond, Wash.
- [47]
- [48] Ke Zhang, Chenxi Zhang, Chong Wang, Chi Zhang, YaChen Wu, Zhenchang Xing, Yang Liu, Qingshan Li, and Xin Peng. 2025. LogiAgent: Automated Logical Testing for REST Systems with LLM-Based Multi-Agents. doi:10.48550/ARXIV.2503.15079
- [49] Didar Zowghi and Vincenzo Gervasi. 2004. Erratum to "On the Interplay between Consistency, Completeness, and Correctness in Requirements Evolution". Information and Software Technology 46, 11 (Sept. 2004), 763–779. doi:10.1016/j.infsof.2004.03.003