pith. machine review for the scientific record.

arxiv: 2604.25862 · v1 · submitted 2026-04-28 · 💻 cs.SE · cs.AI

Recognition: unknown

RESTestBench: A Benchmark for Evaluating the Effectiveness of LLM-Generated REST API Test Cases from NL Requirements

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 15:53 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords REST API testing · LLM-generated tests · natural language requirements · mutation testing · test case generation · benchmark · software testing

The pith

RESTestBench shows that LLM-generated REST API tests lose effectiveness when the generator interacts with faulty code, especially under vague requirements.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RESTestBench, a benchmark with three REST services and manually verified natural language requirements in both precise and vague versions. It defines a requirements-based mutation testing metric to measure how well a generated test detects faults tied to each specific requirement. Evaluations compare plain LLM generation against refinement that lets the model query the running service, including mutated versions of it. The central finding is that effectiveness falls sharply when the code is faulty or mutated and the requirements are vague, sometimes removing any gain from refinement.
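As a concrete reading of the two evaluated modes, here is a minimal sketch of non-refinement versus refinement-based generation, under stated assumptions: the LLM call is stubbed, the names llm_generate, generate_once, generate_with_refinement, and BASE_URL are hypothetical rather than the paper's implementation, and the refinement loop assumes a SUT (valid or mutated) is actually running at BASE_URL.

```python
import requests  # assumes an HTTP-reachable SUT; any client library would do

BASE_URL = "http://localhost:8000"  # hypothetical SUT endpoint (valid or mutated build)

def llm_generate(prompt: str) -> str:
    """Stub standing in for an LLM call; a real run would query a model here."""
    return "def test_requirement(): ...  # test derived from the NL requirement"

def generate_once(requirement: str) -> str:
    # Non-refinement mode: the model sees only the NL requirement (plus, in the
    # paper's setup, the API description), never the running implementation.
    return llm_generate(f"Write a pytest test for this requirement:\n{requirement}")

def generate_with_refinement(requirement: str, steps: int = 2) -> str:
    # Refinement mode: between drafts the model observes the *running* service.
    # If that service is a mutant, drafts can drift toward the faulty behaviour,
    # which is the failure mode the benchmark is built to measure.
    test = generate_once(requirement)
    for _ in range(steps):
        observed = requests.get(f"{BASE_URL}/openapi.json").text[:500]
        test = llm_generate(
            f"Requirement:\n{requirement}\n"
            f"Observed SUT behaviour:\n{observed}\n"
            f"Previous test:\n{test}\nRefine the test."
        )
    return test
```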

Core claim

Using RESTestBench, we show that test effectiveness drops considerably when the generator interacts with faulty or mutated code, especially for vague requirements, sometimes negating the benefit of refinement and indicating that incorporating actual SUT behaviour is unnecessary when requirement detail is high.

What carries the argument

The requirements-based mutation testing metric, which scores how effectively a test case detects mutations that violate a given natural language requirement.
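Taking the definitions visible in Figure 2 at face value, the score for a requirement set is the number of detected (killed) requirement-violating mutants over the total mutant count. A minimal sketch, assuming each mutant targets one specific requirement (so per-requirement counts are disjoint) and using made-up detection records rather than the paper's data:

```python
from typing import Dict, Set

# Hypothetical detection records: for each requirement id, the set of mutant ids
# whose injected violation the requirement's generated test detects (fails on).
detected: Dict[str, Set[str]] = {
    "phi_1": {"m1", "m3"},
    "phi_2": {"m2"},
}
all_mutants: Set[str] = {"m1", "m2", "m3", "m4"}

def mutation_score(detected_by_req: Dict[str, Set[str]], mutants: Set[str]) -> float:
    # MS_phi = (sum over requirements phi_i of detected mutants k_phi_i) / |M|,
    # mirroring the aggregation shown in Figure 2.
    k_phi = sum(len(d) for d in detected_by_req.values())
    return k_phi / len(mutants)

print(mutation_score(detected, all_mutants))  # 0.75 for the toy data above
```

The refinement-based variants MS^valid_φ and MS^m_φ would apply the same formula to k^valid_φ and k^m_φ, the counts recorded when refinement ran against the valid and the mutated service respectively.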

If this is right

  • Precise natural language requirements yield higher test effectiveness than vague ones across the tested LLMs.
  • Refinement via interaction with the running service improves effectiveness only when requirements are vague and code is correct.
  • When requirements are detailed, direct generation without SUT interaction performs as well as or better than refinement.
  • Test effectiveness is sensitive to the presence of faults in the code observed during generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Teams could focus effort on writing precise requirements rather than building complex refinement loops for LLM test generators.
  • Applying the same benchmark and metric to non-REST APIs or different mutation operators would test whether the pattern generalizes.
  • The results suggest exploring whether static analysis or other non-execution context could substitute for runtime interaction in high-detail cases.

Load-bearing premise

The manually verified natural language requirements accurately and completely capture the intended functional behavior of the REST services, and the mutation metric correctly measures requirement-specific fault detection.

What would settle it

An experiment in which refinement on mutated code produces fault-detection rates for vague requirements that equal or exceed those of non-refined generation would falsify the reported drop in effectiveness.
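A minimal sketch of that check, with made-up per-requirement fault-detection rates standing in for measurements (none of these numbers come from the paper):

```python
# Hypothetical per-requirement fault-detection rates under vague requirements.
non_refined = [0.62, 0.55, 0.70]          # direct generation, no SUT contact
refined_on_mutant = [0.41, 0.38, 0.52]    # refinement loop ran against mutated code

def mean(xs):
    return sum(xs) / len(xs)

# The paper's claim predicts refined_on_mutant < non_refined; the reverse
# (equal or higher rates after refining on mutants) would falsify the drop.
if mean(refined_on_mutant) >= mean(non_refined):
    print("claim falsified: refinement on mutated code did not hurt")
else:
    print("consistent with the reported effectiveness drop")
```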

Figures

Figures reproduced from arXiv: 2604.25862 by Benedikt Dornauer, Leon Kogler, Maximilian Ehrhart, Peter Schrammel, Roland Wuersching, Stefan Hangler.

Figure 1. RESTestBench overview.
Figure 2. Visualization of the evaluation of k^valid_φi and k^m_φi for a single requirement φi in refinement-based test generation approaches. Summing over all requirements, k^m_φ = Σ_i k^m_φi and k^valid_φ = Σ_i k^valid_φi; the two refinement-based mutation scores are MS^valid_φ = k^valid_φ / |M| and MS^m_φ = k^m_φ / |M|.
Figure 3. Mutation scores for vague and precise requirements comparing single-step MS_φ (solid bars) with refinement-based scores MS^valid_φ (forward-hatched) and MS^m_φ (cross-hatched). Bars are stacked by service (FastAPI, RealWorld, TodoApp); segment height reflects weighted service contribution and segment percentages are per-service mutation scores.
Figure 4. Mean total generation cost (USD) per benchmark.
Original abstract

Existing REST API testing tools are typically evaluated using code coverage and crash-based fault metrics. However, recent LLM-based approaches increasingly generate tests from NL requirements to validate functional behaviour, making traditional metrics weak proxies for whether generated tests validate intended behaviour. To address this gap, we present RESTestBench, a benchmark comprising three REST services paired with manually verified NL requirements in both precise and vague variants, enabling controlled and reproducible evaluation of requirement-based test generation. RESTestBench further introduces a requirements-based mutation testing metric that measures the fault-detection effectiveness of a generated test case with respect to a specific requirement, extending the property-based approach of Bartocci et al. Using RESTestBench, we evaluate two approaches across multiple state-of-the-art LLMs: (i) non-refinement-based generation, and (ii) refinement-based generation guided by interaction with the running SUT. In the refinement experiments, RESTestBench assesses how exposure to the actual implementation, valid or mutated, affects test effectiveness. Our results show that test effectiveness drops considerably when the generator interacts with faulty or mutated code, especially for vague requirements, sometimes negating the benefit of refinement and indicating that incorporating actual SUT behaviour is unnecessary when requirement detail is high.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces RESTestBench, a benchmark with three REST services and manually verified NL requirements in precise and vague variants, plus a requirements-based mutation testing metric extending Bartocci et al.'s property-based approach. It evaluates LLM-based test generation (non-refinement vs. refinement with valid/mutated SUT interaction) and claims that effectiveness drops substantially on mutated code—especially for vague requirements—sometimes negating refinement benefits, while high-detail requirements render SUT interaction unnecessary.

Significance. If the metric and requirements hold, the work provides a needed functional-behavior-focused alternative to coverage/crash metrics for assessing LLM-generated REST tests from NL specs. The controlled precise/vague variants and valid/mutated SUT conditions are a clear strength for isolating when refinement helps or harms, with potential to guide practical tool design in API testing.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Evaluation): The central claims about effectiveness drops and negation of refinement benefits rest on empirical results, yet no details are supplied on the number of LLMs tested, total test cases generated per condition, statistical tests used, or exact scoring procedure for the requirements-based mutation metric (e.g., how mutants are chosen to target specific requirement violations). This leaves the reported differences unverifiable and load-bearing for the conclusions.
  2. [§3.2] §3.2 (Requirements-Based Mutation Testing Metric): The metric is presented as extending Bartocci et al. to quantify fault detection w.r.t. a specific NL requirement, but supplies no operational definition, no worked example of property extraction from vague versus precise text, and no description of mutant generation or scoring to ensure mutants correspond to requirement violations rather than unrelated faults. Without this, differences attributed to requirement detail or SUT mutation cannot be isolated from metric artifacts.
  3. [§3.1] §3.1 (NL Requirements): The benchmark relies on 'manually verified' requirements as ground truth for functional intent, yet reports no verification procedure, inter-rater checks, or evidence that the requirements are independent of the implementation. This directly affects the weakest assumption underlying the metric and the precise/vague comparison.
minor comments (3)
  1. [Abstract] The abstract would be clearer if it stated the exact number of services, LLMs, and test cases upfront rather than leaving them implicit.
  2. [§3.2] Notation for the mutation metric (e.g., how effectiveness is aggregated across requirements) should be defined explicitly in a dedicated subsection or equation.
  3. [§2] Related-work discussion of Bartocci et al. could include a brief comparison table of the original property-based approach versus the new requirements-based extension.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We appreciate the identification of areas where additional clarity and detail will strengthen the presentation of RESTestBench and its evaluation. We address each major comment below and commit to revisions that make the work more verifiable without altering the core contributions.

Point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Evaluation): The central claims about effectiveness drops and negation of refinement benefits rest on empirical results, yet no details are supplied on the number of LLMs tested, total test cases generated per condition, statistical tests used, or exact scoring procedure for the requirements-based mutation metric (e.g., how mutants are chosen to target specific requirement violations). This leaves the reported differences unverifiable and load-bearing for the conclusions.

    Authors: We agree that these experimental details are essential for verifiability and were insufficiently specified in the current draft. In the revised manuscript we will expand §4 (and update the abstract) to report: the precise LLMs and versions evaluated, the total number of test cases generated under each condition (non-refinement, refinement with valid SUT, refinement with mutated SUT), the statistical tests performed (including p-values and effect sizes), and a complete description of the mutation-metric scoring procedure, including the mutant-selection criteria used to target requirement violations. These additions will allow readers to reproduce and assess the reported effectiveness drops. revision: yes

  2. Referee: [§3.2] §3.2 (Requirements-Based Mutation Testing Metric): The metric is presented as extending Bartocci et al. to quantify fault detection w.r.t. a specific NL requirement, but supplies no operational definition, no worked example of property extraction from vague versus precise text, and no description of mutant generation or scoring to ensure mutants correspond to requirement violations rather than unrelated faults. Without this, differences attributed to requirement detail or SUT mutation cannot be isolated from metric artifacts.

    Authors: We acknowledge that the current description of the metric remains too high-level. We will revise §3.2 to include an operational definition of the requirements-based mutation score, a concrete worked example that extracts properties from both a vague and a precise requirement, the mutant-generation process (including the operators chosen and how they are aligned with specific requirement violations), and the exact scoring rules. These additions will clarify how the metric isolates the effects of requirement detail and SUT mutation from unrelated faults. An illustrative sketch of such a worked example appears after these responses. revision: yes

  3. Referee: [§3.1] §3.1 (NL Requirements): The benchmark relies on 'manually verified' requirements as ground truth for functional intent, yet reports no verification procedure, inter-rater checks, or evidence that the requirements are independent of the implementation. This directly affects the weakest assumption underlying the metric and the precise/vague comparison.

    Authors: We agree that the verification process for the NL requirements must be documented. In the revised §3.1 we will describe the manual verification steps performed, any consistency checks applied (including inter-rater procedures if multiple reviewers were involved), and the measures taken to ensure requirements were derived from specifications and documentation independently of the implementation. If formal inter-rater metrics were not computed in the original process, we will either add them or explicitly note this as a limitation while still providing the verification protocol used. revision: yes
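To illustrate the kind of worked example promised in response 2, here is Pith's own construction, not the authors' artifact: the endpoint (/todos), payloads, status codes, and BASE_URL are hypothetical, and both tests assume a running SUT. A precise requirement pins down a checkable property; a vague one leaves the oracle underdetermined.

```python
import requests  # assumes a SUT reachable at BASE_URL

BASE_URL = "http://localhost:8000"  # hypothetical service under test

# Precise requirement: "POST /todos with an empty title returns 422 and does
# not create a resource." The property (status code + no side effect) is explicit.
def test_empty_title_rejected():
    before = len(requests.get(f"{BASE_URL}/todos").json())
    resp = requests.post(f"{BASE_URL}/todos", json={"title": ""})
    assert resp.status_code == 422
    assert len(requests.get(f"{BASE_URL}/todos").json()) == before

# Vague requirement: "The API validates todo input." The oracle is
# underdetermined: a generator may pick any plausible status code or skip the
# side-effect check, so a mutant that weakens validation can slip through.
def test_input_validated_somehow():
    resp = requests.post(f"{BASE_URL}/todos", json={"title": ""})
    assert resp.status_code >= 400  # weak oracle: any error counts
```

Under this framing, the requirements-based mutation metric would reward the first test and expose the second's weak oracle.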

Circularity Check

0 steps flagged

No circularity: empirical benchmark and metric evaluation is self-contained

full rationale

The paper introduces RESTestBench as a new benchmark with manually verified NL requirements (precise/vague variants) and defines a requirements-based mutation testing metric as an explicit extension of the external property-based approach from Bartocci et al. Central claims about test effectiveness under refinement (with/without mutated SUT) are obtained directly from controlled experiments on three REST services using state-of-the-art LLMs; no equations, parameters, or results are fitted to the target outcomes and then relabeled as predictions. The single external citation is not self-referential, does not bear the uniqueness or definitional load, and is not required to close any derivation loop. The work is therefore an independent empirical contribution whose validity rests on the experimental design rather than on any reduction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on the validity of the newly introduced benchmark and metric, with key domain assumptions about requirement accuracy and metric effectiveness; no free parameters or invented entities with independent evidence are introduced.

axioms (2)
  • domain assumption Manually verified NL requirements accurately represent the intended functional behavior of the REST services
    The entire evaluation and metric depend on these requirements being correct for both precise and vague variants.
  • domain assumption The requirements-based mutation testing metric correctly measures fault-detection effectiveness for a specific requirement
    This is the core extension of the property-based approach from Bartocci et al. used to support the results.
invented entities (2)
  • RESTestBench benchmark no independent evidence
    purpose: To enable controlled and reproducible evaluation of requirement-based LLM test generation
    Newly proposed in this work with three specific services and requirement variants.
  • requirements-based mutation testing metric no independent evidence
    purpose: To measure how well a generated test validates a specific requirement by detecting related mutations
    Introduced as an extension of prior property-based testing to address the evaluation gap.

pith-pipeline@v0.9.0 · 5540 in / 1533 out tokens · 85282 ms · 2026-05-07T15:53:13.844153+00:00 · methodology


Reference graph

Works this paper leans on

48 extracted references · 31 canonical work pages

  [1] [n. d.]. Realworld-Apps/Realworld: "The Mother of All Demo Apps" — Exemplary Fullstack Medium.Com Clone Powered by React, Angular, Node, Django, and Many More. https://github.com/realworld-apps/realworld?tab=readme-ov-file
  [2] 2026. Fastapi/Full-Stack-Fastapi-Template. FastAPI.
  [3] Juan C. Alonso, Sergio Segura, and Antonio Ruiz-Cortés. 2023. AGORA: Automated Generation of Test Oracles for REST APIs. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis. ACM, Seattle, WA, USA, 1018–1030. doi:10.1145/3597926.3598114
  [4] Paul Ammann and Jeff Offutt. 2017. Introduction to Software Testing (2nd ed.). Cambridge University Press, Cambridge.
  [5] Henry Andrews. 2026. OpenAPI Specification.
  [6] Andrea Arcuri. 2021. Automated Black- and White-Box Testing of RESTful APIs With EvoMaster. IEEE Software 38, 3 (May 2021), 72–78. doi:10.1109/MS.2020.3013820
  [7] Chetan Arora, Tomas Herda, and Verena Homm. 2024. Generating Test Scenarios from NL Requirements Using Retrieval-Augmented LLMs: An Industrial Study. In 2024 IEEE 32nd International Requirements Engineering Conference (RE). 240–251. arXiv:2404.12772 [cs]. doi:10.1109/RE59067.2024.00031
  [8] Vaggelis Atlidakis, Patrice Godefroid, and Marina Polishchuk. 2019. RESTler: Stateful REST API Fuzzing. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, Montreal, QC, Canada, 748–758. doi:10.1109/ICSE.2019.00083
  [9] Cristian Augusto, Antonia Bertolino, Guglielmo De Angelis, Francesca Lonetti, and Jesús Morán. 2025. Large Language Models for Software Testing: A Research Roadmap. doi:10.48550/ARXIV.2509.25043
  [10]/[11] Earl T. Barr, Mark Harman, Phil McMinn, Muzammil Shahbaz, and Shin Yoo. 2015. The Oracle Problem in Software Testing: A Survey. IEEE Transactions on Software Engineering 41, 5 (May 2015), 507–525. doi:10.1109/TSE.2014.2372785
  [12] Thiago Barradas, Aline Paes, and Vânia de Oliveira Neves. 2025. Combining TSL and LLM to Automate REST API Testing: A Comparative Study. doi:10.48550/ARXIV.2509.05540
  [13] Ezio Bartocci, Leonardo Mariani, Dejan Ničković, and Drishti Yadav. 2023. Property-Based Mutation Testing. In 2023 IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, Dublin, Ireland, 222–233. doi:10.1109/ICST57152.2023.00029
  [14] Alistair Cockburn. 2012. Writing Effective Use Cases (24th printing ed.). Addison-Wesley, Boston.
  [15] Henry Coles, Thomas Laurent, Christopher Henard, Mike Papadakis, and Anthony Ventresque. 2016. PIT: A Practical Mutation Testing Tool for Java (Demo). In Proceedings of the 25th International Symposium on Software Testing and Analysis. ACM, Saarbrücken, Germany, 449–452. doi:10.1145/2931037.2948707
  [16] Alix Decrop, Sara Eraso, Xavier Devroey, and Gilles Perrouin. 2025. A Public Benchmark of REST APIs. In 2025 IEEE/ACM 22nd International Conference on Mining Software Repositories (MSR). IEEE, Ottawa, ON, Canada, 421–433. doi:10.1109/MSR66628.2025.00072
  [17]/[18] Serge Demeyer, Ali Parsai, Sten Vercammen, Brent Van Bladel, and Mehrdad Abdi. Formal Verification of Developer Tests: A Research Agenda Inspired by Mutation Testing. In Leveraging Applications of Formal Methods, Verification and Validation: Engineering Principles, Tiziana Margaria and Bernhard Steffen (Eds.). Vol. 12477. Springer International Publishing, Cham, 9–24. doi:10.1007/978-3-030-61470-6_2
  [19] David Fowler. 2026. Davidfowl/TodoApp.
  [21] Amid Golmohammadi, Man Zhang, and Andrea Arcuri. 2024. Testing RESTful APIs: A Survey. ACM Transactions on Software Engineering and Methodology 33, 1 (Jan. 2024), 1–41. doi:10.1145/3617175
  [22]/[23] Alex Groce, Josie Holmes, Darko Marinov, August Shi, and Lingming Zhang. An Extensible, Regular-Expression-Based Tool for Multi-Language Mutant Generation. In Proceedings of the 40th International Conference on Software Engineering: Companion Proceedings. ACM, Gothenburg, Sweden, 25–28. doi:10.1145/3183440.3183485
  [24] Soneya Binta Hossain and Matthew B. Dwyer. 2022. A Brief Survey on Oracle-based Test Adequacy Metrics. doi:10.48550/ARXIV.2212.06118
  [25] Linghan Huang, Peizhou Zhao, Huaming Chen, and Lei Ma. 2024. On the Challenges of Fuzzing Techniques via Large Language Models. doi:10.48550/ARXIV.2402.00350
  [26] Laura Inozemtseva and Reid Holmes. 2014. Coverage Is Not Strongly Correlated with Test Suite Effectiveness. In Proceedings of the 36th International Conference on Software Engineering. ACM, Hyderabad, India, 435–445. doi:10.1145/2568225.2568271
  [27] Ivar Jacobson. 2005. Object-Oriented Software Engineering: A Use Case Driven Approach. Addison-Wesley, Reading, Mass.
  [28] Lukas Jakob. 2026. Lujakob/Nestjs-Realworld-Example-App.
  [29]/[30] Leon Kogler, Maximilian Ehrhart, Benedikt Dornauer, and Eduard Paul Enoiu. RESTifAI: LLM-Based Workflow for Reusable REST API Testing. doi:10.48550/ARXIV.2512.08706
  [31] Michael Konstantinou, Renzo Degiovanni, and Mike Papadakis. 2024. Do LLMs Generate Test Oracles That Capture the Actual or the Expected Program Behaviour? doi:10.48550/ARXIV.2410.21136
  [32] Nan Li and Jeff Offutt. 2017. Test Oracle Strategies for Model-Based Testing. IEEE Transactions on Software Engineering 43, 4 (April 2017), 372–395. doi:10.1109/TSE.2016.2597136
  [33] Alberto Martin-Lopez, Sergio Segura, and Antonio Ruiz-Cortés. 2021. RESTest: Automated Black-Box Testing of RESTful Web APIs. In Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis. ACM, Virtual, Denmark, 682–685. doi:10.1145/3460319.3469082
  [34] Mohammed Mudassir and Mohammed Mushtaq. 2024. The Role of APIs in Modern Software Development. World Journal of Advanced Engineering Technology and Sciences 13, 1 (Oct. 2024), 1045–1047. doi:10.30574/wjaets.2024.13.1.0515
  [35] Rangeet Pan, Raju Pavuluri, Ruikai Huang, Rahul Krishna, Tyler Stennett, Alessandro Orso, and Saurabh Sinha. 2025. SAINT: Service-level Integration Test Generation with Program Analysis and LLM-based Agents. doi:10.48550/ARXIV.2511.13305
  [36] Mike Papadakis, Marinos Kintis, Jie Zhang, Yue Jia, Yves Le Traon, and Mark Harman. 2019. Mutation Testing Advances: An Analysis and Survey. In Advances in Computers. Vol. 112. Elsevier, 275–378. doi:10.1016/bs.adcom.2018.03.015
  [37] Mike Papadakis, Donghwan Shin, Shin Yoo, and Doo-Hwan Bae. 2018. Are Mutation Scores Correlated with Real Fault Detection? A Large Scale Empirical Study on the Relationship between Mutants and Real Faults. In Proceedings of the 40th International Conference on Software Engineering. ACM, Gothenburg, Sweden, 537–548. doi:10.1145/3180155.3180183
  [38] André Pereira, Bruno Lima, and João Pascoal Faria. 2024. APITestGenie: Automated API Test Generation through Generative AI. doi:10.48550/ARXIV.2409.03838
  [39] Postman. [n. d.]. State of the API Report. Technical Report.
  [40] H. G. Rice. 1953. Classes of Recursively Enumerable Sets and Their Decision Problems. Trans. Amer. Math. Soc. 74, 2 (1953), 358–366. doi:10.1090/S0002-9947-1953-0053041-6
  [41]/[42] Ana B. Sánchez, José A. Parejo, Sergio Segura, Amador Durán, and Mike Papadakis. 2024. Mutation Testing in Practice: Insights From Open-Source Software Developers. IEEE Transactions on Software Engineering 50, 5 (May 2024), 1130–. doi:10.1109/TSE.2024.3377378
  [43] A. Van Lamsweerde. 2000. Goal-Oriented Requirements Engineering: A Guided Tour. In Proceedings Fifth IEEE International Symposium on Requirements Engineering. IEEE Comput. Soc., Toronto, Ont., Canada, 249–262. doi:10.1109/ISRE.2001.948567
  [44]/[45] Dries Vanoverberghe, Jonathan De Halleux, Nikolai Tillmann, and Frank Piessens. State Coverage: Software Validation Metrics beyond Code Coverage. In SOFSEM 2012: Theory and Practice of Computer Science, Mária Bieliková, Gerhard Friedrich, Georg Gottlob, Stefan Katzenbeisser, and György Turán (Eds.). Vol. 7147. Springer Berlin Heidelberg, Berlin, Heidelberg, 542–553. doi:10.1007/978-3-642-27660-6_44
  [46] Karl E. Wiegers and Joy Beatty. 2013. Software Requirements (3rd ed., fully updated and expanded). Microsoft Press, Redmond, Wash.
  [47] Zhenzhen Yang, Rubing Huang, Chenhui Cui, Nan Niu, and Dave Towey. 2025. Requirements-Based Test Generation: A Comprehensive Survey. doi:10.48550/ARXIV.2505.02015
  [48] Ke Zhang, Chenxi Zhang, Chong Wang, Chi Zhang, YaChen Wu, Zhenchang Xing, Yang Liu, Qingshan Li, and Xin Peng. 2025. LogiAgent: Automated Logical Testing for REST Systems with LLM-Based Multi-Agents. doi:10.48550/ARXIV.2503.15079
  [49] Didar Zowghi and Vincenzo Gervasi. 2004. Erratum to "On the Interplay between Consistency, Completeness, and Correctness in Requirements Evolution". Information and Software Technology 46, 11 (Sept. 2004), 763–779. doi:10.1016/j.infsof.2004.03.003