Multi-Agent LLM-based Metamorphic Testing for REST APIs

Abdullah Mughees; Dragos Truscan; Gaadha Sudheerbabu; Shehroz Khan; Tanwir Ahmad

arxiv: 2605.28321 · v1 · pith:BTWL6HNLnew · submitted 2026-05-27 · 💻 cs.SE · cs.AI

Shehroz Khan, Abdullah Mughees, Gaadha Sudheerbabu, Tanwir Ahmad, Dragos Truscan

Pith reviewed 2026-06-29 10:50 UTC · model grok-4.3

keywords metamorphic testing · REST APIs · LLM agents · multi-agent workflow · OpenAPI · test oracle problem · API testing · software quality

The pith

ARMeta uses a multi-agent LLM workflow to generate metamorphic tests that complement scenario-based testing for REST APIs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ARMeta, a tool that applies metamorphic testing to REST APIs specified in OpenAPI by using multiple LLM agents to create test scenarios. Metamorphic testing checks consistency between related outputs instead of requiring a known correct result for each call. The agents identify relations, express them in Given-When-Then format, turn them into runnable tests, and execute those tests on the target system. When run on two public web applications, the generated tests uncovered behaviors not reached by a standard scenario-based testing method. A reader would care because many REST APIs lack clear test oracles, making it hard to know whether an output is right.
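
To make the idea of checking consistency between related outputs concrete, the following is a minimal sketch of one metamorphic relation. It assumes a hypothetical listing endpoint, simulated here by an in-memory function `list_pets`; the data, the function names, and the relation itself are illustrative assumptions and are not taken from ARMeta. The relation says that a filtered listing must be a subset of the unfiltered one, which can be checked without knowing the correct listing in advance.

```python
from typing import Dict, List, Optional

# Hypothetical in-memory stand-in for a listing endpoint such as GET /pets?status=...
# The records below are invented for illustration only.
PETS: List[Dict[str, object]] = [
    {"id": 1, "name": "Rex", "status": "available"},
    {"id": 2, "name": "Tom", "status": "sold"},
    {"id": 3, "name": "Ada", "status": "available"},
]


def list_pets(status: Optional[str] = None) -> List[Dict[str, object]]:
    """Simulate the API call; a real test would issue an HTTP request instead."""
    if status is None:
        return list(PETS)
    return [pet for pet in PETS if pet["status"] == status]


def subset_relation_holds(status: str) -> bool:
    """Metamorphic relation: the filtered result is a subset of the unfiltered result."""
    source_ids = {pet["id"] for pet in list_pets()}      # source test case
    follow_up = list_pets(status)                         # follow-up test case
    ids_are_subset = {pet["id"] for pet in follow_up} <= source_ids
    filter_respected = all(pet["status"] == status for pet in follow_up)
    return ids_are_subset and filter_respected


if __name__ == "__main__":
    print(subset_relation_holds("available"))
```

Neither call needs an expected response body; only the relation between the two responses is asserted, which is what sidesteps the oracle problem.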

Core claim

ARMeta is a tool-supported approach that uses an LLM-based multi-agent workflow to support metamorphic testing of REST APIs documented with OpenAPI. The agentic workflow is used to identify metamorphic test scenarios and specify them in the Given-When-Then format. These scenarios are automatically implemented as executable tests and executed against the system under test. We evaluate ARMeta on two publicly available web applications that expose REST interfaces and compare its performance with a scenario-based testing baseline. The results show that ARMeta explores behaviors that serve as a complement to existing scenario-based testing approaches.
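
As a rough illustration of how a Given-When-Then scenario can become an executable test, the sketch below models a scenario as three lists of step functions that share a context dictionary. The `Scenario` class, the step names, and the in-memory user store are all hypothetical; the paper's reference list points to the Behave framework, but this sketch deliberately uses only the standard library and does not reproduce ARMeta's generated code.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Context = Dict[str, Any]
Step = Callable[[Context], None]


@dataclass
class Scenario:
    """A Given-When-Then scenario whose steps share one mutable context."""
    name: str
    given: List[Step] = field(default_factory=list)
    when: List[Step] = field(default_factory=list)
    then: List[Step] = field(default_factory=list)

    def run(self) -> bool:
        """Run all steps in order; a failed assertion marks the scenario as failed."""
        ctx: Context = {}
        try:
            for step in self.given + self.when + self.then:
                step(ctx)
        except AssertionError:
            return False
        return True


def given_user_exists(ctx: Context) -> None:
    # Hypothetical stand-in for creating a user through the API.
    ctx["store"] = {1: {"id": 1, "name": "alice"}}


def when_user_fetched_twice(ctx: Context) -> None:
    # Source and follow-up calls of the same read operation.
    ctx["first"] = dict(ctx["store"][1])
    ctx["second"] = dict(ctx["store"][1])


def then_responses_match(ctx: Context) -> None:
    # Relation: repeating a read without intervening writes yields equal responses.
    first, second = ctx.get("first"), ctx.get("second")
    assert first is not None and first == second, "GET responses differ"


scenario = Scenario(
    name="Repeated read returns the same user",
    given=[given_user_exists],
    when=[when_user_fetched_twice],
    then=[then_responses_match],
)

if __name__ == "__main__":
    print(scenario.name, "passed" if scenario.run() else "failed")
```

Keeping the steps as plain functions means the same scenario text could later be bound to real HTTP calls without changing the runner.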

What carries the argument

LLM-based multi-agent workflow that identifies metamorphic relations and encodes them as executable Given-When-Then tests for OpenAPI-described REST APIs.

If this is right

Metamorphic testing can be applied to REST APIs even when no explicit oracle exists for individual responses.
The generated tests run automatically once the scenarios are turned into code.
The method can be compared directly against scenario-based testing on any OpenAPI-documented service.
Additional behaviors become reachable without manual definition of every test case.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the agent workflow scales without extra human review, manual effort for writing metamorphic relations could drop for teams that already use OpenAPI specs.
The same multi-agent pattern might transfer to property-based testing or consistency checks in other API styles such as GraphQL.
Integration into CI pipelines would allow repeated runs to catch behavioral drift after code changes.

Load-bearing premise

The LLM agents reliably produce valid metamorphic relations that do not introduce false positives or miss critical behaviors when applied to real REST APIs.

What would settle it

An execution of ARMeta on one of the two evaluated applications in which a generated metamorphic relation produces a false positive or fails to flag a known fault in the API would show the approach does not reliably complement the baseline.

Figures

Figures reproduced from arXiv: 2605.28321 by Abdullah Mughees, Dragos Truscan, Gaadha Sudheerbabu, Shehroz Khan, Tanwir Ahmad.

**Figure 2.** High-level multi-agent architecture. The approach is iterative and incremental. A test generation session is composed of several iterations. In each iteration, a set of metamorphic tests is generated and executed. If the stopping criteria for test generation are not met, a subsequent iteration is initiated. Concretely, the Test Manager takes the REST API specification and the test generation parameters as inpu…
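
The iterative session described in the Figure 2 caption can be sketched as a simple loop. The caption does not state the stopping criteria, so the two used here (an iteration budget and an iteration that yields no new tests) are assumptions, as are the names `run_session`, `stub_generate`, and `stub_execute`. The stubs stand in for the LLM agents and the test executor.

```python
from typing import Callable, List, Set, Tuple

Generate = Callable[[int, Set[str]], List[str]]
Execute = Callable[[str], bool]


def run_session(generate: Generate, execute: Execute,
                max_iterations: int) -> List[Tuple[str, bool]]:
    """Generate and execute tests iteration by iteration until a stop condition holds."""
    results: List[Tuple[str, bool]] = []
    seen: Set[str] = set()
    for iteration in range(max_iterations):      # assumed criterion 1: iteration budget
        new_tests = [t for t in generate(iteration, seen) if t not in seen]
        if not new_tests:                         # assumed criterion 2: nothing new
            break
        for test in new_tests:
            seen.add(test)
            results.append((test, execute(test)))
    return results


def stub_generate(iteration: int, seen: Set[str]) -> List[str]:
    """Placeholder for the agent workflow: returns fixed test names per iteration."""
    pool = [
        ["mr_subset", "mr_idempotent_get"],
        ["mr_idempotent_get", "mr_delete_then_404"],
    ]
    return pool[iteration] if iteration < len(pool) else []


def stub_execute(test: str) -> bool:
    """Placeholder executor: pretends one relation is violated."""
    return test != "mr_delete_then_404"


if __name__ == "__main__":
    for name, passed in run_session(stub_generate, stub_execute, max_iterations=5):
        print(name, "passed" if passed else "failed")
```

Passing the set of already generated tests back into the generator is one plausible way to keep later iterations incremental rather than repetitive.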

**Figure 3.** ARMeta tool displaying the passed and failed scenarios.

Original abstract

As REST APIs become an increasingly significant part of software systems, their validation is becoming more critical. Hence, testing and uncovering underlying issues are of utmost importance for improving software quality. However, testing REST APIs is challenging mainly due to the difficulty of assessing whether the output of an API call is correct, i.e., the test oracle problem. Metamorphic testing is a specification-based testing approach for situations where correct outputs are unknown or not specified explicitly. To check the correctness of a system, relations between the different outputs are specified. We present ARMeta, a tool-supported approach that uses an LLM-based multi-agent workflow to support metamorphic testing of REST APIs documented with OpenAPI. The agentic workflow is used to identify metamorphic test scenarios and specify them in the Given-When-Then format. These scenarios are automatically implemented as executable tests and executed against the system under test. We evaluate ARMeta on two publicly available web applications that expose REST interfaces and compare its performance with a scenario-based testing baseline. The results show that ARMeta explores behaviors that serve as a complement to existing scenario-based testing approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ARMeta gives a concrete multi-agent LLM workflow for metamorphic REST API testing that complements scenario-based methods on the two apps tested, but the evaluation details needed to trust the relations are missing.

The main thing here is a practical workflow called ARMeta that breaks metamorphic relation generation into LLM agents, outputs Given-When-Then scenarios from an OpenAPI spec, and turns them into runnable tests. That specific agent orchestration plus the end-to-end automation for REST services is the new piece; prior metamorphic testing work and LLM test generators do not line up exactly on this combination.

It does a clean job of tackling the oracle problem for APIs where expected outputs are hard to specify, and the automation from spec to execution is a real engineering win. The comparison to a scenario-based baseline on two public applications is straightforward and shows extra behaviors surfaced.

The soft spots are in the evaluation. The abstract and summary give no information on how the generated relations were checked for validity, whether they produce false positives, what the precise metrics were, or how the baseline was matched. Two applications is a narrow base for claiming complementarity, and the whole result rests on the unshown assumption that the LLM agents produce reliable relations. Without those checks the central claim stays provisional.

This is for people working on automated testing of web APIs who already use OpenAPI specs and want to extend coverage beyond hand-written scenarios. A testing researcher or practitioner could pull the workflow and try it on their own services.

It is worth sending to peer review. The implementation is concrete and the empirical angle is there; referees can ask for the missing validation steps and broader experiments.

Referee Report

2 major / 0 minor

Summary. The paper presents ARMeta, a multi-agent LLM workflow that generates metamorphic test scenarios in Given-When-Then format from OpenAPI specifications for REST APIs, automatically implements them as executable tests, and evaluates the approach on two public web applications against a scenario-based testing baseline, claiming that the generated behaviors serve as a complement to existing methods.

Significance. If the empirical results hold after proper validation of the metamorphic relations and fair baseline comparison, the work could advance automated testing for systems with difficult oracles by reducing the manual burden of specifying relations and demonstrating practical complementarity in REST API testing.

major comments (2)

[Abstract] Abstract: the central claim of complementarity rests on an empirical comparison, but the abstract (and by extension the evaluation) provides no details on how the generated metamorphic relations were validated for correctness, what metrics quantified the explored behaviors, or how the scenario-based baseline was matched in scope and effort.
[Evaluation] Evaluation (implied by abstract): without reported checks for false positives introduced by LLM-generated relations or coverage of critical behaviors missed by the baseline, the complementarity result cannot be assessed for reliability on real REST APIs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater transparency in our empirical claims. We address each major comment below and commit to revisions that strengthen the presentation of our validation and comparison methods without altering the core contributions.

Referee: [Abstract] Abstract: the central claim of complementarity rests on an empirical comparison, but the abstract (and by extension the evaluation) provides no details on how the generated metamorphic relations were validated for correctness, what metrics quantified the explored behaviors, or how the scenario-based baseline was matched in scope and effort.

Authors: We agree that the abstract, as a concise summary, should better support the complementarity claim with high-level indicators of the underlying methods. The current evaluation describes the workflow and results on the two applications but does not explicitly detail validation steps, specific metrics, or baseline matching criteria in one place. In the revision we will update the abstract with a brief clause noting expert review for relation correctness, use of coverage and relation-count metrics, and scope matching via identical OpenAPI inputs and comparable scenario volumes; the evaluation section will be expanded with a dedicated paragraph consolidating these elements.

revision: yes

Referee: [Evaluation] Evaluation (implied by abstract): without reported checks for false positives introduced by LLM-generated relations or coverage of critical behaviors missed by the baseline, the complementarity result cannot be assessed for reliability on real REST APIs.

Authors: We acknowledge that explicit reporting of false-positive mitigation and differential coverage analysis would improve assessability of the complementarity result. The manuscript currently reports aggregate outcomes and qualitative complementarity but does not include a systematic false-positive audit or enumeration of behaviors uniquely missed by the baseline. We will add these in revision by describing the manual filtering process used to discard invalid relations and by including a table or subsection contrasting unique state transitions and endpoint behaviors covered by ARMeta versus the baseline.

revision: yes
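
A differential coverage comparison of the kind discussed in this exchange could be computed with plain set operations. The sketch below is only an illustration: the `Behavior` encoding as (method, path, status code), the function `compare_coverage`, and the sample observations are hypothetical placeholders and are not results from the paper.

```python
from typing import Dict, Set, Tuple

# Assumed encoding of an observed behavior: (HTTP method, path template, status code).
Behavior = Tuple[str, str, int]


def compare_coverage(armeta: Set[Behavior],
                     baseline: Set[Behavior]) -> Dict[str, Set[Behavior]]:
    """Split observed behaviors into those unique to each approach and those shared."""
    return {
        "only_armeta": armeta - baseline,
        "only_baseline": baseline - armeta,
        "shared": armeta & baseline,
    }


# Made-up placeholder observations, NOT measurements reported by the authors.
ARMETA_RUN: Set[Behavior] = {
    ("GET", "/users/{id}", 200),
    ("DELETE", "/users/{id}", 404),
}
BASELINE_RUN: Set[Behavior] = {
    ("GET", "/users/{id}", 200),
    ("POST", "/users", 201),
}

if __name__ == "__main__":
    report = compare_coverage(ARMETA_RUN, BASELINE_RUN)
    for bucket in sorted(report):
        print(bucket, sorted(report[bucket]))
```

A non-empty "only_armeta" bucket alongside a non-empty "only_baseline" bucket is what a complementarity claim would need to show, provided the relations behind the unique behaviors have been validated.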

Circularity Check

0 steps flagged

No significant circularity identified

The paper describes an empirical tool (ARMeta) and its evaluation on two public REST applications against a scenario-based baseline. The central claim of complementarity rests on observed behaviors in that external comparison, with no equations, fitted parameters, self-defined quantities, or load-bearing self-citations that reduce the result to its own inputs by construction. The derivation chain is self-contained as a standard empirical study rather than a mathematical or definitional reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that LLM agents can be prompted to produce correct metamorphic relations without systematic bias or hallucination; no free parameters or invented physical entities are introduced.

axioms (1)

domain assumption: LLM agents can be orchestrated to produce valid Given-When-Then metamorphic scenarios for REST APIs.
Invoked in the description of the agentic workflow; no independent verification of agent output correctness is described in the abstract.
